[Libreoffice-bugs] [Bug 104597] Text runs of RTL scripts (e.g. Arabic, Hebrew, Persian) from imported PDF are reversed, PDFIProcessor::mirrorString not behaving

bugzilla-daemon Wed, 14 Jul 2021 21:58:58 -0700

https://bugs.documentfoundation.org/show_bug.cgi?id=104597


--- Comment #45 from Kevin Suo <suokunl...@126.com> ---
(In reply to V Stuart Foote from comment #44)

Well, below are my observations from the past few days. I may be wrong, but I
think they will be helpful to anyone who is interested:

1. The "mirrorString" function is never hit because isRTL is always false.

2. "isRTL" is always false because the following code never sets it to true:
'''
    if( xCC.is() )
    {
        for(int i=1; i< elem.Text.getLength(); i++)
        {
            css::i18n::DirectionProperty nType = static_cast<css::i18n::DirectionProperty>(
                xCC->getCharacterDirection( str, i ));
            if ( nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT           ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_ARABIC    ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_EMBEDDING ||
                 nType == css::i18n::DirectionProperty_RIGHT_TO_LEFT_OVERRIDE
                )
                isRTL = true;
        }
    }
'''
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#110

3. This "if" block never sets isRTL to true because "getCharacterDirection" in
XCharacterClassification needs nPos to be smaller than the length of the
OUString, whereas only one Arabic character is passed to it. With a length of
1, the loop's starting nPos of 1 already equals the length, and the function
returns 0 directly for such an nPos without doing any detection, so the
direction of a single Arabic character is never detected here.
See:
https://opengrok.libreoffice.org/xref/core/i18npool/source/characterclassification/cclass_unicode.cxx?#128

4. The reason why only one Arabic character is passed to getCharacterDirection
may be that the sdext pdfimport code fails to combine those single characters
(as produced by the xpdfimport process) into a string.
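
The interaction described in points 2 and 3 can be reproduced with a small
standalone model. This is only a sketch and does not use the real UNO API:
"directionAt" is a hypothetical stand-in for getCharacterDirection that assumes
the behaviour described above (return 0 when the position is not inside the
string) and uses a crude code-point range test instead of the real Unicode
classification. "looksRTL" copies the loop shape of the quoted code, including
the start index of 1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for XCharacterClassification::getCharacterDirection.
// Assumption: it returns 0 when pos is not inside the string.
// Return values are simplified: 0 = not right-to-left, 1 = right-to-left.
int directionAt(const std::u32string& text, std::size_t pos)
{
    if (pos >= text.size())
        return 0;
    const char32_t c = text[pos];
    // Crude range check covering only the Hebrew and Arabic blocks.
    const bool rtl = (c >= 0x0590 && c <= 0x06FF);
    return rtl ? 1 : 0;
}

// Same loop shape as the quoted code: the index starts at 1, not 0.
bool looksRTL(const std::u32string& text)
{
    bool isRTL = false;
    for (std::size_t i = 1; i < text.size(); ++i)
    {
        if (directionAt(text, i) == 1)
            isRTL = true;
    }
    return isRTL;
}
```

In this model a one-character string never enters the loop body, so looksRTL
returns false for a single Arabic character, while a string of two or more
Arabic characters is detected as RTL.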

Below is the sample ODF XML stream generated by the sdext pdfimport code (using
a PDF file whose only content is the single line "لوحة المفاتيح العربية"):

'''
<draw:text-box>
    <text:p text:style-name="paragraph11">
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ب </text:span>
        <text:span text:style-name="text13"> ر </text:span>
        <text:span text:style-name="text13"> ع </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> ي </text:span>
        <text:span text:style-name="text13"> ت </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13"> ف </text:span>
        <text:span text:style-name="text13"> م </text:span>
        <text:span text:style-name="text13"> ل </text:span>
        <text:span text:style-name="text13"> ا </text:span>
        <text:span text:style-name="text13">
            <text:s text:c="1" text:style-name="text13"> </text:s>
        </text:span>
        <text:span text:style-name="text13"> ة </text:span>
        <text:span text:style-name="text13"> ح </text:span>
        <text:span text:style-name="text13"> و </text:span>
        <text:span text:style-name="text13"> ل </text:span>
    </text:p>
</draw:text-box>
'''

As we can see above, this produces a large number of text:span elements with
the same style name. This makes the ODF XML stream huge before it is imported
into Draw (this may be the reason why pdfimport is very slow and consumes a lot
of memory and CPU for large PDFs).

The pdfimport code is intended to combine these characters into a single
string; see
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?#698
where it says it will "concatenate consecutive text elements unless there is a
font or text color or matrix change, leave a new span in that case".
However, it evidently fails to do so. Based on my observation, it fails
because in the following code block:
'''
                if( (pCur->FontId == pNext->FontId || isSpaces(pNext)) &&
                    rCurGC.FillColor.Red == rNextGC.FillColor.Red &&
                    rCurGC.FillColor.Green == rNextGC.FillColor.Green &&
                    rCurGC.FillColor.Blue == rNextGC.FillColor.Blue &&
                    rCurGC.FillColor.Alpha == rNextGC.FillColor.Alpha &&
                    (rCurGC.Transformation == rNextGC.Transformation || notTransformed(rNextGC))
                    )
'''
all the other conditions are true, except the "rCurGC.Transformation ==
rNextGC.Transformation".

So far I am still not sure why rCurGC.Transformation does not equal
rNextGC.Transformation when they should be the same.
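
To illustrate how a single failing comparison leaves one span per character,
here is a minimal sketch of the concatenation step. It is not the pdfimport
implementation: "TextRun", "canMerge" and "mergeRuns" are hypothetical names,
the struct keeps only the fields compared in the condition above, and the
isSpaces/notTransformed escape hatches are left out for brevity.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical, simplified text element holding only the compared fields.
struct TextRun
{
    std::u32string text;
    int fontId;
    std::array<double, 4> fillColor;      // red, green, blue, alpha
    std::array<double, 6> transformation; // affine matrix coefficients
};

// Simplified version of the quoted condition: every field must match exactly.
bool canMerge(const TextRun& cur, const TextRun& next)
{
    return cur.fontId == next.fontId &&
           cur.fillColor == next.fillColor &&
           cur.transformation == next.transformation;
}

// Concatenate consecutive runs; start a new run whenever the check fails.
std::vector<TextRun> mergeRuns(const std::vector<TextRun>& runs)
{
    std::vector<TextRun> merged;
    for (const TextRun& run : runs)
    {
        if (!merged.empty() && canMerge(merged.back(), run))
            merged.back().text += run.text;
        else
            merged.push_back(run);
    }
    return merged;
}
```

If every run carries a transformation that differs from its predecessor (for
example, a different translation per glyph, which is only an assumption here),
mergeRuns returns as many runs as it was given, which matches the
one-span-per-character output shown earlier.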

In sum, the mirrorString code would be reached if the single characters were
successfully combined into a string.

