[ 
https://issues.apache.org/jira/browse/PDFBOX-3833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056369#comment-16056369
 ] 

Tilman Hausherr commented on PDFBOX-3833:
-----------------------------------------

The cause of the problem is that the "ー" slightly overlaps with the next glyph. 
The code from PDFTextStripper then thinks they belong together:
{code:java}
    TextPosition previousTextPosition = textList.get(textList.size() - 1);
    if (text.isDiacritic() && previousTextPosition.contains(text))
    {
        previousTextPosition.mergeDiacritic(text);
    }
    // If the previous TextPosition was the diacritic, merge it into this
    // one and remove it from the list.
    else if (previousTextPosition.isDiacritic() && text.contains(previousTextPosition))
    {
        text.mergeDiacritic(previousTextPosition);
        textList.remove(textList.size() - 1);
        textList.add(text);
    }
    else
    {
        textList.add(text);
    }
{code}
{{TextPosition.contains()}} allows an overlap of 15%. I had to set it to 25% 
for the extraction to work properly.
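
To illustrate why moving the threshold from 15% to 25% changes the outcome, here is a minimal, self-contained sketch of a one-dimensional overlap-tolerance check. It is not the actual {{TextPosition.contains()}} code: the class name, method names, and glyph coordinates are all hypothetical, and it only considers a following glyph that starts at or after the start of the first one.

```java
public final class OverlapToleranceSketch {

    /**
     * Fraction of the first glyph's width that is covered by a following glyph.
     * Assumes secondStart >= firstStart (hypothetical simplification).
     */
    public static double overlapFraction(double firstStart, double firstWidth,
                                         double secondStart) {
        double firstEnd = firstStart + firstWidth;
        double overlap = firstEnd - secondStart;
        if (overlap <= 0 || firstWidth <= 0) {
            return 0.0; // the glyphs do not overlap horizontally
        }
        return Math.min(overlap / firstWidth, 1.0);
    }

    /** True if the overlap exceeds the tolerance, i.e. the glyphs count as stacked. */
    public static boolean treatedAsSamePosition(double firstStart, double firstWidth,
                                                double secondStart, double tolerance) {
        return overlapFraction(firstStart, firstWidth, secondStart) > tolerance;
    }

    public static void main(String[] args) {
        // Made-up numbers: a glyph of width 10 whose successor starts 2 units
        // before its end, which is a 20% overlap.
        System.out.println(treatedAsSamePosition(0, 10, 8, 0.15)); // true
        System.out.println(treatedAsSamePosition(0, 10, 8, 0.25)); // false
    }
}
```

With these made-up numbers, a 20% overlap counts as "contained" under a 15% tolerance but not under a 25% tolerance, which mirrors the behaviour change described above.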

Another solution would be to have {{isDiacritic()}} return false for "ー".
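
A rough sketch of that second option follows. It assumes the diacritic test is based on the Unicode general category of a single character, which is an assumption made for illustration and not a quote of the PDFBox source. The class and method names are hypothetical. U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) is in the "modifier letter" category, so a category-based test would flag it unless it is exempted explicitly.

```java
public final class DiacriticExemptionSketch {

    /** Hypothetical category-based test; not necessarily what PDFBox does. */
    public static boolean looksLikeDiacriticByCategory(String text) {
        if (text.length() != 1) {
            return false; // only single UTF-16 code units are considered here
        }
        int type = Character.getType(text.charAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.MODIFIER_SYMBOL
                || type == Character.MODIFIER_LETTER;
    }

    /** Same test, but with an explicit exemption for the prolonged sound mark. */
    public static boolean isDiacriticSketch(String text) {
        if ("\u30FC".equals(text)) {
            return false; // "ー" is a full-width character, not a combining mark
        }
        return looksLikeDiacriticByCategory(text);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDiacriticByCategory("\u30FC")); // true
        System.out.println(isDiacriticSketch("\u30FC"));            // false
        System.out.println(isDiacriticSketch("\u0301"));            // true (combining acute)
    }
}
```

Compared with raising the overlap tolerance, an exemption like this leaves the handling of all other glyphs unchanged.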

> Characters in wrong order
> -------------------------
>
>                 Key: PDFBOX-3833
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3833
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.5
>            Reporter: Christopher Creutzig
>         Attachments: ML_mathworks_unc2.pdf, PDFBOX-3833-reduced.pdf
>
>
> The attached pdf file (which is page 3 of 
> https://jp.mathworks.com/tagteam/89688_93050v00_JP_machine_learning_section1_ebook.pdf)
>  shows multiple problems when reading with PDFBox in standard settings. This 
> bug report in particular is about the Katakana ー being misplaced.
> In the text block on the left, the second line starts with ターン. 
> PDFTextStripper.getText returns text starting with タ ンー (i.e., adding a space 
> after the first character and swapping the second and third one). This effect 
> also happens at other places in the (complete) file.
> The PDF itself at this point has [<03BB>43.9 <0294>156 <03EF>-24.5 ...]TJ, 
> listing the characters in the proper order. Copy&paste using Apple's 
> Preview.App also preserves that order.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
