[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933076#comment-14933076
 ] 

Andreas Meier commented on PDFBOX-2252:
---------------------------------------

Most of the files I see do not provide any article information.

I think a basic document layout analysis will be needed to solve some of the 
existing problems in the text extraction.

If you can remember some of the projects, let me know.

The projects I have discovered so far were either not written in Java or 
implemented for a specific Java version.
I haven't found a stable and fast Java solution yet.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and decides whether the whole 
> content should be reversed or not. That is not correct; it must operate on 
> each word, not the whole document.
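
To illustrate the per-word decision the reporter asks for, here is a minimal sketch. It is not the PDFTextStripper code and not a proposed patch: the class name PerWordDirectionSketch and both method names are hypothetical, and it assumes the input is a single line of space-separated words in visual order. It applies the same rtlCount > ltrCount comparison, but once per word instead of once for the whole content.

```java
public class PerWordDirectionSketch {

    /** Returns true if the word has more strong RTL characters than strong LTR ones. */
    public static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            // Digits, punctuation and other neutral characters are not counted.
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /** Reverses only the words classified as RTL; all other words stay unchanged. */
    public static String reverseRtlWords(String line) {
        // Limit -1 keeps empty strings, so repeated or trailing spaces are preserved.
        String[] words = line.split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            if (isRtlDominant(words[i])) {
                words[i] = new StringBuilder(words[i]).reverse().toString();
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Mixed line: Latin word, Hebrew word, Latin word.
        System.out.println(reverseRtlWords("abc \u05D0\u05D1\u05D2 def"));
    }
}
```

This only reorders characters inside each word. It does not reorder the words of a longer RTL run, and it ignores mirrored characters and numbers, which the Unicode Bidirectional Algorithm handles. Some layout analysis would probably still be needed on top of it.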


