[jira] [Updated] (PDFBOX-878) Incorrect text extraction when text rotation is not 0,90,180,270

Maruan Sahyoun (Jira) Sun, 01 Nov 2020 09:37:16 -0800


     [ 
https://issues.apache.org/jira/browse/PDFBOX-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Maruan Sahyoun updated PDFBOX-878:
----------------------------------
    Attachment: rotation-rotationMagic.txt

> Incorrect text extraction when text rotation is not 0,90,180,270
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-878
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-878
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: David Rodríguez Alfayate
>            Priority: Major
>         Attachments: pdfbox-word-rotation.patch, rotation-rotationMagic.txt, 
> rotation.pdf, rotation_failure.txt
>
>
> Currently, text extraction only supports rotations of 0, 90, 180, or 270 
> degrees, so text rotated at any other angle is extracted incorrectly. I have 
> attached a simple PDF and the text extraction result as output by ExtractText.
> I have made a patch for the current revision (1.4.0) that takes into account 
> any rotation in the current matrix position. I had to refactor the 
> assumption that (0,0) is the upper-left corner, because for rotations that 
> place text outside of that original assumption, a word or a line could be 
> split. 
> Since we have some needs for a project we are working on, I have also changed 
> the way normalization and line printing are done. In the current codebase, 
> the normalize function returns a List of Strings; with my changes, this 
> method returns a List of Words, which are ICU-normalized and then printed 
> in the current writeLine method. My patch also includes a sample 
> PDF2XML class, which converts the PDF to XML, handling each word 
> separately.
> I submit the test cases and the patch for your consideration.
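
For readers unfamiliar with the geometry behind the report, here is a minimal
sketch of the idea, not the attached patch and not PDFBox API. It assumes a
text matrix in the PDF form [a b c d e f], where a pure rotation by theta has
a = cos(theta) and b = sin(theta). The class and method names
(TextRotationSketch, angleDegrees, isAxisAligned, toBaselineFrame) are
hypothetical and exist only for illustration.

```java
/**
 * Illustrative sketch only: hypothetical helper, not part of PDFBox.
 * Assumes the text matrix entries a and b describe a pure rotation
 * (a = cos(theta), b = sin(theta)), possibly scaled uniformly.
 */
final class TextRotationSketch {

    private TextRotationSketch() {
    }

    /** Rotation angle in non-negative degrees, derived from matrix entries a and b. */
    public static double angleDegrees(double a, double b) {
        double deg = Math.toDegrees(Math.atan2(b, a));
        return deg < 0 ? deg + 360.0 : deg;
    }

    /** True when the angle is within the tolerance of 0, 90, 180 or 270 degrees. */
    public static boolean isAxisAligned(double a, double b, double toleranceDegrees) {
        double remainder = angleDegrees(a, b) % 90.0;
        return remainder <= toleranceDegrees || 90.0 - remainder <= toleranceDegrees;
    }

    /**
     * Rotates a point by minus the text angle, so that glyphs lying on the
     * same rotated baseline end up with (nearly) the same y coordinate and
     * can be grouped into one line.
     */
    public static double[] toBaselineFrame(double x, double y, double angleDeg) {
        double r = Math.toRadians(-angleDeg);
        double cos = Math.cos(r);
        double sin = Math.sin(r);
        return new double[] { x * cos - y * sin, x * sin + y * cos };
    }
}
```

An extractor that groups glyphs by raw page coordinates will only get lines
right when isAxisAligned holds. Mapping positions into the baseline frame
first is one possible way to keep words and lines together at other angles.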



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
