[ 
https://issues.apache.org/jira/browse/PDFBOX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844725#action_12844725
 ] 

Mel Martinez commented on PDFBOX-659:
-------------------------------------

Okay - Villu's comment plus some things I'm seeing suggest it's a combination 
of effects.

First off, the document renders perfectly in Acrobat Reader, which suggests 
that whatever is going on in the document is probably 'legal', or at least 
something we should be able to handle.

Villu's comment indicates that coordinates are shifted.

When I step through the 'rendering' of the individual TextPosition objects I 
note that for the messed up text, the Y coordinates are shifted into negative 
space.

This shouldn't be a problem - we should be rendering them against an offset 
origin - all that matters is their relative positions.  That's why Acrobat 
Reader renders them correctly.

However, when we do our text extraction, our 'text rendering' process includes 
a step to determine whether a TextPosition object is still on the same line as 
the prior TextPosition object.  To do this, it compares the current Y position 
and height to the prior Y position and height.  This is fine, except that on 
the first pass it needs some default value to compare against.  The code uses 
-1.0 as the default 'last' Y position.  From that point, as it iterates 
through, if the current position is above the last Y position, it resets the 
last Y position variable to the current position.

Do you see the problem?  If all the text is being rendered in negative Y 
space, then none of the Y values is ever 'above' the -1.0 value used as the 
default to start the iteration.  So it never properly resets the 'last Y 
position'.  This causes it to incorrectly think it is on a new line when it 
really isn't.  Hence it inserts the newline characters.
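
To make the failure mode concrete, here is a minimal Java sketch of the logic 
described above.  It is not the actual PDFTextStripper code: the Glyph class, 
the method names, the 'same line' test (Y difference smaller than the glyph 
height), and reading 'above' as 'numerically greater' are all assumptions made 
for illustration.  The second method shows one possible way to avoid the magic 
default, by tracking 'no previous position yet' explicitly; it is not 
necessarily what the eventual patch will do.

```java
import java.util.Arrays;
import java.util.List;

final class SameLineSketch {

    /** Hypothetical stand-in for a positioned glyph; not the PDFBox TextPosition API. */
    static final class Glyph {
        final String text;
        final float y;
        final float height;

        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /** Models the described behaviour: -1.0 is the default 'last' Y position. */
    static String extractWithSentinel(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        float lastY = -1.0f; // magic default used before any glyph has been seen
        for (Glyph g : glyphs) {
            // Assumed same-line test: Y differs by less than the glyph height.
            boolean sameLine = Math.abs(g.y - lastY) < g.height;
            if (!sameLine && out.length() > 0) {
                out.append('\n');
            }
            out.append(g.text);
            // Only updated when the glyph is 'above' (assumed: greater than) lastY.
            // For Y values below -1.0 this branch is never taken.
            if (g.y > lastY) {
                lastY = g.y;
            }
        }
        return out.toString();
    }

    /** One possible alternative: NaN marks 'no previous glyph', no magic value. */
    static String extractWithoutSentinel(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        float lastY = Float.NaN;
        for (Glyph g : glyphs) {
            if (!Float.isNaN(lastY) && Math.abs(g.y - lastY) >= g.height) {
                out.append('\n');
            }
            out.append(g.text);
            lastY = g.y; // always remember the most recent position
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Three fragments of one visual line, all in negative Y space.
        List<Glyph> line = Arrays.asList(
                new Glyph("An", -100.0f, 10f),
                new Glyph(" Asp", -99.6f, 10f),
                new Glyph("ect", -100.2f, 10f));
        System.out.println(extractWithSentinel(line));    // broken across three lines
        System.out.println(extractWithoutSentinel(line)); // prints "An Aspect"
    }
}
```

With the same fragments shifted to positive Y (say 100.0, 100.4, 99.8), the 
sentinel version also produces "An Aspect", which is consistent with the bug 
only showing up for documents like this one.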

I'll have to think this through a bit to make sure the solution is robust.  
But I should be able to post a patch early next week.

This also affects my PDFTextStripper2 class ( PDFBOX-521 ) so I will patch that 
at the same time.

> Newlines added in the middle of words
> -------------------------------------
>
>                 Key: PDFBOX-659
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-659
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 0.8.0-incubator
>            Reporter: Mario Sangiorgio
>         Attachments: fulltext.pdf, page.png
>
>
> I am experiencing issues getting the text from a PDF document. The document I 
> want to extract the text from is a scientific paper.
> The tool works fine, but in the title there are some problems: in the middle 
> of some words I get a newline.
> An example of what I am getting is the following text:
> An
> Asp
> e
> ct-Orien
> ted
> F
> ramew
> o
> rk
> for
> S
> ervice
> A
> d
> aptation
> rather than the expected "An Aspect-Oriented Framework for Service 
> Adaptation".
> Please let me know if I can help find the bug.

