[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-3970:
------------------------------------
    Attachment: LegacyPDFStreamEngine.java

It looked good at first, but when I ran it on the file from PDFBOX-4000 I got
some weird effects: "N" and "o" had almost the same visual bounds. One problem
in your code is that you're using the height instead of the upper Y
coordinate. I then removed the translation, because we need the shapes
relative to the baseline and don't care about the rendering position on the
page (that is handled later in the code). I am attaching my current code; it
is a bit messy, so diff it against yours to see what changed. I see you
changed the type3 code at the bottom. I didn't investigate that and am still
using the existing type3 code.
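
To illustrate the height vs. upper Y point, here is a minimal sketch. It is
not taken from the attached LegacyPDFStreamEngine.java; the class name, the
method names and the outline numbers are made up, and it assumes glyph
outlines given relative to the baseline (y = 0) with the Y axis pointing up.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical sketch, not the attached LegacyPDFStreamEngine.java.
final class GlyphTopSketch {
    // Total vertical extent of the outline, including any part below the baseline.
    static double height(Rectangle2D glyphBounds) {
        return glyphBounds.getHeight();
    }

    // Upper Y coordinate relative to the baseline (y = 0, Y axis pointing up).
    static double upperY(Rectangle2D glyphBounds) {
        return glyphBounds.getMaxY();
    }

    public static void main(String[] args) {
        // Made-up outlines in glyph units: x, lowest y, width, height.
        Rectangle2D withDescender = new Rectangle2D.Double(50, -200, 400, 680);
        Rectangle2D withoutDescender = new Rectangle2D.Double(50, 0, 400, 480);

        // For each outline, print its height next to its upper Y.
        System.out.println(height(withDescender) + " vs " + upperY(withDescender));
        System.out.println(height(withoutDescender) + " vs " + upperY(withoutDescender));
    }
}
```

For the outline with a descender, the height (680) overstates how far the
glyph reaches above the baseline (480). The two values only agree when the
outline starts exactly on the baseline.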

I then ran the tests and got many regressions. Some can be explained (e.g.
PDFBOX-3062-002207-p1.pdf), but at least one is weird: the sorted result of
PDFBOX-2984-page180°.pdf.

> x,y co-ordinates of the text inside the cell are not getting correctly.
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-3970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Operating system: Windows 7 (64 bit).
>            Reporter: Navnath Kumbhar
>              Labels: how-to
>         Attachments: LegacyPDFStreamEngine.java, formula-marked-34.png, 
> paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf, 
> simpleAnnotation.pdf
>
>
> Hello Support Team,
> I am working on a project that parses a whole PDF document and stores the
> extracted text in a .txt file that can be read by another product.
> My issue is regarding extracting the text inside the cell of a table: 
> *x,y co-ordinates of the text inside the cell are not getting correctly.*
> The Y value of the last text line in the cell comes out larger than the
> cell's max-Y value.
> I have attached the test file with this bug.
> As you can see in the test document, there is one cell with text in it and a
> text paragraph next to that cell.
> The x,y coordinates that I get from pdfbox for the paths of the cell (two
> vertical and two horizontal lines) are, in x1,y1,x2,y2 format:
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1 : [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting
> the value given by pdfbox from the height of the page, since for paths the
> Y values are measured from the bottom up.)
> The bounding box of the last line in that cell is [102,114,59,7], and hence
> the max-Y of that line becomes 121 (min-Y + height).
>  
> So, if we compare the max-Y value of that cell (i.e. 120) with that of the
> last line in that cell (i.e. 121), that line clearly extends outside the cell.
> What could be the reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar
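
The arithmetic in the report can be reproduced with a small sketch. It assumes
the text bounding box is in x, y, width, height order (which matches
114 + 7 = 121) and that all Y values are already top-down. The class and
method names are hypothetical, and the page height of 792 is only an
illustrative value, since the report does not state it.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical sketch of the check described in the report; names are made up.
final class CellBoundsCheck {
    // Convert a Y value measured from the bottom of the page to one measured from the top.
    static double flipY(double pageHeight, double yFromBottom) {
        return pageHeight - yFromBottom;
    }

    // How far the text box extends past the lower edge (max-Y) of the cell, or 0 if it fits.
    static double overshoot(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0.0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        // Cell spanned by the four lines: x 100..220, y 88..120.
        Rectangle2D cell = new Rectangle2D.Double(100, 88, 120, 32);
        // Bounding box of the last text line: [102,114,59,7].
        Rectangle2D text = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println("cell max-Y: " + cell.getMaxY());
        System.out.println("text max-Y: " + text.getMaxY());
        System.out.println("overshoot:  " + overshoot(cell, text));

        // Assumed page height of 792, for illustration only.
        System.out.println("flipped:    " + flipY(792, 672));
    }
}
```

With the reported numbers, the text box ends at 121 while the cell ends at 120,
so the overshoot is 1 unit.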



