[
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220584#comment-16220584
]
Tilman Hausherr commented on PDFBOX-3970:
-----------------------------------------
Sorry, my comment was useless. I reread your post... IMHO your
calculations are OK, i.e. the Java y coordinates are between 88 and 120, as
you wrote. Then you wrote
{code}
And bounding box of the last line in that cell is : 102,114,59,7 and hence
max-Y of that line becomes 121 (min-Y + height)
{code}
I don't know what you mean by "bounding box", i.e. how you calculated that.
Let's have a look at the bottom line, starting with its first "p".
DrawPrintTextLocations prints these lines for it:
{code}
String[102.0,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]p
String[108.672,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[115.343994,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=3.9960022]r
String[119.34,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[126.01199,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]g
String[132.68399,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=3.9960022]r
String[136.68,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[143.35199,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]p
String[150.02399,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]h
{code}
So the y is 114 (Java coordinate), and the height (which is not a real height,
see the comment in the code of DrawPrintTextLocations) is almost 7. The height
is measured from the baseline, which is why all glyphs have the same y here.
But because the 114 is a Java y coordinate (not PDF), you must subtract the
height from it to get the top position, which would be 114 - 6.936 = 107.064.
Smaller Java y values = higher on your screen. Smaller PDF y values = lower on
your screen.
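
To make that arithmetic concrete, here is a minimal sketch in plain Java (no
PDFBox calls). The class and method names are made up for illustration; the
numbers are the ones from the DrawPrintTextLocations output above and from the
cell lines in the report. It assumes an unrotated page.

```java
/** Illustrative sketch only: vertical extent of a glyph in Java (top-down) coordinates. */
final class GlyphExtent {

    /** The reported height is measured upward from the baseline, so it is subtracted. */
    static double topY(double baselineY, double height) {
        return baselineY - height;
    }

    /** True if the span [top, bottom] lies inside [cellMinY, cellMaxY] (Java coordinates). */
    static boolean insideCell(double top, double bottom, double cellMinY, double cellMaxY) {
        return top >= cellMinY && bottom <= cellMaxY;
    }

    public static void main(String[] args) {
        double baselineY = 114.0; // y printed by DrawPrintTextLocations
        double height = 6.936;    // "height" from the same output, not a true glyph height
        double top = topY(baselineY, height);

        // Cell borders from the report: horizontal lines at y = 88 and y = 120.
        boolean inside = insideCell(top, baselineY, 88.0, 120.0);
        System.out.println("top=" + top + " baseline=" + baselineY + " inside=" + inside);
    }
}
```

Adding the height to 114 instead of subtracting it gives about 121, which
matches the max-Y in the report.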
Now if you take the font bounding box, you get -166.0,-225.0,1000.0,931.0. This
must be divided by 1000 (all fonts except type 3) and transformed with the text
rendering matrix (here: a scale of 12). So min y would be -2.7 and max y would
be 11.172. Both would have to be subtracted from the baseline y, which gives
Java y values of about 116.7 (bottom) and 102.83 (top), both within 88 to 120.
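
The font bounding box calculation can be sketched the same way. This is plain
Java with made-up class and method names, and it assumes the text rendering
matrix is a pure scale of 12 (no rotation or skew), as in this file. With a
more general matrix the box corners would have to be transformed as points.

```java
/** Illustrative sketch only: font bounding box mapped to Java y coordinates. */
final class FontBoxExtent {

    /** Glyph space to text space for non-Type 3 fonts: divide by 1000, then scale. */
    static double scale(double glyphUnits, double fontScale) {
        return glyphUnits / 1000.0 * fontScale;
    }

    /** Java y of a point lying offsetAboveBaseline above the baseline (negative = below). */
    static double javaY(double baselineY, double offsetAboveBaseline) {
        return baselineY - offsetAboveBaseline;
    }

    public static void main(String[] args) {
        // Font bounding box as quoted above: lower-left x, lower-left y, upper-right x, upper-right y.
        double[] fontBBox = {-166.0, -225.0, 1000.0, 931.0};
        double fontScale = 12.0;  // assumed pure scaling, taken from fs/xscale in the output
        double baselineY = 114.0; // Java y of the baseline

        double minY = scale(fontBBox[1], fontScale); // descent side, negative
        double maxY = scale(fontBBox[3], fontScale); // ascent side, positive

        double bottom = javaY(baselineY, minY); // below the baseline, so a larger Java y
        double top = javaY(baselineY, maxY);    // above the baseline, so a smaller Java y
        System.out.println("minY=" + minY + " maxY=" + maxY + " top=" + top + " bottom=" + bottom);
    }
}
```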
Did this help? If not, please explain how you got the "bounding box of the last
line in that cell".
> x,y co-ordinates of the text inside the cell are not getting correctly.
> -----------------------------------------------------------------------
>
> Key: PDFBOX-3970
> URL: https://issues.apache.org/jira/browse/PDFBOX-3970
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Environment: Operating system: Windows 7 (64 bit).
> Reporter: Navnath Kumbhar
> Attachments: paragraphNextToTable-marked-1.png,
> paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the
> extracted text in some .txt file which can be read by other product.
> My issue is regarding extracting the text inside the cell of a table:
> *x,y co-ordinates of the text inside the cell are not getting correctly.*
> Y value of the last text line in the cell is getting larger than cell's max-Y
> value.
> I have attached the test file with this bug.
> As you can see in the test document, there is one cell along-with text in it
> and a text paragraph next to that cell.
> x-y coordinates that I get from pdfbox for all the paths (two vertical and
> two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1 : [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (Y values of the above paths are final values by subtracting the actual value
> given by pdfbox from height of the page as I see that for paths, y-values are
> processed from bottom to up)
> And bounding box of the last line in that cell is : [102,114,59,7] and hence
> max-Y of that line becomes 121 (min-Y + height)
>
> So, if we consider max-Y value of that cell (i.e. 120) and that of last line
> in that cell (i.e. 121), clearly, that line goes out of that cell.
> What can be the possible reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar