[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222532#comment-16222532
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-----------------------------------------

This seems to be a moving target. Your original question seems to have been 
solved and it was not a bug, so I will close this issue soon.

That's why "how to" questions should not be on JIRA. In the future, please ask 
on the user mailing list (more flexible) or on Stack Overflow (best for very 
specific questions, and it reaches different people). Use JIRA only when told 
to, or when you know for sure that it's a bug.

You have now raised three new questions:
1) Why is the red bound larger than the bounding box? I can't tell, because you 
didn't attach a PDF. This may be a bug, or at least room for improvement. If 
you can share the PDF, please open a new issue in JIRA and attach your file, 
and I'll see what I can do.
2) Why do we use the red bounds? To decide whether glyphs are on the same line 
or not. The red bounds are based on various values in the font descriptor, but 
sadly these are not always accurate. See LegacyPDFStreamEngine.java: after the 
line "font.getBoundingBox()" there is some voodoo to correct inaccurate values.
3) Using the cyan glyph bounds for text extraction: yes, I suspect that this 
would be more accurate than the red bounds. However, it won't work for Type 3 
fonts, only for vector fonts. We did discuss this (using the cyan bounds) among 
committers a few years ago and we suspect that accurate bounds are better, but 
nobody has implemented it. To test this, you'd need to take some of the code in 
DrawPrintTextLocations and use it in LegacyPDFStreamEngine.java where the 
{{glyphHeight}} is calculated. When done, run the text stripper tests and look 
at the differences. If you're satisfied, ask me for the additional test files 
(not in the repository because of copyrights) and test with these. If you're 
going to implement this, please take it to the mailing list.
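
To illustrate the kind of correction meant in 2), here is a minimal sketch. It 
is not a copy of the PDFBox source. The class name, the method name, and the 
exact rule (start from half the font bounding box height, then prefer CapHeight 
when it is present and smaller) are assumptions made for illustration only, so 
check LegacyPDFStreamEngine.java for what is really done.

```java
public class GlyphHeightSketch {
    /**
     * Hypothetical helper: picks a height estimate in glyph space units.
     * bboxHeight is the height of the font-wide bounding box,
     * capHeight is the CapHeight of the font descriptor (0 if absent).
     */
    static float estimateGlyphHeight(float bboxHeight, float capHeight) {
        // Assumption: start from half the font bounding box height.
        float glyphHeight = bboxHeight / 2;
        // Assumption: prefer CapHeight when it is set and smaller than the
        // estimate, or when the bounding box gave no usable height at all.
        if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0)) {
            glyphHeight = capHeight;
        }
        return glyphHeight;
    }

    public static void main(String[] args) {
        // Implausibly tall font bounding box: CapHeight wins.
        System.out.println(estimateGlyphHeight(2400f, 700f)); // 700.0
        // No CapHeight available: half the bounding box height is kept.
        System.out.println(estimateGlyphHeight(1000f, 0f));   // 500.0
    }
}
```

Because one value is used for the whole font, every glyph gets the same height 
whether it is an "x" or a "Q", which is one way the red bounds can end up taller 
than what is actually drawn.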

> x,y co-ordinates of the text inside the cell are not getting correctly.
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-3970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Operating system: Windows 7 (64 bit).
>            Reporter: Navnath Kumbhar
>         Attachments: formula-marked-34.png, 
> paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the 
> extracted text in some .txt file which can be read by other product.
> My issue is regarding extracting the text inside the cell of a table: 
> *x,y co-ordinates of the text inside the cell are not getting correctly.*
> Y value of the last text line in the cell is getting larger than cell's max-Y 
> value.
> I have attached the test file with this bug.
> As you can see in the test document, there is one cell along-with text in it 
> and a text paragraph next to that cell.
> x-y coordinates that I get from pdfbox for all the paths (two vertical and 
> two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1 : [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (Y values of the above paths are final values by subtracting the actual value 
> given by pdfbox from height of the page as I see that for paths, y-values are 
> processed from bottom to up)
> And bounding box of the last line in that cell is : [102,114,59,7] and hence 
> max-Y of that line becomes 121 (min-Y + height)
>  
> So, if we consider max-Y value of that cell (i.e. 120)  and that of last line 
> in that cell (i.e. 121), clearly, that line goes out of that cell.
> What can be the possible reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar
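
The arithmetic in the report above can be checked with a few lines of Java. 
This is only a sketch using java.awt.geom.Rectangle2D and the numbers quoted in 
the report (top-left origin, y growing downwards). The class name and the 
helper overshootBelow are hypothetical and not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

public class CellOvershootCheck {
    /** How far the text box extends past the bottom edge of the cell (0 if it fits). */
    static double overshootBelow(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        // Cell from the report: x from 100 to 220, y from 88 to 120.
        Rectangle2D cell = new Rectangle2D.Double(100, 88, 120, 32);
        // Last text line in the cell: [x, y, width, height] = [102, 114, 59, 7].
        Rectangle2D text = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println(text.getMaxY());             // 121.0 (114 + 7)
        System.out.println(cell.contains(text));        // false
        System.out.println(overshootBelow(cell, text)); // 1.0
    }
}
```

The text box is one unit taller than the space left in the cell, which is what 
you would expect if the reported height comes from a font-wide value rather 
than from the outlines of the glyphs on that line.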


