[jira] [Comment Edited] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

Navnath Kumbhar (JIRA) Fri, 27 Oct 2017 03:33:26 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222132#comment-16222132
 ]


Navnath Kumbhar edited comment on PDFBOX-3970 at 10/27/17 10:32 AM:
--------------------------------------------------------------------

Hello Tillman,

Thank you for your helpful feedback. I tried your suggestion, except for the 
font bounding box part, and it worked well.

However, it has a limitation on some PDF pages from which I am trying to 
extract mathematical formulas. I am attaching the output of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *summation* 
symbols get mixed up with the lines before them [as far as the red rectangles 
are concerned, whose coordinate values are computed by Java, as you mentioned 
in your last comment].

Generally, the red box lies inside the font bounding box, but that is not the 
case in the attached example.

Can we just use the glyph bounds [the ones in cyan] to extract text, since 
they look like the exact bounds for the TextPosition? If so, can we do it with 
the TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need the red rectangle [which is not real but only a heuristic used 
to extract the text]?

Thank you again for your help! 
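
To make the glyph-bounds question concrete, here is a minimal sketch of how a 
tight bound can be derived from a glyph outline. It uses only java.awt.geom 
and does not call PDFBox. It assumes the outline is available as a Shape in 
glyph space with 1000 units per em, and that a text-to-page AffineTransform is 
known. The class name, the method name, the rectangular stand-in glyph and the 
matrix values are all hypothetical.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

final class GlyphBoundsSketch {

    /**
     * Returns the bounds of a glyph outline after mapping it to page space.
     * Assumption: the outline uses 1000 units per em, so it is scaled by
     * 1/1000 before the text-to-page transform is applied.
     */
    static Rectangle2D glyphBoundsOnPage(Shape glyphOutline, AffineTransform textToPage) {
        AffineTransform at = new AffineTransform(textToPage); // copy, keep the caller's matrix intact
        at.scale(1 / 1000.0, 1 / 1000.0);                     // glyph units -> text units
        return at.createTransformedShape(glyphOutline).getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical tall glyph (a stand-in for a summation sign): it
        // reaches below the baseline (-250) and above the em height (1050).
        Path2D tall = new Path2D.Double();
        tall.moveTo(0, -250);
        tall.lineTo(600, -250);
        tall.lineTo(600, 1050);
        tall.lineTo(0, 1050);
        tall.closePath();

        // Hypothetical text matrix: font size 10, glyph origin at (100, 200).
        AffineTransform textToPage = new AffineTransform(10, 0, 0, 10, 100, 200);

        Rectangle2D bounds = glyphBoundsOnPage(tall, textToPage);
        System.out.println("glyph bounds on page: " + bounds);
    }
}
```

With these made-up values the glyph is 13 units tall for a font size of 10, so 
any rectangle whose height is derived from the font size alone would not cover 
it. The result is in a y-up coordinate system; it still has to be flipped if 
top-down coordinates are needed.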







> x,y co-ordinates of the text inside the cell are not getting correctly.
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-3970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.7
>         Environment: Operating system: Windows 7 (64 bit).
>            Reporter: Navnath Kumbhar
>         Attachments: formula-marked-34.png, 
> paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the 
> extracted text in a .txt file that can be read by another product.
> My issue concerns extracting the text inside a table cell: 
> *the x,y coordinates of the text inside the cell are not computed correctly.*
> The Y value of the last text line in the cell is larger than the cell's max-Y 
> value.
> I have attached a test file that shows this bug.
> As you can see in the test document, there is one cell with text in it 
> and a text paragraph next to that cell.
> The x,y coordinates that I get from PDFBox for all the paths (two vertical 
> and two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1: [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting 
> the value given by PDFBox from the height of the page, since the y-values of 
> paths are measured from the bottom up.)
> The bounding box of the last line in that cell is [102,114,59,7], so the 
> max-Y of that line is 121 (min-Y + height).
>  
> So, comparing the max-Y value of the cell (120) with that of the last line 
> in the cell (121), that line clearly extends outside the cell.
> What could be the reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar
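
The arithmetic in the description can be reproduced with a small sketch that 
uses only java.awt.geom. The cell and text rectangles are the values quoted 
above. The class and method names, the page height of 792 and the tolerance 
of 1 unit are assumptions made for illustration; the report gives neither the 
page height nor the cause of the 1-unit overshoot.

```java
import java.awt.geom.Rectangle2D;

final class CellBounds {

    /** Converts a y value measured from the bottom of the page to one measured from the top. */
    static double flipY(double yFromBottom, double pageHeight) {
        return pageHeight - yFromBottom;
    }

    /** True if the text box lies inside the cell, allowing the given tolerance on every side. */
    static boolean isInside(Rectangle2D cell, Rectangle2D text, double tolerance) {
        return text.getMinX() >= cell.getMinX() - tolerance
            && text.getMaxX() <= cell.getMaxX() + tolerance
            && text.getMinY() >= cell.getMinY() - tolerance
            && text.getMaxY() <= cell.getMaxY() + tolerance;
    }

    public static void main(String[] args) {
        // Cell built from the four reported lines: x 100..220, y 88..120.
        Rectangle2D cell = new Rectangle2D.Double(100, 88, 220 - 100, 120 - 88);
        // Reported bounding box of the last text line: [x, y, width, height].
        Rectangle2D lastLine = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println("last line max-Y = " + lastLine.getMaxY());                   // 121.0
        System.out.println("inside (tolerance 0): " + isInside(cell, lastLine, 0.0));    // false
        System.out.println("inside (tolerance 1): " + isInside(cell, lastLine, 1.0));    // true

        // Hypothetical page height of 792: a y of 704 from the bottom maps to 88 from the top.
        System.out.println("flipped y = " + flipY(704, 792));                             // 88.0
    }
}
```

A tolerance only hides the overshoot. Whether it is appropriate depends on how 
the text height was obtained, which this sketch does not address.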



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
