[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2018-01-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339868#comment-16339868
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

wrong_space_parsed_sample.pdf is from Hesham Gneady from the mailing list and 
fails text extraction with the repository code and succeeds with the modified 
code in this issue.

> x,y co-ordinates of the text inside the cell are not getting correctly.
> ---
>
> Key: PDFBOX-3970
> URL: https://issues.apache.org/jira/browse/PDFBOX-3970
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.7
> Environment: Operating system: Windows 7 (64 bit).
> Reporter: Navnath Kumbhar
> Priority: Major
> Labels: how-to
> Attachments: LegacyPDFStreamEngine.java, LegacyPDFStreamEngine.java, 
> formula-marked-34.png, paragraphNextToTable-marked-1.png, 
> paragraphNextToTable.pdf, simpleAnnotation.pdf, wrong_space_parsed_sample.pdf
>
>
> Hello Support Team,
> I am working on a project that parses a whole PDF document and stores the 
> extracted text in a .txt file that can be read by another product.
> My issue concerns extracting the text inside a table cell: 
> *the x,y coordinates of the text inside the cell are not reported correctly.*
> The Y value of the last text line in the cell is larger than the cell's max-Y 
> value.
> I have attached a test file that shows this bug.
> As you can see in the test document, there is one cell with text in it 
> and a text paragraph next to that cell.
> The x,y coordinates that I get from pdfbox for all the paths (two vertical and 
> two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1: [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting the 
> value given by pdfbox from the height of the page, because for paths the 
> y-values are measured from the bottom up.)
> The bounding box of the last line in that cell is [102,114,59,7], and hence 
> the max-Y of that line becomes 121 (min-Y + height).
>  
> So, comparing the max-Y value of the cell (i.e. 120) with that of the last line 
> in the cell (i.e. 121), that line clearly extends outside the cell.
> What could be the reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-11-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234527#comment-16234527
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

IIRC text extraction isn't done on annotations. ???




[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-31 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226901#comment-16226901
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

I mean this part:
{code}
BoundingBox bbox = font.getBoundingBox();
if (bbox.getLowerLeftY() < Short.MIN_VALUE)
{
    // PDFBOX-2158 and PDFBOX-3130
    // files by Salmat eSolutions / ClibPDF Library
    bbox.setLowerLeftY(- (bbox.getLowerLeftY() + 65536));
}
// 1/2 the bbox is used as the height todo: why?
float glyphHeight = bbox.getHeight() / 2;

// sometimes the bbox has very high values, but CapHeight is OK
PDFontDescriptor fontDescriptor = font.getFontDescriptor();
if (fontDescriptor != null)
{
    float capHeight = fontDescriptor.getCapHeight();
    if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0))
    {
        glyphHeight = capHeight;
    }
}
{code}
That should somehow be replaced with some of the code from 
DrawPrintTextLocations.calculateGlyphBounds().
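
For reference, the height heuristic quoted above can be restated as a standalone sketch that uses plain numbers instead of PDFBox types. The class and method names are hypothetical and not part of PDFBox; only the arithmetic is taken from the snippet. With the font bounding box reported elsewhere in this thread (lower-left y = -225, upper-right y = 931) and no usable CapHeight, half the bbox height is 578 font units, which at font size 12 gives about 6.936. That appears consistent with the height=6.936 printed by DrawPrintTextLocations.

```java
// Hypothetical standalone restatement of the quoted heuristic; not PDFBox API.
final class GlyphHeightSketch
{
    // Returns the glyph height in font units (1/1000 em for non-Type 3 fonts).
    static double glyphHeight(double lowerLeftY, double upperRightY, double capHeight)
    {
        if (lowerLeftY < Short.MIN_VALUE)
        {
            // same correction as in the quoted snippet (PDFBOX-2158, PDFBOX-3130)
            lowerLeftY = -(lowerLeftY + 65536);
        }
        // half the bbox height is used as the height
        double height = (upperRightY - lowerLeftY) / 2;
        // prefer CapHeight when it is set and smaller, or when the bbox height is 0
        if (capHeight != 0 && (capHeight < height || height == 0))
        {
            height = capHeight;
        }
        return height;
    }

    // Scales font units to text space, assuming a plain scaling by the font size.
    static double toTextSpace(double fontUnits, double fontSize)
    {
        return fontUnits / 1000 * fontSize;
    }

    public static void main(String[] args)
    {
        double h = glyphHeight(-225, 931, 0);
        System.out.println(h);
        System.out.println(toTextSpace(h, 12));
    }
}
```

Because the result depends only on the font-wide bounding box or CapHeight, every glyph of the font gets the same height, which is why glyph-specific bounds could be more accurate.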




[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-31 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226806#comment-16226806
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

Do you mean in the *showGlyph* method?
I have copied some code from the *showGlyph* method.
{code:java}
/* 162 */ BoundingBox bbox = font.getBoundingBox();
/* 163 */ if (bbox.getLowerLeftY() < -32768.0F)
/* */ {
/* */ 
/* */ 
/* 167 */   bbox.setLowerLeftY(-(bbox.getLowerLeftY() + 65536.0F));
/* */ }
/* */ 
/* 170 */ float glyphHeight = bbox.getHeight() / 2.0F;
/* */ 
{code}

So, do you mean that the code from DrawPrintTextLocations should be added to 
LegacyPDFStreamEngine.java after *line number 170*?




[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-27 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222532#comment-16222532
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

This seems to be a moving target. Your original question seems to have been 
solved and it was not a bug, so I will close this issue soon.

That's why "how to" questions should not be on JIRA. In the future, please ask 
on the user mailing list (more flexible), or on Stack Overflow (best for very 
specific questions and has different people). Use JIRA only when told to, or 
when you know for sure that it's a bug.

Now you brought three new questions:
1) why is the red bound larger than the bounding box? I can't tell because you 
didn't attach a PDF. But this may be a bug, or at least a potential for 
improvement. If you can share the PDF then please open a new issue in JIRA and 
attach your file, I'll see what I can do.
2) why we use the red bounds: to decide whether some glyphs are on the same 
line or not. The red bounds are based on different values in the font 
descriptor, but sadly they are not always accurate. See in 
LegacyPDFStreamEngine.java after the line "font.getBoundingBox()", there's some 
voodoo being done to correct inaccurate values.
3) using the cyan glyph bounds for text extraction: yes, I suspect that this 
would be more accurate than the red bounds. However this won't work for type 3 
fonts, only vector fonts. We did discuss this (using the cyan bounds) among 
committers a few years ago and we suspect that accurate bounds are better, but 
nobody has implemented it. To test this, you'd need to take some of the code in 
DrawPrintTextLocations and use it in LegacyPDFStreamEngine.java where the 
{{glyphHeight}} is calculated. When done, run the text stripper tests and look 
at the differences. If you're satisfied, ask me for the additional test files 
(not in repository because of copyrights) and test with these. If you're going 
to implement this, please take it to the mailing list.




[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-26 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220584#comment-16220584
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

Sorry, my comment was useless. I reread your post. IMHO your 
calculations are OK, i.e. the Java Y coordinates are between 88 and 120, like 
you wrote. Then you wrote
{code}
And bounding box of the last line in that cell is : 102,114,59,7 and hence 
max-Y of that line becomes 121 (min-Y + height)
{code}
I don't know what you mean by "bounding box", i.e. how you calculated that. 
Let's have a look at the first "p" of the bottom line. DrawPrintTextLocations 
prints these lines:
{code}
String[102.0,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]p
String[108.672,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[115.343994,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=3.9960022]r
String[119.34,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[126.01199,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]g
String[132.68399,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=3.9960022]r
String[136.68,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]a
String[143.35199,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]p
String[150.02399,114.0 fs=12.0 xscale=12.0 height=6.936 space=3.3360004 width=6.671997]h
{code}
So the y is 114 (Java coordinate), and the height (which is not a real height, 
see the comment in the code of DrawPrintTextLocations) is almost 7. It is 
measured from the baseline, which is why all glyphs have the same y here. But 
because 114 is a Java y coordinate (not PDF), you must subtract the height from 
it to get the "high" position, which would be 114 - 6.936 = 107.064. Smaller 
Java y values = higher on your screen. Smaller PDF y values = lower on your 
screen.
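
The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, the numbers are the ones printed by DrawPrintTextLocations in this message, and the page height of 792 is the value the reporter mentions elsewhere in the thread.

```java
// Illustrative only; names are hypothetical and not part of PDFBox.
final class GlyphTopSketch
{
    // In Java (top-left origin) coordinates y grows downwards, so the top of a
    // glyph lies at a smaller y than its baseline.
    static double topJavaY(double baselineJavaY, double height)
    {
        return baselineJavaY - height;
    }

    // Converts a PDF y (origin at the bottom of the page) to a Java y (origin at
    // the top), assuming an unrotated page whose box starts at y = 0.
    static double pdfToJavaY(double pdfY, double pageHeight)
    {
        return pageHeight - pdfY;
    }

    public static void main(String[] args)
    {
        // top of the glyphs on the bottom line: 114 - 6.936
        System.out.println(topJavaY(114.0, 6.936));
        // a PDF baseline at y = 678 on a 792 pt page corresponds to Java y = 114
        System.out.println(pdfToJavaY(678.0, 792.0)); // prints 114.0
    }
}
```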

Now if you take the font bounding box, you get -166.0,-225.0,1000.0,931.0. This 
must be divided by 1000 (all fonts except type 3) and transformed with the text 
rendering matrix (here: a scaling by 12). So min y would be -2.7 and max y would 
be 11.172. Both would have to be subtracted from the baseline y.
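
As a worked check of these numbers, the following sketch scales the font bounding box by 1/1000 and by the font size, then converts the result to Java y values relative to the baseline at 114. It assumes the text rendering matrix is a pure scaling by 12 with no rotation or skew; the class and method names are hypothetical.

```java
// Illustrative only; names are hypothetical and not part of PDFBox.
final class FontBBoxSketch
{
    // Font units (1/1000 em, non-Type 3 fonts) to text space at the given font size.
    static double scale(double fontUnits, double fontSize)
    {
        return fontUnits / 1000.0 * fontSize;
    }

    // Java y grows downwards, so an offset above the baseline is subtracted.
    static double javaY(double baselineJavaY, double offsetAboveBaseline)
    {
        return baselineJavaY - offsetAboveBaseline;
    }

    public static void main(String[] args)
    {
        double minY = scale(-225.0, 12.0); // descent, about -2.7
        double maxY = scale(931.0, 12.0);  // ascent, about 11.172
        System.out.println("bottom (Java y): " + javaY(114.0, minY));
        System.out.println("top (Java y):    " + javaY(114.0, maxY));
    }
}
```

Under these assumptions the font bounding box spans Java y values from about 102.8 to about 116.7, which lies within the 88 to 120 range of the cell given in the issue description.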

Did this help? If not, please explain how you got the "bounding box of the last 
line in that cell".





[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-26 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220445#comment-16220445
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

Hello Tilman,
Thank you for the reply.

I tried the sample code from your link, but I still get the same results for 
the vertical and horizontal lines.





[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218893#comment-16218893
 ] 

Tilman Hausherr commented on PDFBOX-3970:
-

Just catching these values isn't enough, because these might have been 
transformed. Have a look here:
https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions/38933039
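
To illustrate why the raw operands are not enough: path operands are given in user space and have to be mapped through the current transformation matrix (CTM) before they can be compared with other coordinates. The sketch below uses java.awt.geom.AffineTransform with made-up matrix values; in a real PDFStreamEngine subclass the matrix would come from the graphics state rather than being hard-coded. The class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Illustrative only; names and matrix values are hypothetical.
final class CtmSketch
{
    // Maps a user-space point through the given transformation matrix.
    static Point2D transform(AffineTransform ctm, double x, double y)
    {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args)
    {
        // Hypothetical CTM: scale by 0.5, then translate by (10, 20).
        AffineTransform ctm = new AffineTransform(0.5, 0, 0, 0.5, 10, 20);
        Point2D p = transform(ctm, 100, 672);
        System.out.println(p.getX() + ", " + p.getY()); // prints 60.0, 356.0
    }
}
```

With an identity matrix the operands pass through unchanged, which is one possible reason the raw values can look correct for a simple test file.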





[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-25 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218205#comment-16218205
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

Hello Tilman,
I also checked the values with your example class *DrawPrintTextLocations*. 
The text coordinate values are the same.
*But I am more concerned with the cell in which that text is located.*

To process the vertical and horizontal lines of the cell, as I understand it, 
I need to process path operators such as *re*, *m*, *l*, etc. 
I have overridden the method *processOperator* in my own project to process 
those operators. 

Here is the method *processStreamOperators()* from the pdfbox class 
*PDFStreamEngine*.

{code:java}
private void processStreamOperators(PDContentStream contentStream)
        throws IOException
{
    List<COSBase> arguments = new ArrayList<COSBase>();
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    while (token != null)
    {
        if (token instanceof COSObject)
        {
            arguments.add(((COSObject) token).getObject());
        }
        else if (token instanceof Operator)
        {
            processOperator((Operator) token, arguments);
            arguments = new ArrayList<COSBase>();
        }
        else
        {
            arguments.add((COSBase) token);
        }
        token = parser.parseNextToken();
    }
}
{code}

As you can see in the above code, the coordinate values that I get for the 
cell's vertical and horizontal paths are in the variable *arguments*.
These arguments are processed in my overridden *processOperator()* method.
For example, here is my operator-processing condition (shown here only for the 
*re* operator): 
{code:java}
String operation = operator.getName();

if (operation.equals("re")) {
    if (configuration.needsExtractTables()) {
        // corner given by the operands (x, y)
        Point2D point1 = createPoint(page,
                getTransformation(this),
                PdfHelper.toDouble(arguments.get(0)),
                PdfHelper.toDouble(arguments.get(1)));
        // opposite corner (x + width, y + height)
        Point2D point2 = createPoint(page,
                getTransformation(this),
                PdfHelper.toDouble(arguments.get(0)) + PdfHelper.toDouble(arguments.get(2)),
                PdfHelper.toDouble(arguments.get(1)) + PdfHelper.toDouble(arguments.get(3)));

        // trace the four corners of the rectangle as a closed path
        graphicHandler.start(page, point1);
        graphicHandler.add(new Point2D.Double(point2.getX(), point1.getY()));
        graphicHandler.add(point2);
        graphicHandler.add(new Point2D.Double(point1.getX(), point2.getY()));
        graphicHandler.close();
    }
}
{code}

By definition, the *re* operator appends a rectangle to the current path as a 
complete subpath, with lower-left corner (x, y) and dimensions width and height 
in user space.

The values in the variable *arguments* that I received from pdfbox are: 
[COSInt{100}, COSInt{672}, COSInt{120}, COSInt{32}]. These are the operands of 
the *re* operator.

As per the PDF reference, the syntax of the *re* operator is:
*x y width height re* [4 operands before the operator re]
672 is the Y value measured from the bottom of the page by pdfbox.
When I subtract it from the page height, I get the Y value from the top of the 
page, which is 120. [Page height is 792]
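
The conversion described above can be written out as a small sketch. It assumes an identity transformation matrix and an unrotated page of height 792; the class and method names are hypothetical. With the operands 100 672 120 32 it yields x from 100 to 220 and top-origin y from 88 to 120, the same values as the four lines listed in the issue description.

```java
import java.util.Arrays;

// Illustrative only; names are hypothetical and not part of PDFBox.
final class ReOperatorSketch
{
    // Returns {minX, minY, maxX, maxY} with y measured from the top of the page.
    static double[] toTopOriginBounds(double x, double y, double width, double height,
                                      double pageHeight)
    {
        double minX = Math.min(x, x + width);
        double maxX = Math.max(x, x + width);
        // PDF y grows upwards, so the edge with the larger PDF y becomes the
        // smaller top-origin y
        double minY = pageHeight - Math.max(y, y + height);
        double maxY = pageHeight - Math.min(y, y + height);
        return new double[] { minX, minY, maxX, maxY };
    }

    public static void main(String[] args)
    {
        double[] bounds = toTopOriginBounds(100, 672, 120, 32, 792);
        System.out.println(Arrays.toString(bounds)); // prints [100.0, 88.0, 220.0, 120.0]
    }
}
```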

I hope this helps.

Thank you in advance!
Regards,
Navnath Kumbhar.





 
