[jira] [Commented] (PDFBOX-3986) Bounding box of mathematical symbols are not proper

2017-11-01 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16234162#comment-16234162
 ] 

Navnath Kumbhar commented on PDFBOX-3986:
-

What do you mean by *_Font itself insists to do that?_*

> Bounding box of mathematical symbols are not proper
> ---
>
> Key: PDFBOX-3986
> URL: https://issues.apache.org/jira/browse/PDFBOX-3986
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.7
> Environment: Windows 7 (64 bit)
>Reporter: Navnath Kumbhar
>Priority: Major
> Attachments: PDFBOX-3986-reduced.pdf, formula-marked-34.png, 
> formula-marked-37.png, formula.pdf
>
>
> Hello Support Team,
> I am working on a task where I have to extract formulas from a PDF document and 
> convert them into images.
> But when I extract them using PDFBox, some symbols, such as *Summation*, 
> *Integral*, or *Big Parenthesis*, are mixed up with the previous line.
> I checked the output of the DrawPrintTextLocations example with that particular 
> PDF document, and the result does not look normal.
> The red boxes are not aligned properly in the output, as you will see in the 
> attached files.
> I am attaching the output of two pages and the PDF document itself.
> *Please refer to page 34 or 37 for this issue.*
> Thank you in advance!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-11-01 Thread Navnath Kumbhar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navnath Kumbhar updated PDFBOX-3970:

Attachment: simpleAnnotation.pdf

Hello Tilman,

Thank you for pointing out the right code snippet. I have made some changes in 
LegacyPDFStreamEngine.java.

Below is my code change:


{code:java}
@Override
protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode,
                         Vector displacement) throws IOException
{
    //
    // legacy calculations which were previously in PDFStreamEngine
    //
    //  DO NOT USE THIS CODE UNLESS YOU ARE WORKING WITH PDFTextStripper.
    //  THIS CODE IS DELIBERATELY INCORRECT
    //

    PDGraphicsState state = getGraphicsState();
    Matrix ctm = state.getCurrentTransformationMatrix();
    float fontSize = state.getTextState().getFontSize();
    float horizontalScaling = state.getTextState().getHorizontalScaling() / 100f;
    Matrix textMatrix = getTextMatrix();

    // note: getActualGlyphBoundingBox() returns null when no glyph path is found;
    // that case is not handled here
    Shape glyphShape = getActualGlyphBoundingBox(textRenderingMatrix, font, code);

    BoundingBox bbox = new BoundingBox((float) glyphShape.getBounds2D().getMinX(),
                                       (float) glyphShape.getBounds2D().getMinY(),
                                       (float) glyphShape.getBounds2D().getMaxX(),
                                       (float) glyphShape.getBounds2D().getMaxY());
    if (bbox.getLowerLeftY() < Short.MIN_VALUE)
    {
        // PDFBOX-2158 and PDFBOX-3130
        // files by Salmat eSolutions / ClibPDF Library
        bbox.setLowerLeftY(- (bbox.getLowerLeftY() + 65536));
    }
    // 1/2 the bbox is used as the height todo: why?
    float glyphHeight = bbox.getHeight() / 2;

    /*PDFontDescriptor fontDescriptor = font.getFontDescriptor();
    if (fontDescriptor != null)
    {
        float capHeight = fontDescriptor.getCapHeight();
        if (capHeight != 0 && (capHeight < glyphHeight || glyphHeight == 0))
        {
            glyphHeight = capHeight;
        }
    }*/

    // transformPoint from glyph space -> text space
    float height;
    if (font instanceof PDType3Font)
    {
        height = font.getFontMatrix().transformPoint(0, glyphHeight).y;
    }
    else
    {
        height = glyphHeight / 1000;
    }

.
.
.
}

{code}

And here is the *getActualGlyphBoundingBox()* method:



{code:java}
private Shape getActualGlyphBoundingBox(Matrix textRenderingMatrix, PDFont font, int code)
        throws IOException
{
    GeneralPath path = null;
    AffineTransform at = textRenderingMatrix.createAffineTransform();
    at.concatenate(font.getFontMatrix().createAffineTransform());
    if (font instanceof PDType3Font)
    {
        PDType3Font t3Font = (PDType3Font) font;
        PDType3CharProc charProc = t3Font.getCharProc(code);
        if (charProc != null)
        {
            PDRectangle glyphBBox = charProc.getGlyphBBox();
            if (glyphBBox != null)
            {
                path = glyphBBox.toGeneralPath();
            }
        }
    }
    else if (font instanceof PDVectorFont)
    {
        PDVectorFont vectorFont = (PDVectorFont) font;
        path = vectorFont.getPath(code);

        if (font instanceof PDTrueTypeFont)
        {
            PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
            int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
            at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        }
        if (font instanceof PDType0Font)
        {
            PDType0Font t0font = (PDType0Font) font;
            if (t0font.getDescendantFont() instanceof PDCIDFontType2)
            {
                int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont())
                        .getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
        }
    }
    else if (font instanceof PDSimpleFont)
    {
        PDSimpleFont simpleFont = (PDSimpleFont) font;

        // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
        // which is why PDVectorFont is tried first.
        String name = simpleFont.getEncoding().getName(code);
        path = simpleFont.getPath(name);
    }
    else
    {
        // shouldn't happen, please open issue in JIRA
        System.out.println("Unknown font class: " + font.getClass());
    }
    if (path == null)
    {
        return null;
    }

    //return at.createTransformedShape(path.getBounds2D());
    return path.getBounds2D();
}
{code}
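
The method above builds the transform *at* but returns the untransformed bounds, so the result stays in glyph space. As a rough illustration of what the commented-out return statement would do, here is a minimal sketch that uses only java.awt.geom and no PDFBox classes. The class name, the method name, and all numeric values are assumptions made up for the example; they are not taken from PDFBox or from the attached PDF.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

public class GlyphBoundsSketch {

    /**
     * Maps a glyph-space rectangle to page space: the font matrix is applied
     * first, the text rendering matrix second.
     */
    public static Rectangle2D toPageSpace(Rectangle2D glyphBounds,
                                          AffineTransform textRenderingMatrix,
                                          AffineTransform fontMatrix) {
        AffineTransform at = new AffineTransform(textRenderingMatrix);
        at.concatenate(fontMatrix); // at = textRenderingMatrix x fontMatrix
        return at.createTransformedShape(glyphBounds).getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical glyph: 500 units wide, reaching from 200 units below
        // the baseline to 700 units above it.
        Rectangle2D glyphBounds = new Rectangle2D.Double(0, -200, 500, 900);
        // Font matrix for a font with 1000 units per em.
        AffineTransform fontMatrix = AffineTransform.getScaleInstance(0.001, 0.001);
        // Hypothetical text rendering matrix: 12 pt text placed at (100, 400).
        AffineTransform trm = new AffineTransform(12, 0, 0, 12, 100, 400);

        System.out.println(toPageSpace(glyphBounds, trm, fontMatrix));
    }
}
```

With glyph-space bounds the height is still expressed in 1/1000 em, which is why the modified showGlyph divides by 1000 afterwards; with page-space bounds that division would no longer apply.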


I am getting satisfactory results for text 

[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-31 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226806#comment-16226806
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

You mean in *showGlyph* method?
I have copied some code from *showGlyph* method.
{code:java}
/* 162 */ BoundingBox bbox = font.getBoundingBox();
/* 163 */ if (bbox.getLowerLeftY() < -32768.0F)
/* */ {
/* */ 
/* */ 
/* 167 */   bbox.setLowerLeftY(-(bbox.getLowerLeftY() + 65536.0F));
/* */ }
/* */ 
/* 170 */ float glyphHeight = bbox.getHeight() / 2.0F;
/* */ 
{code}

So, do you mean that I should add the code from DrawPrintTextLocations to 
LegacyPDFStreamEngine.java after *line number 170*?
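
To make the arithmetic of the quoted snippet easier to follow, here is a minimal sketch of the same two steps on plain floats: the correction applied to an implausibly small lower-left Y, and the half-height heuristic. The class and method names are hypothetical, and the division by 1000 assumes a font that is not Type 3, as in the changed showGlyph posted on this issue.

```java
public class LegacyGlyphHeightSketch {

    /** Mirrors the snippet: values below Short.MIN_VALUE are mapped to -(y + 65536). */
    public static float correctLowerLeftY(float lowerLeftY) {
        if (lowerLeftY < Short.MIN_VALUE) {
            return -(lowerLeftY + 65536);
        }
        return lowerLeftY;
    }

    /**
     * Half of the bounding box height, converted from glyph space (1/1000 em)
     * to text space. Assumption: the font is not a Type 3 font.
     */
    public static float legacyHeight(float lowerLeftY, float upperRightY) {
        float bboxHeight = upperRightY - correctLowerLeftY(lowerLeftY);
        float glyphHeight = bboxHeight / 2;
        return glyphHeight / 1000;
    }

    public static void main(String[] args) {
        // Hypothetical font bounding box from y = -200 to y = 800.
        System.out.println(legacyHeight(-200f, 800f));
        // The same box with a lower-left Y that triggers the correction.
        System.out.println(legacyHeight(-65336f, 800f));
    }
}
```

Both calls use the same effective box, so they yield the same height. The value depends only on the font-wide box, not on the glyph being shown.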

> x,y co-ordinates of the text inside the cell are not getting correctly.
> ---
>
> Key: PDFBOX-3970
> URL: https://issues.apache.org/jira/browse/PDFBOX-3970
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.7
> Environment: Operating system: Windows 7 (64 bit).
>Reporter: Navnath Kumbhar
>  Labels: how-to
> Attachments: formula-marked-34.png, 
> paragraphNextToTable-marked-1.png, paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project which parses a whole PDF document and stores the 
> extracted text in a .txt file which can be read by another product.
> My issue concerns extracting the text inside the cell of a table: 
> *the x,y coordinates of the text inside the cell are not reported correctly.*
> The Y value of the last text line in the cell is larger than the cell's max-Y 
> value.
> I have attached the test file with this bug.
> As you can see in the test document, there is one cell with text in it 
> and a text paragraph next to that cell.
> The x,y coordinates that I get from PDFBox for all the paths (two vertical and 
> two horizontal lines) of the cell are:
> (in x1,y1,x2,y2 format)
> Horizontal line 1: [100,88,220,88]
> Horizontal line 2: [100,120,220,120]
> Vertical line 1: [100,88,100,120]
> Vertical line 2: [220,88,220,120]
> (The Y values of the above paths are final values, obtained by subtracting the 
> value given by PDFBox from the height of the page, because for paths the 
> y-values are measured from bottom to top.)
> The bounding box of the last line in that cell is [102,114,59,7], and hence 
> the max-Y of that line is 121 (min-Y + height).
>  
> So, if we compare the max-Y value of that cell (i.e. 120) with that of the last 
> line in that cell (i.e. 121), that line clearly extends outside the cell.
> What can be the possible reason for this?
> Thank you in advance!
> Regards,
> Navnath Kumbhar
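
The arithmetic in the quoted report can be checked mechanically. The following is a minimal sketch using java.awt.geom only; the class and method names are hypothetical, and the page height of 792 used to illustrate the y flip is an assumption, since the report does not state it.

```java
import java.awt.geom.Rectangle2D;

public class CellContainmentSketch {

    /** Converts a bottom-up PDF y value to a top-down y value. */
    public static double flipY(double pageHeight, double y) {
        return pageHeight - y;
    }

    /** Distance by which the text box extends past the cell's max-Y (0 if it fits). */
    public static double overflowBelow(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        // Cell built from the four reported lines: x 100..220, y 88..120.
        Rectangle2D cell = new Rectangle2D.Double(100, 88, 220 - 100, 120 - 88);
        // Reported text bounding box, read as [x, y, width, height].
        Rectangle2D text = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println("text max-Y = " + text.getMaxY());                  // 121.0
        System.out.println("overflow   = " + overflowBelow(cell, text));       // 1.0
        System.out.println("contained  = " + cell.contains(text));             // false

        // Assumed page height of 792: a path y of 704 becomes 88 after the flip.
        System.out.println("flipped y  = " + flipY(792, 704));                 // 88.0
    }
}
```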






[jira] [Commented] (PDFBOX-3986) Bounding box of mathematical symbols are not proper

2017-10-31 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226656#comment-16226656
 ] 

Navnath Kumbhar commented on PDFBOX-3986:
-

I am using PDFBox version 2.0.7







[jira] [Updated] (PDFBOX-3986) Bounding box of mathematical symbols are not proper

2017-10-31 Thread Navnath Kumbhar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navnath Kumbhar updated PDFBOX-3986:

Description: 
Hello Support Team,

I am working on a task where I have to extract formulas from a PDF document and 
convert them into images.

But when I extract them using PDFBox, some symbols, such as *Summation*, 
*Integral*, or *Big Parenthesis*, are mixed up with the previous line.

I checked the output of the DrawPrintTextLocations example with that particular PDF 
document, and the result does not look normal.
The red boxes are not aligned properly in the output, as you can see.

I am attaching the output of two pages and the PDF document itself.

*Please refer to page 34 or 37 for this issue.*

Thank you in advance!

  was:
Hello Support Team,

I am working on a task where I have to extract formulas from a PDF document and 
convert them into images.

But when I extract them using PDFBox, some symbols, such as *Summation*, 
*Integral*, or *Big Parenthesis*, are mixed up with the previous line.

I checked the output of the DrawPrintTextLocations example with that particular PDF 
document.
I am attaching the output of two pages and the PDF document itself.

*Please refer to page 34 or 37 for this issue.*

Thank you in advance!








[jira] [Updated] (PDFBOX-3986) Bounding box of mathematical symbols are not proper

2017-10-31 Thread Navnath Kumbhar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navnath Kumbhar updated PDFBOX-3986:

Description: 
Hello Support Team,

I am working on a task where I have to extract formulas from a PDF document and 
convert them into images.

But when I extract them using PDFBox, some symbols, such as *Summation*, 
*Integral*, or *Big Parenthesis*, are mixed up with the previous line.

I checked the output of the DrawPrintTextLocations example with that particular PDF 
document, and the result does not look normal.
The red boxes are not aligned properly in the output, as you will see in the 
attached files.

I am attaching the output of two pages and the PDF document itself.

*Please refer to page 34 or 37 for this issue.*

Thank you in advance!

  was:
Hello Support Team,

I am working on a task where I have to extract formulas from a PDF document and 
convert them into images.

But when I extract them using PDFBox, some symbols, such as *Summation*, 
*Integral*, or *Big Parenthesis*, are mixed up with the previous line.

I checked the output of the DrawPrintTextLocations example with that particular PDF 
document, and the result does not look normal.
The red boxes are not aligned properly in the output, as you can see.

I am attaching the output of two pages and the PDF document itself.

*Please refer to page 34 or 37 for this issue.*

Thank you in advance!








[jira] [Created] (PDFBOX-3986) Bounding box of mathematical symbols are not proper

2017-10-31 Thread Navnath Kumbhar (JIRA)
Navnath Kumbhar created PDFBOX-3986:
---

 Summary: Bounding box of mathematical symbols are not proper
 Key: PDFBOX-3986
 URL: https://issues.apache.org/jira/browse/PDFBOX-3986
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
 Environment: Windows 7 (64 bit)
Reporter: Navnath Kumbhar
 Attachments: formula-marked-34.png, formula-marked-37.png, formula.pdf

Hello Support Team,

I am working on a task where I have to extract formulas from a PDF document and 
convert them into images.

But when I extract them using PDFBox, some symbols, such as *Summation*, 
*Integral*, or *Big Parenthesis*, are mixed up with the previous line.

I checked the output of the DrawPrintTextLocations example with that particular PDF 
document.
I am attaching the output of two pages and the PDF document itself.

*Please refer to page 34 or 37 for this issue.*

Thank you in advance!






[jira] [Comment Edited] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-27 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222132#comment-16222132
 ] 

Navnath Kumbhar edited comment on PDFBOX-3970 at 10/27/17 10:32 AM:


Hello Tilman,

Thank you for your helpful feedback. I tried your suggestion, 
except for the font bounding box part, and it worked well.

But it has a limitation on some PDF pages where I am trying to extract 
mathematical formulas. I am attaching the result of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *Summation* symbols 
are mixed up with the previous lines (as far as the red rectangles are 
concerned, whose coordinate values are computed by Java, as you mentioned in 
your last comment).

Generally, I see that the red box lies inside the font bounding box, but that is 
not the case in the attached example.

Can we just use the glyph bounds [the ones in cyan] to extract text, since they 
look like the perfect bounds for the TextPosition? If so, can we do it with the 
TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need that red rectangle [which is not real but only a heuristic to 
extract the text]?

Thank you again for your help!


was (Author: navnath@3ds):
Hello Tilman,

Thank you for your helpful feedback. I tried your suggestion, 
except for the font bounding box part, and it worked well.

But it has a limitation on some PDF pages where I am trying to extract 
mathematical formulas. I am attaching the result of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *Summation* symbols 
are mixed up with the previous lines (as far as the red rectangles are 
concerned, whose coordinate values are computed by Java, as you mentioned in 
your last comment).

Generally, I see that the red box lies inside the font bounding box, but that is 
not the case in the attached example.

Can we just use the glyph bounds to extract text, since they 
look like the perfect bounds for the TextPosition? If so, can we do it with the 
TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need that red rectangle [which is not real but only a heuristic to 
extract the text]?

Thank you again for your help!












[jira] [Comment Edited] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-27 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222132#comment-16222132
 ] 

Navnath Kumbhar edited comment on PDFBOX-3970 at 10/27/17 10:27 AM:


Hello Tilman,

Thank you for your helpful feedback. I tried your suggestion, 
except for the font bounding box part, and it worked well.

But it has a limitation on some PDF pages where I am trying to extract 
mathematical formulas. I am attaching the result of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *Summation* symbols 
are mixed up with the previous lines (as far as the red rectangles are 
concerned, whose coordinate values are computed by Java, as you mentioned in 
your last comment).

Generally, I see that the red box lies inside the font bounding box, but that is 
not the case in the attached example.

Can we just use the glyph bounds to extract text, since they 
look like the perfect bounds for the TextPosition? If so, can we do it with the 
TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need that red rectangle [which is not real but only a heuristic to 
extract the text]?

Thank you again for your help!


was (Author: navnath@3ds):
Hello Tilman,

Thank you for your helpful feedback. I tried your suggestion, 
except for the font bounding box part, and it worked well.

But it has a limitation on some PDF pages where I am trying to extract 
mathematical formulas. I am attaching the result of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *Summation *symbols 
are mixed up with the previous lines (as far as the red rectangles are 
concerned, whose coordinate values are computed by Java, as you mentioned in 
your last comment).

Generally, I see that the red box lies inside the font bounding box, but that is 
not the case in the attached example.

Can we just use the glyph bounds to extract text, since they 
look like the perfect bounds for the TextPosition? If so, can we do it with the 
TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need that red rectangle [which is not real but only a heuristic to 
extract the text]?

Thank you again for your help!












[jira] [Updated] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-27 Thread Navnath Kumbhar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navnath Kumbhar updated PDFBOX-3970:

Attachment: formula-marked-34.png

Hello Tilman,

Thank you for your helpful feedback. I tried your suggestion, 
except for the font bounding box part, and it worked well.

But it has a limitation on some PDF pages where I am trying to extract 
mathematical formulas. I am attaching the result of 
*DrawPrintTextLocations* for one such page.
As you will see in the attachment, the *big parenthesis* and *Summation* symbols 
are mixed up with the previous lines (as far as the red rectangles are 
concerned, whose coordinate values are computed by Java, as you mentioned in 
your last comment).

Generally, I see that the red box lies inside the font bounding box, but that is 
not the case in the attached example.

Can we just use the glyph bounds to extract text, since they 
look like the perfect bounds for the TextPosition? If so, can we do it with the 
TextPosition class?
If not, what other heuristic can we use in such cases?
Why do we need that red rectangle [which is not real but only a heuristic to 
extract the text]?

Thank you again for your help!












[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-26 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220445#comment-16220445
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

Hello Tilman,
Thank you for the reply.

I tried the sample code from your link, but I still get the same results for 
the vertical and horizontal lines.








[jira] [Comment Edited] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-25 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218205#comment-16218205
 ] 

Navnath Kumbhar edited comment on PDFBOX-3970 at 10/25/17 7:55 AM:
---

Hello Tilman,
I also checked the values with your example class *DrawPrintTextLocations*. 
The text coordinate values are the same.
*But I am more concerned with the cell in which that text is located.*

To process the vertical and horizontal lines of the cell, as I understand it, 
I need to process the path operators such as *re*, *m*, *l*, etc. 
I have overridden the method *processOperator* in my own project to handle 
those operators. 

Here is the method *processStreamOperators()* from the PDFBox class 
*PDFStreamEngine*.

{code:java}
private void processStreamOperators(PDContentStream contentStream)
        throws IOException
{
    List<COSBase> arguments = new ArrayList<COSBase>();
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    while (token != null)
    {
        if (token instanceof COSObject)
        {
            arguments.add(((COSObject) token).getObject());
        }
        else if (token instanceof Operator)
        {
            processOperator((Operator) token, arguments);
            arguments = new ArrayList<COSBase>();
        }
        else
        {
            arguments.add((COSBase) token);
        }
        token = parser.parseNextToken();
    }
}
{code}

As you can see in the above code, the coordinate values that I get for the 
cell's vertical and horizontal paths are in the variable *arguments*.
These arguments are then processed in my overridden *processOperator()* method.
For example, here is my handling of the *re* operator (other operators are 
omitted): 
{code:java}
@Override
protected void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    String operation = operator.getName();

    // Only the handling of the "re" (rectangle) operator is shown here.
    if (operation.equals("re")) {
        if (configuration.needsExtractTables()) {
            // Corner built from the operands (x, y).
            Point2D point1 = createPoint(page,
                    getTransformation(this),
                    PdfHelper.toDouble(arguments.get(0)),
                    PdfHelper.toDouble(arguments.get(1)));
            // Opposite corner built from (x + width, y + height).
            Point2D point2 = createPoint(page,
                    getTransformation(this),
                    PdfHelper.toDouble(arguments.get(0)) + PdfHelper.toDouble(arguments.get(2)),
                    PdfHelper.toDouble(arguments.get(1)) + PdfHelper.toDouble(arguments.get(3)));

            // Pass the four corners of the rectangle to the graphic handler.
            graphicHandler.start(page, point1);
            graphicHandler.add(new Point2D.Double(point2.getX(), point1.getY()));
            graphicHandler.add(point2);
            graphicHandler.add(new Point2D.Double(point1.getX(), point2.getY()));
            graphicHandler.close();
        }
    }
}
{code}

By definition, the *re* operator appends a rectangle to the current path as a 
complete subpath, with lower-left corner (x, y) and dimensions width and height 
in user space.
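
As a minimal sketch of that definition: the PDF specification describes *x y width height re* as equivalent to a move-to followed by three line-to operations and a close-path. The class and method names below are hypothetical and not part of PDFBox, and the sketch assumes an identity transformation matrix, so the points stay in user space with the origin at the bottom left.

```java
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a PDFBox class): expands the operands of
// "x y width height re" into the four corner points of the subpath.
// Assumes an identity CTM, so the result is in PDF user space (y grows upwards).
class RectangleCorners {
    static List<Point2D> corners(double x, double y, double width, double height) {
        List<Point2D> points = new ArrayList<>();
        points.add(new Point2D.Double(x, y));                  // x y m
        points.add(new Point2D.Double(x + width, y));          // (x + width) y l
        points.add(new Point2D.Double(x + width, y + height)); // (x + width) (y + height) l
        points.add(new Point2D.Double(x, y + height));         // x (y + height) l, then h closes the subpath
        return points;
    }

    public static void main(String[] args) {
        // Operands reported for the cell in this issue: 100 672 120 32 re
        for (Point2D p : corners(100, 672, 120, 32)) {
            System.out.println(p.getX() + ", " + p.getY());
        }
    }
}
```

For the operands reported in this issue, the corners are (100, 672), (220, 672), (220, 704) and (100, 704) in bottom-up user space.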

The values in the variable *arguments* that I received from PDFBox are: 
[COSInt{100}, COSInt{672}, COSInt{120}, COSInt{32}]. These values are simply 
the operands of the *re* operator.

As per the PDF Reference, the syntax of the *re* operator is:
*x y width height re* [4 operands before the operator re]
672 is the Y value measured from the bottom of the page by PDFBox.
When I subtract it from the page height, I get the Y value from the top of the 
page, which is 120. [Page height is 792]
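
The subtraction above can be sketched for the whole rectangle. This is only an illustration: `TopDownRect` and `fromRe` are hypothetical names, and the sketch assumes non-negative width and height, an unrotated page whose origin is at (0, 0), and no additional transformation matrix.

```java
// Hypothetical helper (not a PDFBox class): converts the operands of a "re"
// operator from bottom-up PDF user space to top-down page coordinates.
// Assumes width >= 0, height >= 0, no page rotation and an identity CTM.
class TopDownRect {
    final double minX;
    final double minY;
    final double maxX;
    final double maxY;

    TopDownRect(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    static TopDownRect fromRe(double x, double y, double width, double height, double pageHeight) {
        // The upper edge in PDF space (y + height) becomes the smaller top-down Y value.
        return new TopDownRect(x, pageHeight - (y + height), x + width, pageHeight - y);
    }

    public static void main(String[] args) {
        TopDownRect cell = fromRe(100, 672, 120, 32, 792);
        // prints 100.0,88.0,220.0,120.0
        System.out.println(cell.minX + "," + cell.minY + "," + cell.maxX + "," + cell.maxY);
    }
}
```

With the values from this issue, the result agrees with the line coordinates given in the description: x from 100 to 220 and top-down Y from 88 to 120.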

I hope this helps.

Thank you in advance!
Regards,
Navnath Kumbhar.





 



[jira] [Commented] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-25 Thread Navnath Kumbhar (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-3970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16218205#comment-16218205
 ] 

Navnath Kumbhar commented on PDFBOX-3970:
-

Hello Tilman,
I also checked the values with your example class *DrawPrintTextLocations*. 
The text coordinate values are the same.
*But I am more concerned with the cell in which that text is located.*

To process the vertical and horizontal lines of the cell, as I understand it, 
I need to process the path operators such as *re*, *m*, *l*, etc. 
I have overridden the method *processOperator* in my own project to handle 
those operators. 

Here is the method *processStreamOperators()* from the PDFBox class 
*PDFStreamEngine*.

{code:java}
private void processStreamOperators(PDContentStream contentStream)
        throws IOException
{
    List<COSBase> arguments = new ArrayList<COSBase>();
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    while (token != null)
    {
        if (token instanceof COSObject)
        {
            arguments.add(((COSObject) token).getObject());
        }
        else if (token instanceof Operator)
        {
            processOperator((Operator) token, arguments);
            arguments = new ArrayList<COSBase>();
        }
        else
        {
            arguments.add((COSBase) token);
        }
        token = parser.parseNextToken();
    }
}
{code}

As you can see in the above code, the coordinate values that I get for the 
cell's vertical and horizontal paths are in the variable *arguments*.
These arguments are then processed in my overridden *processOperator()* method.
For example, here is my handling of the *re* operator (other operators are 
omitted): 
{code:java}
String operation = operator.getName();

if (operation.equals("re")) {
    if (configuration.needsExtractTables()) {
        Point2D point1 = createPoint(page,
                getTransformation(this),
                PdfHelper.toDouble(arguments.get(0)),
                PdfHelper.toDouble(arguments.get(1)));
        Point2D point2 = createPoint(page,
                getTransformation(this),
                PdfHelper.toDouble(arguments.get(0)) + PdfHelper.toDouble(arguments.get(2)),
                PdfHelper.toDouble(arguments.get(1)) + PdfHelper.toDouble(arguments.get(3)));

        graphicHandler.start(page, point1);
        graphicHandler.add(new Point2D.Double(point2.getX(), point1.getY()));
        graphicHandler.add(point2);
        graphicHandler.add(new Point2D.Double(point1.getX(), point2.getY()));
        graphicHandler.close();
    }
}
{code}

By definition, the *re* operator appends a rectangle to the current path as a 
complete subpath, with lower-left corner (x, y) and dimensions width and height 
in user space.

The values in the variable *arguments* that I received from PDFBox are: 
[COSInt{100}, COSInt{672}, COSInt{120}, COSInt{32}]. These values are simply 
the operands of the *re* operator.

As per the PDF Reference, the syntax of the *re* operator is:
*x y width height re* [4 operands before the operator re]
672 is the Y value measured from the bottom of the page by PDFBox.
When I subtract it from the page height, I get the Y value from the top of the 
page, which is 120. [Page height is 792]

I hope this helps.

Thank you in advance!
Regards,
Navnath Kumbhar.





 

> x,y co-ordinates of the text inside the cell are not getting correctly.
> ---
>
> Key: PDFBOX-3970
> URL: https://issues.apache.org/jira/browse/PDFBOX-3970
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 2.0.7
> Environment: Operating system: Windows 7 (64 bit).
>Reporter: Navnath Kumbhar
> Attachments: paragraphNextToTable-marked-1.png, 
> paragraphNextToTable.pdf
>
>
> Hello Support Team,
> I am working on a project that parses a whole PDF document and stores the 
> extracted text in a .txt file that can be read by another product.
> My issue concerns extracting the text inside a table cell: 
> *the x,y coordinates of the text inside the cell are not reported correctly.*
> The Y value of the last text line in the cell is larger than the cell's max-Y 
> value.
> I have attached a test file that shows this bug.
> As you can see in the test document, there 

[jira] [Created] (PDFBOX-3970) x,y co-ordinates of the text inside the cell are not getting correctly.

2017-10-18 Thread Navnath Kumbhar (JIRA)
Navnath Kumbhar created PDFBOX-3970:
---

 Summary: x,y co-ordinates of the text inside the cell are not 
getting correctly.
 Key: PDFBOX-3970
 URL: https://issues.apache.org/jira/browse/PDFBOX-3970
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 2.0.7
 Environment: Operating system: Windows 7 (64 bit).
Reporter: Navnath Kumbhar
 Attachments: paragraphNextToTable.pdf

Hello Support Team,

I am working on a project that parses a whole PDF document and stores the 
extracted text in a .txt file that can be read by another product.

My issue concerns extracting the text inside a table cell: 
*the x,y coordinates of the text inside the cell are not reported correctly.*
The Y value of the last text line in the cell is larger than the cell's max-Y 
value.

I have attached a test file that shows this bug.

As you can see in the test document, there is one cell with text in it 
and a text paragraph next to that cell.

The x,y coordinates that I get from PDFBox for the paths of the cell (two 
vertical and two horizontal lines) are:
(in x1,y1,x2,y2 format)
Horizontal line 1: [100,88,220,88]
Horizontal line 2: [100,120,220,120]
Vertical line 1 : [100,88,100,120]
Vertical line 2: [220,88,220,120]

(The Y values of the above paths are obtained by subtracting the value 
given by PDFBox from the page height, because path y-values are 
measured from the bottom of the page upwards.)

The bounding box of the last line in that cell is [102,114,59,7], so the 
max-Y of that line is 121 (min-Y + height).
 
So, comparing the max-Y of the cell (120) with that of the last line 
in the cell (121), the line clearly extends outside the cell.
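
The comparison can be reproduced with the standard `java.awt.geom.Rectangle2D` class. This is only a sketch of the arithmetic, not an explanation of the cause. It assumes the text bounding box [102,114,59,7] is in (x, y, width, height) order with Y measured from the top of the page, and `CellContainmentCheck` and `bottomOverflow` are hypothetical names.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: compares a text bounding box with a cell rectangle,
// both expressed in top-down page coordinates.
class CellContainmentCheck {
    // Returns how far the text box extends below the cell's lower edge, or 0 if it does not.
    static double bottomOverflow(Rectangle2D cell, Rectangle2D text) {
        return Math.max(0.0, text.getMaxY() - cell.getMaxY());
    }

    public static void main(String[] args) {
        // Cell from the four lines above: x 100..220, y 88..120.
        Rectangle2D cell = new Rectangle2D.Double(100, 88, 120, 32);
        // Last text line in the cell (assumed x, y, width, height order).
        Rectangle2D text = new Rectangle2D.Double(102, 114, 59, 7);

        System.out.println(cell.contains(text));        // prints false
        System.out.println(bottomOverflow(cell, text)); // prints 1.0
    }
}
```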

What can be the possible reason for this?

Thank you in advance!
Regards,
Navnath Kumbhar




