PDFBox performance issue: TextPosition performance tweak
---------------------------------------------------------
Key: PDFBOX-599
URL: https://issues.apache.org/jira/browse/PDFBOX-599
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 0.8.0-incubator, 1.0.0
Environment: All
Reporter: Mel Martinez
During text extraction, the TextPosition.getX() and TextPosition.getY() methods
are invoked multiple times on each TextPosition object.
The current code recalculates these values each time the accessor is invoked,
even though the underlying state from which the values are derived has not
changed.
This repeated recalculation is wasted work and slows down extraction.
The getters (getX() and getY()) should be changed to retain the X and Y
attributes in instance fields and only calculate their values once.
Specifically, the following two fields should be added:
    private float x = Float.NEGATIVE_INFINITY;
    private float y = Float.NEGATIVE_INFINITY;
And the two methods changed to look like so:
    public float getX()
    {
        if (x == Float.NEGATIVE_INFINITY)
        {
            x = getXRot(rot);
        }
        return x;
    }

    public float getY()
    {
        if (y == Float.NEGATIVE_INFINITY)
        {
            if ((rot == 0) || (rot == 180))
            {
                y = pageHeight - getYLowerLeftRot(rot);
            }
            else
            {
                y = pageWidth - getYLowerLeftRot(rot);
            }
        }
        return y;
    }
This provides a very noticeable speedup in the text extraction.
I'll attach a version of the TextPosition.java class that includes this mod.
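For anyone who wants to see the caching effect in isolation, here is a minimal, self-contained sketch of the same sentinel-based lazy initialization. It is not the PDFBox class: CachedPosition, computeX(), computeY() and getComputeCount() are hypothetical names invented for this illustration, and the derivation is reduced to a simple subtraction with no rotation handling.

```java
// Hypothetical stand-in for TextPosition; NOT the actual PDFBox class.
class CachedPosition {
    private final float rawX;
    private final float rawYLowerLeft;
    private final float pageHeight;

    // Sentinel meaning "not computed yet", as in the proposal above.
    private float x = Float.NEGATIVE_INFINITY;
    private float y = Float.NEGATIVE_INFINITY;

    // Counts how often a derivation actually runs (demonstration only).
    private int computeCount = 0;

    CachedPosition(float rawX, float rawYLowerLeft, float pageHeight) {
        this.rawX = rawX;
        this.rawYLowerLeft = rawYLowerLeft;
        this.pageHeight = pageHeight;
    }

    // Assumed placeholder for the real X derivation.
    private float computeX() {
        computeCount++;
        return rawX;
    }

    // Assumed placeholder for the real Y derivation (unrotated page only).
    private float computeY() {
        computeCount++;
        return pageHeight - rawYLowerLeft;
    }

    public float getX() {
        if (x == Float.NEGATIVE_INFINITY) {
            x = computeX(); // computed on first access only
        }
        return x;
    }

    public float getY() {
        if (y == Float.NEGATIVE_INFINITY) {
            y = computeY(); // computed on first access only
        }
        return y;
    }

    public int getComputeCount() {
        return computeCount;
    }

    public static void main(String[] args) {
        CachedPosition p = new CachedPosition(72f, 100f, 792f);
        for (int i = 0; i < 1000; i++) {
            p.getX();
            p.getY();
        }
        // prints "72.0 692.0 2": each derivation ran once despite 1000 calls
        System.out.println(p.getX() + " " + p.getY() + " " + p.getComputeCount());
    }
}
```

Two caveats with this approach: a derivation that legitimately yields Float.NEGATIVE_INFINITY would be recomputed on every call, and the cached value goes stale if the underlying state is ever modified after the first access. Neither should matter if the position data is fixed once the object is constructed, which is the premise of this issue.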