PDFBox performance issue: PDFTextStripper performance tweak
------------------------------------------------------------
Key: PDFBOX-600
URL: https://issues.apache.org/jira/browse/PDFBOX-600
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 0.8.0-incubator
Environment: All
Reporter: Mel Martinez
During text extraction, the PDFTextStripper needs to calculate the proximity of
text positions in order to determine whether text elements overlap either
vertically or horizontally.
As part of this, the PDFTextStripper.within(float first, float second, float
variance) method is used.
The current (0.8.0) version of this method uses the following test:
    second > first - variance && second < first + variance
This is accurate, but in my test documents it is slower than the same test with
the order flipped:
    second < first + variance && second > first - variance
This is because the flipped form fails out faster on left-to-right text: when
its first comparison is false, && short-circuits and the second comparison is
never evaluated. I believe left-to-right text should be treated as the default
case.
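To illustrate the short-circuit effect, here is a minimal sketch rather than
PDFBox code. The class WithinOrderDemo, the comparison counter, and the input
values are hypothetical. It assumes that "second" usually lies beyond
first + variance, which is how I read the left-to-right case; real timings will
also depend on the JIT and on the documents processed.

```java
public class WithinOrderDemo {
    // Counts how many float comparisons were actually evaluated (illustration only).
    static long comparisons = 0;

    static boolean greater(float a, float b) { comparisons++; return a > b; }
    static boolean less(float a, float b)    { comparisons++; return a < b; }

    // Form used in 0.8.0: lower bound is checked first.
    public static boolean withinOriginal(float first, float second, float variance) {
        return greater(second, first - variance) && less(second, first + variance);
    }

    // Proposed form: upper bound is checked first.
    public static boolean withinReordered(float first, float second, float variance) {
        return less(second, first + variance) && greater(second, first - variance);
    }

    // Runs n hypothetical checks in which second lies well beyond first + variance
    // and returns the number of float comparisons that were evaluated.
    public static long countComparisons(boolean reordered, int n) {
        comparisons = 0;
        float first = 0f;
        float variance = 0.5f;
        for (int i = 1; i <= n; i++) {
            float second = first + 10f * i; // assumed input: always past the upper bound
            if (reordered) {
                withinReordered(first, second, variance);
            } else {
                withinOriginal(first, second, variance);
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println("original order:  " + countComparisons(false, 1000) + " comparisons");
        System.out.println("reordered:       " + countComparisons(true, 1000) + " comparisons");
    }
}
```

Both forms always return the same result. On this assumed input the original
order evaluates two comparisons per call and the reordered form only one. If
"second" were usually below first - variance instead, the original order would
be the cheaper one.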
Please change the PDFTextStripper.within() method to use the second form of the
test, i.e.:
    private boolean within( float first, float second, float variance )
    {
        return second < first + variance && second > first - variance;
    }
Thanks!