[ https://issues.apache.org/jira/browse/PDFBOX-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-600.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Jukka Zitting
Simple yet effective, nice! Committed in revision 899474.
> PDFBox performance issue: PDFTextStripper performance tweak
> ------------------------------------------------------------
>
> Key: PDFBOX-600
> URL: https://issues.apache.org/jira/browse/PDFBOX-600
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 0.8.0-incubator
> Environment: All
> Reporter: Mel Martinez
> Assignee: Jukka Zitting
> Fix For: 1.0.0
>
> Attachments: PDFTextStripper.java
>
>
> During text extraction, the PDFTextStripper needs to calculate the proximity
> of text positions in order to determine whether text elements overlap either
> vertically or horizontally.
> As part of this, the PDFTextStripper.within(float first, float second, float
> variance) method is used.
> The current (0.8.0) version of this method uses the following test:
>     second > first - variance && second < first + variance
> This is accurate, but slower in my test documents than if you flip the test
> order:
>     second < first + variance && second > first - variance
> This is because the second form of the test fails out faster on left-to-right
> text, which I believe should be treated as the default case.
> Please change the PDFTextStripper.within() method to use the second form of
> the test, i.e. to:
> private boolean within( float first, float second, float variance )
> {
>     return second < first + variance && second > first - variance;
> }
> Thanks!
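
The reordering described in the issue relies on Java's short-circuit && operator: when the left operand is false, the right operand is never evaluated. The sketch below is an illustration only and is not the PDFBox code. The class name WithinOrderDemo, the counting helpers gt/lt, and the sample values are all hypothetical, and the sample assumes a case where second lies well above first + variance, which is the situation the reporter suggests is common in left-to-right text.

```java
// Illustrative sketch only: names here are hypothetical, not part of PDFBox.
final class WithinOrderDemo {
    // Number of float comparisons evaluated since the last reset.
    static int comparisons = 0;

    // Counting wrappers so the effect of short-circuiting is observable.
    static boolean gt(float a, float b) { comparisons++; return a > b; }
    static boolean lt(float a, float b) { comparisons++; return a < b; }

    // Order reported for 0.8.0: lower bound checked first.
    static boolean withinLowerFirst(float first, float second, float variance) {
        return gt(second, first - variance) && lt(second, first + variance);
    }

    // Order proposed in the issue: upper bound checked first.
    static boolean withinUpperFirst(float first, float second, float variance) {
        return lt(second, first + variance) && gt(second, first - variance);
    }

    public static void main(String[] args) {
        // Assumed scenario: second is far above first + variance.
        float first = 10f, second = 50f, variance = 1f;

        comparisons = 0;
        boolean a = withinLowerFirst(first, second, variance);
        int costLowerFirst = comparisons;

        comparisons = 0;
        boolean b = withinUpperFirst(first, second, variance);
        int costUpperFirst = comparisons;

        // Both orders return the same answer; only the work done differs.
        System.out.println("lower-first: " + a + " after " + costLowerFirst + " comparison(s)");
        System.out.println("upper-first: " + b + " after " + costUpperFirst + " comparison(s)");
    }
}
```

Both forms always return the same boolean, so the change is purely a matter of which rejection is cheaper. If second were instead mostly below first - variance, the original order would be the faster one, so the gain depends on the distribution of inputs in the documents being processed.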