[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

Kevin Jackson (JIRA) Thu, 10 Feb 2011 20:24:23 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993338#comment-12993338 ]


Kevin Jackson commented on PDFBOX-956:
--------------------------------------

I replaced the original sample PDF file with one that did not contain 
JavaScript.
Yes, Adobe Reader ALSO has performance problems with this evil file.
This fix addresses the performance problem in PDFBox.


> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>             Fix For: 1.5.0
>
>         Attachments: PDFTextStripper.java.patch, c4ce2fcd_69.pdf
>
>
> The worst-case performance of the suppressDuplicateOverlappingText logic in 
> processTextPosition is O(n^2).
> The patch uses a TreeMap to achieve O(n log n) performance.
> The example PDF took over 2 hours to extract the text before this patch and 
> less than 10 minutes after.
> BTW:  The extracted text is also quite different compared to Adobe Reader.  
> Not sure which is correct but for this document it doesn't matter.
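
For readers without the attachment, here is a minimal sketch of the general idea, not the attached patch itself. The class name DuplicateTextFilter, the accept method, and the single shared tolerance are all hypothetical. It assumes a glyph counts as a duplicate when the same string was already seen within the tolerance in both x and y. A linear scan over every previously seen glyph costs O(n) per glyph and O(n^2) overall; sorted maps turn each check into a range lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the PDFBox API.
final class DuplicateTextFilter {
    // For each distinct string: x coordinate -> sorted y coordinates seen at that x.
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();
    private final float tolerance;

    DuplicateTextFilter(float tolerance) {
        if (tolerance < 0) {
            throw new IllegalArgumentException("tolerance must be non-negative");
        }
        this.tolerance = tolerance;
    }

    /** Returns false if the same text was already accepted within tolerance of (x, y). */
    boolean accept(String text, float x, float y) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(text, k -> new TreeMap<>());
        // Range query: only x entries inside [x - tolerance, x + tolerance] are visited.
        for (TreeSet<Float> ys
                : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            // A second range query on y; any hit means an overlapping duplicate.
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return false;
            }
        }
        // New glyph: record its position so later duplicates are caught.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return true;
    }
}
```

Each call does logarithmic-time lookups plus a visit to the x entries inside the tolerance band, so the total stays near O(n log n) as long as that band holds few entries. A document that stacks many copies of the same string at nearly the same x but different y would still be slow with this sketch.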

