[jira] Commented: (PDFBOX-956) Poor text extraction performance in PDFTextStripper.java

JIRA Sat, 12 Feb 2011 10:39:20 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993966#comment-12993966
 ]


Andreas Lehmkühler commented on PDFBOX-956:
-------------------------------------------

The provided PDF contains a lot of junk. There is a section with roughly 
21000 lines like the following:

(!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)Tj

That leads to roughly 632000 "!" characters in the text (see the attached 
extraction result). That text is invisible because of its size, but it triggers 
the suppress-duplicates algorithm of PDFBox, which doesn't perform very well.

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>             Fix For: 1.5.0
>
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch, 
> c4ce2fcd_69.pdf
>
>
> The worst case performance of the suppressDuplicateOverlappingText logic in 
> processTextPosition is O(n^2).
> The patch uses a TreeMap to achieve O(n log n) performance.
> Extracting the text from the example PDF took over 2 hours before this patch 
> and less than 10 minutes after.
> BTW:  The extracted text is also quite different compared to Adobe Reader.  
> Not sure which is correct but for this document it doesn't matter.
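
To illustrate the complexity difference described in the issue, here is a minimal 
sketch. It is not the attached PDFTextStripper.java.patch and not PDFBox's actual 
code. The Glyph class, both method names, and the tolerance parameter are 
hypothetical stand-ins chosen for this example. The first method compares each 
glyph against every glyph kept so far, which is quadratic when all glyphs carry 
the same text (such as hundreds of thousands of "!"). The second keeps the 
positions in sorted trees and only inspects entries inside the tolerance window.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class DuplicateSuppressionSketch {

    /** Hypothetical positioned glyph; a stand-in, not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Linear scan per glyph: O(n^2) overall when many glyphs share the same text. */
    static List<Glyph> suppressByScan(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) < tolerance
                        && Math.abs(k.y - g.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    /** Sorted lookup per glyph: only x and y values inside the tolerance window are visited. */
    static List<Glyph> suppressByTree(List<Glyph> glyphs, float tolerance) {
        // text -> (x -> sorted set of y values already kept at that x)
        Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            TreeMap<Float, TreeSet<Float>> byX =
                    seen.computeIfAbsent(g.text, t -> new TreeMap<>());
            boolean duplicate = false;
            // Exclusive bounds mirror the strict "<" comparisons in suppressByScan.
            for (TreeSet<Float> ys
                    : byX.subMap(g.x - tolerance, false, g.x + tolerance, false).values()) {
                if (!ys.subSet(g.y - tolerance, false, g.y + tolerance, false).isEmpty()) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                byX.computeIfAbsent(g.x, x -> new TreeSet<>()).add(g.y);
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("!", 0f, 0f));
        glyphs.add(new Glyph("!", 0.1f, 0f)); // overlaps the first "!"
        glyphs.add(new Glyph("!", 5f, 0f));   // far enough away to be kept
        glyphs.add(new Glyph("a", 0f, 0f));   // different text, so kept
        System.out.println(suppressByScan(glyphs, 0.5f).size()
                + " " + suppressByTree(glyphs, 0.5f).size()); // prints "3 3"
    }
}
```

The tree lookup costs O(log n) plus the number of entries inside the window, so 
the O(n log n) bound holds when only a few kept glyphs fall within the tolerance 
of any given position. If many distinct x values crowd into one window, the inner 
loop grows again.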

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
