[ https://issues.apache.org/jira/browse/PDFBOX-956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993966#comment-12993966 ]
Andreas Lehmkühler commented on PDFBOX-956:
-------------------------------------------

The provided PDF contains a lot of crap. There is a section with roughly 21000 lines like the following:

(!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!)Tj

That leads to roughly 632000 "!" characters in the text (see the attached extraction result). That text is invisible because of its size, but it triggers the suppress-duplicates algorithm of PDFBox, which does not perform well on this kind of input.

> Poor text extraction performance in PDFTextStripper.java
> --------------------------------------------------------
>
>                 Key: PDFBOX-956
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>            Reporter: Kevin Jackson
>             Fix For: 1.5.0
>
>         Attachments: PDFBOX956-c4ce2fcd_69.txt, PDFTextStripper.java.patch,
>                      c4ce2fcd_69.pdf
>
>
> The worst-case performance of the suppressDuplicateOverlappingText logic in
> processTextPosition is O(n^2).
> The patch is to use a TreeMap to achieve O(n log n) performance.
> The example PDF took over 2 hours to extract the text before this patch and
> less than 10 minutes after.
> BTW: The extracted text is also quite different compared to Adobe Reader.
> Not sure which is correct, but for this document it doesn't matter.
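
For readers who want to see why a TreeMap helps here, the following is a minimal sketch of the idea and not the attached patch itself. The class name DuplicateFilter, the method isDuplicate, and the tolerance parameter are hypothetical and chosen only for illustration. It assumes each glyph is identified by its text and an (x, y) position, and that a glyph counts as a duplicate when the same text was already kept within the tolerance in both directions. Instead of comparing against every previously seen glyph, it asks the sorted map for the x range of interest only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class DuplicateFilter {
    // glyph text -> (x position -> sorted y positions) of glyphs already kept
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the same text was already kept within +/- tolerance
     * in both x and y. Otherwise records the position and returns false.
     * Assumes tolerance is non-negative.
     */
    boolean isDuplicate(String text, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(text, k -> new TreeMap<>());

        // Range view: the sorted map locates the x window in logarithmic
        // time instead of scanning every earlier glyph.
        NavigableMap<Float, TreeSet<Float>> nearX =
                byX.subMap(x - tolerance, true, x + tolerance, true);

        for (TreeSet<Float> ys : nearX.values()) {
            // Same trick on the y axis for each nearby x column.
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return true;
            }
        }

        // Not a duplicate: remember this position for later lookups.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return false;
    }
}
```

With a list-based lookup, every new "!" would be compared against all earlier ones. Here each lookup touches only the entries inside the tolerance window, which is normally a handful, so the total cost stays close to n log n for n glyphs.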