Poor text extraction performance in PDFTextStripper.java
--------------------------------------------------------

                 Key: PDFBOX-956
                 URL: https://issues.apache.org/jira/browse/PDFBOX-956
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.4.0
            Reporter: Kevin Jackson
             Fix For: 1.5.0

The worst-case performance of the suppressDuplicateOverlappingText logic in processTextPosition is O(n^2). The patch uses a TreeMap to achieve O(n log n) performance. The example PDF took over 2 hours to extract the text before this patch and less than 10 minutes after.

BTW: The extracted text is also quite different from what Adobe Reader produces. Not sure which is correct, but for this document it doesn't matter.
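
For readers without the attachment, here is a minimal sketch of the general idea, not the patch itself. It assumes the duplicate check boils down to "has this same character already been seen within some tolerance of (x, y)?". The class name OverlapIndex, the method isDuplicate, and the plain float coordinates are hypothetical stand-ins for illustration; the real code works on TextPosition objects inside PDFTextStripper. Keeping the x coordinates in a TreeMap lets each lookup jump straight to the narrow window around x instead of scanning every earlier occurrence of the character.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of PDFBox.
class OverlapIndex {
    // character -> (x coordinate -> sorted y coordinates seen at that x)
    private final Map<String, TreeMap<Float, TreeSet<Float>>> seen = new HashMap<>();

    /**
     * Returns true if the same character was already recorded within
     * +/- tolerance of (x, y). Otherwise records the position and
     * returns false. The tolerance is assumed to be non-negative.
     */
    boolean isDuplicate(String ch, float x, float y, float tolerance) {
        TreeMap<Float, TreeSet<Float>> byX =
                seen.computeIfAbsent(ch, k -> new TreeMap<>());

        // Only visit x values inside the tolerance window; locating the
        // window is O(log n) rather than a scan over all earlier positions.
        for (TreeSet<Float> ys
                : byX.subMap(x - tolerance, true, x + tolerance, true).values()) {
            if (!ys.subSet(y - tolerance, true, y + tolerance, true).isEmpty()) {
                return true;
            }
        }

        // No nearby occurrence found, so remember this one.
        byX.computeIfAbsent(x, k -> new TreeSet<>()).add(y);
        return false;
    }
}
```

The loop still visits every distinct x inside the window, so a lookup costs O(log n) plus the number of x values that fall within the tolerance. That count is normally tiny, which is where the overall O(n log n) behaviour comes from.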