[ https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roger Håkansson updated PDFBOX-1305:
------------------------------------

    Attachment: 20020101ab3x012a.pdf

> Text extraction takes huge amount of time on some files
> -------------------------------------------------------
>
>                 Key: PDFBOX-1305
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1305
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Same phenomenon on Windows 7, Solaris 10 and CentOS 5.7.
>                      Same result with JDK 7u4 and JDK 6u32
>            Reporter: Roger Håkansson
>         Attachments: 20020101ab3x012a.pdf
>
>
> I've got 1.2M single-page PDF files which I'm indexing using Solr (which is
> using Tika, which is using PDFBox), and some of them take between 20 minutes
> and an hour to index.
> This is a huge problem for me: in 48 hours I've indexed about 45k files, and
> 19 hours of that time was spent on just 279 files.
> I've traced it to PDFBox taking a lot of time extracting the text from the
> documents.
> I've tested extracting the text using pdfbox-app's ExtractText with the same
> result: the text is extracted, but it takes forever...
> The attached file took about 23 minutes (using ExtractText), and in the
> result I can see a lot of "rubbish text" which I don't see in the text
> extracted from files that take a normal amount of time (up to a few seconds
> per file) to parse.
> When running truss (on Solaris; strace on Linux) on the Java process, I can
> see a lot of SEGV due to FLTBOUNDS. I don't know if it's related to this
> problem, but I just want to mention it.
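
The figures quoted in the report (48 hours in total, about 45k files, 19 hours spent on 279 files) can be turned into rough per-file averages. The sketch below is only an illustration: the function name `breakdown` and its parameter names are made up for this example, and 45000 stands in for "about 45k", so the results are estimates rather than measurements.

```python
def breakdown(total_hours, total_files, slow_hours, slow_files):
    """Split an indexing run into slow and normal files and average each group.

    All names here are illustrative; only the input figures come from the report.
    """
    normal_hours = total_hours - slow_hours
    normal_files = total_files - slow_files
    return {
        # Average seconds spent per file in each group.
        "slow_avg_s": slow_hours * 3600 / slow_files,
        "normal_avg_s": normal_hours * 3600 / normal_files,
        # Fraction of all files, and of all run time, taken by the slow group.
        "slow_file_share": slow_files / total_files,
        "slow_time_share": slow_hours / total_hours,
    }


if __name__ == "__main__":
    # 45_000 is an assumption standing in for "about 45k files".
    stats = breakdown(total_hours=48, total_files=45_000,
                      slow_hours=19, slow_files=279)
    print(f"slow files:   {stats['slow_avg_s'] / 60:.1f} min each on average")
    print(f"normal files: {stats['normal_avg_s']:.2f} s each on average")
    print(f"slow share:   {stats['slow_file_share']:.2%} of files, "
          f"{stats['slow_time_share']:.1%} of run time")
```

With the reported numbers, this works out to roughly 4 minutes per slow file on average against about 2.3 seconds for the rest, which matches the "up to a few seconds per file" described for normal files. About 0.6% of the files account for roughly 40% of the run time.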