[jira] [Updated] (PDFBOX-1305) Text extraction takes huge amount of time on some files

JIRA Wed, 09 May 2012 08:02:14 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roger Håkansson updated PDFBOX-1305:
------------------------------------

    Attachment: 20020101ab3x012a.pdf
    
> Text extraction takes huge amount of time on some files
> -------------------------------------------------------
>
>                 Key: PDFBOX-1305
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1305
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Same phenomenon on Windows 7, Solaris 10 and CentOS 5.7. 
> Same result with JDK 7u4 and JDK 6u32.
>            Reporter: Roger Håkansson
>         Attachments: 20020101ab3x012a.pdf
>
>
> I have 1.2M single-page PDF files that I'm indexing using Solr (which uses 
> Tika, which uses PDFBox), and some of them take between 20 minutes and an 
> hour to index.
> This is a huge problem for me: in 48 hours I've indexed about 45k files, and 
> 19 hours of that time was spent on just 279 files.
> I've traced it to PDFBox taking a lot of time extracting the text from the 
> documents.
> I've tested extracting the text using pdfbox-app's ExtractText with the same 
> result: the text is extracted, but it takes forever...
> The attached file took about 23 minutes (using ExtractText), and in the 
> output I can see a lot of "rubbish text" that I don't see in the text 
> extracted from files that take a normal amount of time (up to a few seconds 
> per file) to parse.
> When running truss on Solaris (strace on Linux) against the Java process, I 
> can see a lot of SEGV signals due to FLTBOUNDS. I don't know whether this is 
> related to the problem, but I wanted to mention it.
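
Until the root cause is found, one possible stop-gap for a bulk indexing run like the one described above is to put a time limit on each file so that a few pathological documents cannot consume most of the run. The following is only a sketch using the standard java.util.concurrent classes. The class name BoundedExtraction, the method runWithTimeout and the fallback value are hypothetical, and the Callable stands in for whatever call actually performs the extraction; none of this is part of PDFBox, Tika or Solr.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of PDFBox, Tika or Solr.
public class BoundedExtraction {

    /**
     * Runs the given task on a worker thread and returns its result,
     * or the fallback value if the task does not finish in time.
     * Exceptions thrown by the task itself are propagated to the caller
     * (wrapped in an ExecutionException).
     */
    public static String runWithTimeout(Callable<String> task,
                                        long timeoutMillis,
                                        String fallback) throws Exception {
        // Daemon thread, so a runaway extraction cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extract-worker");
            t.setDaemon(true);
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker. A CPU-bound task that
            // never checks its interrupt flag will keep running in the
            // background until it finishes on its own.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the per-file extraction in the Callable and log every file for which the fallback comes back, so the slow files can be collected and examined separately. Because interruption is cooperative, this limits how long the caller waits, not how much CPU the abandoned task uses; running the extraction in a separate process would be needed for a hard limit.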

