[jira] [Created] (PDFBOX-3429) Improve ExtractText Concurrency

Luis Filipe Nassif (JIRA) Mon, 18 Jul 2016 19:04:28 -0700

Luis Filipe Nassif created PDFBOX-3429:
------------------------------------------


             Summary: Improve ExtractText Concurrency
                 Key: PDFBOX-3429
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3429
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 2.0.1
         Environment: Win7, jdk1.9.0_60 x64
            Reporter: Luis Filipe Nassif
            Priority: Minor


While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text 
extraction application, I noticed CPU usage around 80% on my 6-core computer 
when processing a dataset of ~75 thousand PDFs (18 GB). It took 5 min 25 s to 
complete the text extraction. With Tika 1.10, which uses PDFBox 1.8.10, CPU 
usage stays around 100%, and the same extraction took 4 min 37 s. The dataset 
is read from a RAM drive, so there is no I/O bottleneck. I suspect there is 
some new synchronization code that blocks the threads for a non-trivial amount 
of time, resulting in lower CPU usage than before.
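
A comparison like the one above can be reproduced with a small timing harness. The following is only a minimal sketch, not the application used for the measurements: the class ExtractionBenchmark, its run method and the extractor parameter are hypothetical names, and the extractor stands in for whatever function wraps the actual Tika or PDFBox call. Running it with the same inputs and thread count against each library version gives comparable wall-clock times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical harness: times a fixed-size thread pool over a list of inputs.
final class ExtractionBenchmark {

    // Number of inputs processed and the wall-clock time the run took.
    static final class Result {
        final int processed;
        final long elapsedNanos;

        Result(int processed, long elapsedNanos) {
            this.processed = processed;
            this.elapsedNanos = elapsedNanos;
        }
    }

    // 'extractor' is an assumption: any function that turns one input into text.
    static <T> Result run(List<T> inputs, Function<T, String> extractor, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<String>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<String> task = () -> extractor.apply(input);
                futures.add(pool.submit(task));
            }
            int processed = 0;
            for (Future<String> future : futures) {
                future.get(); // wait for this task; rethrows extraction failures
                processed++;
            }
            return new Result(processed, System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If the elapsed time stops improving as the thread count grows while CPU usage stays below 100%, that would be consistent with threads waiting on a shared lock, although a thread dump or profiler is needed to confirm where they block.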



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

