Luis Filipe Nassif created PDFBOX-3429:
------------------------------------------
Summary: Improve ExtractText Concurrency
Key: PDFBOX-3429
URL: https://issues.apache.org/jira/browse/PDFBOX-3429
Project: PDFBox
Issue Type: Improvement
Components: Text extraction
Affects Versions: 2.0.1
Environment: Win7, jdk1.9.0_60 x64
Reporter: Luis Filipe Nassif
Priority: Minor
While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text
extraction application, I noticed CPU usage around 80% on my 6-core computer when
processing a dataset of ~75 thousand PDFs (18GB). It took 5min25sec to
complete the text extraction. With Tika 1.10, which uses PDFBox 1.8.10, CPU
usage stays around 100%. It took 4min37sec to complete. The dataset is read
from a ramdrive, so there is no I/O bottleneck. I suspect there is some new
synchronization code that blocks the threads for a non-trivial amount of time,
resulting in lower CPU usage than before.
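
A minimal sketch of how such a comparison could be reproduced, assuming a fixed
thread pool with one extraction task per file. The names ExtractBench,
timeExtraction and slowdown are hypothetical and are not part of Tika or PDFBox.
The extractor is passed in as a function so the same harness can be run against
either library version; the one in main is only a stand-in.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical harness, not the reporter's actual application.
class ExtractBench {

    /** Runs the extractor over all files on nThreads threads; returns elapsed wall-clock nanoseconds. */
    static long timeExtraction(List<Path> files, int nThreads,
                               Function<Path, String> extractor) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            long start = System.nanoTime();
            List<Future<String>> pending = new ArrayList<>();
            for (Path f : files) {
                Callable<String> task = () -> extractor.apply(f);
                pending.add(pool.submit(task));
            }
            for (Future<String> p : pending) {
                p.get(); // wait for the task; rethrows any worker failure
            }
            return System.nanoTime() - start;
        } finally {
            pool.shutdown();
        }
    }

    /** Relative slowdown of a candidate run versus a baseline run (0.10 means 10% slower). */
    static double slowdown(double baselineSeconds, double candidateSeconds) {
        return candidateSeconds / baselineSeconds - 1.0;
    }

    public static void main(String[] args) throws Exception {
        List<Path> files;
        try (Stream<Path> s = Files.walk(Paths.get(args[0]))) {
            files = s.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Stand-in extractor (assumption): replace the body with the real
        // extraction call of the library version under test.
        Function<Path, String> extractor = p -> {
            try {
                return String.valueOf(Files.size(p));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        int threads = Runtime.getRuntime().availableProcessors();
        long nanos = timeExtraction(files, threads, extractor);
        System.out.printf("%d files, %d threads, %.1f s%n", files.size(), threads, nanos / 1e9);
        // Timings reported above: 4min37sec = 277 s versus 5min25sec = 325 s.
        System.out.printf("reported slowdown: %.1f%%%n", 100 * slowdown(277, 325));
    }
}
```

Running the same harness once per library version, on the same dataset and with
the same thread count, would give directly comparable wall-clock times. Thread
dumps taken during the slower run could then show whether threads are blocked on
a shared lock.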