[ 
https://issues.apache.org/jira/browse/PDFBOX-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390113#comment-15390113
 ] 

Tilman Hausherr edited comment on PDFBOX-3429 at 7/22/16 8:04 PM:
------------------------------------------------------------------

You have 16 vertical squares so it is more difficult to see, but I'd say you're 
at about 75%. I'm at about 90% with 2.0.3.

I just tried YourKit, but wasn't able to find anything. All problems shown in 
the monitor window concern Java itself, not PDFBox. But I'd recommend you try 
it yourself too.


> Improve ExtractText Concurrency
> -------------------------------
>
>                 Key: PDFBOX-3429
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3429
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 2.0.1
>         Environment: Win7, jdk1.8.0_60 x64
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>              Labels: optimization
>         Attachments: 000000000000B265.pdf, cpu-pdfbox-2.0.2.png, 
> cpu-pdfbox1.8.10.png, cpu_pdfbox_2.0.3_and_1.8.10.png, tilman-combined-cpu.png
>
>
> While testing Tika 1.13, which uses PDFBox 2.0.1, from a multithreaded text 
> extraction application, I noticed CPU usage around 80% on my 6-core computer 
> when processing a dataset of ~75 thousand PDFs (18 GB). It took 5 min 25 sec 
> to complete the text extraction. With Tika 1.10, which uses PDFBox 1.8.10, 
> CPU usage stays around 100%, and the extraction took 4 min 37 sec. The dataset 
> is read from a ramdrive, so there is no I/O bottleneck. I suspect there is some 
> new synchronization code that blocks the threads for a non-trivial amount of 
> time, resulting in lower CPU usage than before.
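
One way to test the suspicion above without a profiler is to ask the JVM how often worker threads block on monitors. The sketch below is only an illustration under stated assumptions: it uses the JDK's java.lang.management API and no PDFBox or Tika code, and the names ContentionProbe, SHARED_LOCK, contendedTask and measure are hypothetical. The synchronized sleep stands in for whatever shared lock the extraction threads might be hitting.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

final class ContentionProbe {

    // Hypothetical stand-in for a lock shared by all extraction threads.
    private static final Object SHARED_LOCK = new Object();

    /** Hypothetical workload: holds the shared lock briefly, so threads queue up. */
    static void contendedTask() {
        synchronized (SHARED_LOCK) {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /**
     * Runs the task on nThreads threads, iterations times each, and returns
     * {number of times a worker blocked on a monitor, total blocked milliseconds}.
     * The second value is -1 when the JVM cannot time contention.
     */
    static long[] measure(int nThreads, int iterations, Runnable task) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean timed = mx.isThreadContentionMonitoringSupported();
        if (timed) {
            mx.setThreadContentionMonitoringEnabled(true);
        }
        AtomicLong blockedCount = new AtomicLong();
        AtomicLong blockedMillis = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    task.run();
                }
                // Sample while the thread is still alive; ThreadInfo is null afterwards.
                ThreadInfo info = mx.getThreadInfo(Thread.currentThread().getId());
                if (info != null) {
                    blockedCount.addAndGet(info.getBlockedCount());
                    if (timed) {
                        blockedMillis.addAndGet(info.getBlockedTime());
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new long[] { blockedCount.get(), timed ? blockedMillis.get() : -1 };
    }

    public static void main(String[] args) {
        long[] r = measure(4, 50, ContentionProbe::contendedTask);
        System.out.println("blocked count=" + r[0] + ", blocked ms=" + r[1]);
    }
}
```

Passing the real per-document extraction call as the task, once per library version, would give a rough comparison of monitor contention between the two. Note that this only counts blocking on synchronized monitors; waits on java.util.concurrent locks are reported as waiting rather than blocked.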


