[ 
https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628556#comment-14628556
 ] 

John Hewson edited comment on PDFBOX-2882 at 7/15/15 7:19 PM:
--------------------------------------------------------------

{quote}
I've downloaded testPDF_childAttachments.pdf from PDFBOX-2856 and run 
PDDocument.load(File,useScratchFile) on it and queried page count. 
{quote}

Is getting the page count the only thing you're benchmarking? If so, it's not a 
representative use case. Try processing a multipage file with many streams, 
e.g. extracting text from a large file, or better yet rendering a file with lots 
of images.

I suppose the improvement for subsequent runs is due to JVM warmup. Try using 
different PDF files. The OS will do some caching too, but that should impact 
each technique equally, so it's not an issue.
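
To separate JVM warmup from the effect being measured, something like the following could be used. It is only a rough sketch, not taken from the attached patch: the class name WarmupBenchmark, its methods and the placeholder workload are all made up for illustration, and the real load-and-process step would replace the Runnable.

```java
import java.util.Arrays;

class WarmupBenchmark {

    // Written by the placeholder workload so the JIT cannot discard the loop.
    static volatile long sink;

    /**
     * Runs the task warmupRuns times without timing it, then measuredRuns
     * times while recording the elapsed nanoseconds of each run.
     */
    public static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // result discarded: gives the JIT a chance to compile hot paths
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Returns the middle element of the sorted values (upper median for even lengths). */
    public static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Hypothetical placeholder workload; substitute the actual
        // document loading and processing under test.
        Runnable task = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i % 7;
            }
            sink = sum;
        };
        long[] nanos = measure(task, 5, 10);
        System.out.println("median ns after warmup: " + median(nanos));
    }
}
```

Running the same harness once per technique, and over several different files, should make the numbers easier to compare than a single cold run.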

{quote}
The pure scratch-file is nearly equally good as no-scratch-file
{quote}

Maybe you're hitting the OS's file cache? There's usually a write cache too. Try 
using multiple, larger PDFs.

Also, are you sure that the Java process isn't paging via virtual memory due to 
limited memory on your VM?
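
As a first check, the heap limits the JVM is actually running with can be printed and compared with the memory assigned to the VM. This is a minimal sketch; the class name HeapReport and the toMiB helper are made up for illustration. It only shows the JVM's own view, so whether the OS is really swapping still has to be checked with the OS's tools.

```java
class HeapReport {

    /** Converts a byte count to whole mebibytes (rounds down). */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the heap may grow to (e.g. set via -Xmx);
        // it can be Long.MAX_VALUE if there is no inherent limit.
        System.out.println("max heap (MiB):   " + toMiB(rt.maxMemory()));
        // totalMemory() is the heap currently reserved by the JVM.
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

If the maximum heap is close to or above the VM's physical memory, paging becomes a plausible explanation for the timings.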


> Improve performance when using scratch file
> -------------------------------------------
>
>                 Key: PDFBOX-2882
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2882
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>            Priority: Minor
>         Attachments: ScratchFile.java, ScratchFileBuffer.java
>
>
> The current scratch file implementation uses many direct I/O calls, which 
> slows down parsing considerably compared with the in-memory scratch buffer.


