[
https://issues.apache.org/jira/browse/PDFBOX-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628556#comment-14628556
]
John Hewson edited comment on PDFBOX-2882 at 7/15/15 7:19 PM:
--------------------------------------------------------------
{quote}
I've downloaded testPDF_childAttachments.pdf from PDFBOX-2856 and run
PDDocument.load(File,useScratchFile) on it and queried page count.
{quote}
Is getting the page count the only thing you're benchmarking? If so, it's not a
representative use case. Try processing a multipage file with many streams,
e.g. extracting text on a large file, or better yet rendering a file with lots
of images.
I suppose the improvement for subsequent runs is due to JVM warmup. Try using
different PDF files. The OS will do some caching too, but that should impact
each technique equally, so it's not an issue.
{quote}
The pure scratch-file is nearly equally good as no-scratch-file
{quote}
Maybe you're hitting the OS' file cache? There's usually a write cache too. Try
using multiple, larger PDFs.
Also, are you sure that the Java process isn't paging to virtual memory
because of limited memory on your VM?
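A minimal sketch of how a benchmark could keep JVM warmup out of the measured
runs, as suggested above. The WarmupBenchmark class and its methods are
hypothetical helpers, not part of PDFBox, and the workload in main() is only a
stand-in. A real comparison would swap in text extraction or rendering over
several large PDFs and run it once per scratch-file setting.

```java
import java.util.Arrays;

// Hypothetical timing harness for illustration; not part of PDFBox.
class WarmupBenchmark {

    // Runs the task warmupRuns times without timing it, then returns the
    // wall-clock duration in nanoseconds of each of the measuredRuns runs.
    static long[] measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // give the JIT a chance to compile hot paths first
        }
        long[] nanos = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    // Median of the timings; less sensitive to outliers than the mean.
    static long median(long[] nanos) {
        if (nanos.length == 0) {
            throw new IllegalArgumentException("no timings");
        }
        long[] sorted = nanos.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption): replace with the PDF processing
        // under test, ideally iterating over several different large files.
        Runnable workload = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unexpected empty result");
            }
        };
        long[] timings = measure(workload, 5, 10);
        System.out.println("median ms: " + median(timings) / 1_000_000.0);
    }
}
```

Reporting the median over several measured runs, rather than comparing the
first run against later ones, makes it easier to tell a real difference
between the techniques from warmup or caching effects.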
> Improve performance when using scratch file
> -------------------------------------------
>
> Key: PDFBOX-2882
> URL: https://issues.apache.org/jira/browse/PDFBOX-2882
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Affects Versions: 2.0.0
> Reporter: Timo Boehme
> Assignee: Timo Boehme
> Priority: Minor
> Attachments: ScratchFile.java, ScratchFileBuffer.java
>
>
> The current scratch file implementation uses many direct I/O calls, which
> slows down parsing considerably compared with the in-memory scratch buffer.