[
https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306
]
Tim Allison edited comment on TIKA-1375 at 7/25/14 11:43 AM:
-------------------------------------------------------------
I ran four versions of Tika against a random selection of 10k PDFs from
govdocs1 to make sure that there wouldn't be any surprises if we added the
three calls to clear(). These were all single-threaded runs on an 8GB Linux VM.
The first run was the most recent SNAPSHOT (1.6 Baseline). The second run was
after the three calls to clear() were added; because embedded images are not
extracted by default, the only call that was ever actually reached was the one
at the end of each page. The third run was the 1.6 SNAPSHOT/Baseline with
image extraction turned on, and the fourth was the clear() version with image
extraction turned on.
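For readers who want a picture of the "clear at the end of each page" pattern, here is a minimal sketch. SketchPage and its methods are hypothetical stand-ins invented for illustration; they are not the PDFBox or Tika classes touched by the patch, and the cached byte arrays only simulate decoded page resources.

```java
import java.util.ArrayList;
import java.util.List;

public class PerPageClearSketch {

    /** Hypothetical stand-in for a parsed page that caches decoded resources. */
    static class SketchPage {
        private final List<byte[]> cache = new ArrayList<>();

        void extract() {
            // Pretend that extraction decodes and caches a 1 KB resource.
            cache.add(new byte[1024]);
        }

        void clear() {
            // Drop the cached resources so they can be garbage collected.
            cache.clear();
        }

        int cachedResources() {
            return cache.size();
        }
    }

    /** Processes every page and releases its cache before moving to the next one. */
    static int processAll(List<SketchPage> pages) {
        int processed = 0;
        for (SketchPage page : pages) {
            try {
                page.extract();
                processed++;
            } finally {
                // Runs even if extract() throws, so no page keeps its cache alive.
                page.clear();
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<SketchPage> pages = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            pages.add(new SketchPage());
        }
        System.out.println(processAll(pages) + " pages processed");
    }
}
```

The point of the finally block is only that cached data never outlives the page it belongs to, which is the effect the per-page clear() call is meant to have.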
The number of exceptions was the same across all four versions. Within the
"without image extraction" pair, the number of metadata elements was exactly
the same, and likewise within the "with image extraction" pair.
Adding .clear() improved speed when not extracting images (roughly 11% on the
mean, 25% on the median) and decreased speed by a much smaller amount
percentage-wise (about 3% on both mean and median) when extracting images.
||Run||Average Millis||Median Millis||
|Tika 1.6 Baseline|272.5|113.0|
|Tika 1.6 Page.clear()|243.6|85.0|
|Tika 1.6 Baseline Image Extraction|861.5|120.5|
|Tika 1.6 Image Extraction w/ 3x clear()|888.2|124.0|
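As a quick check on the percentage-wise comparison, the relative changes can be recomputed from the table above. This is only an illustrative sketch: the class and method names are made up here, and the inputs are the four pairs of timings copied from the table.

```java
public class TimingDelta {

    /** Relative change from baseline to candidate in percent; negative means faster. */
    static double percentChange(double baselineMillis, double candidateMillis) {
        return (candidateMillis - baselineMillis) / baselineMillis * 100.0;
    }

    public static void main(String[] args) {
        // Timings in milliseconds, copied from the table above.
        System.out.printf("no images, mean:     %+.1f%%%n", percentChange(272.5, 243.6));
        System.out.printf("no images, median:   %+.1f%%%n", percentChange(113.0, 85.0));
        System.out.printf("with images, mean:   %+.1f%%%n", percentChange(861.5, 888.2));
        System.out.printf("with images, median: %+.1f%%%n", percentChange(120.5, 124.0));
    }
}
```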
> Decrease memory consumption when extracting images from PDFs
> ------------------------------------------------------------
>
> Key: TIKA-1375
> URL: https://issues.apache.org/jira/browse/TIKA-1375
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.6
>
>
> This patch applies changes made in PDFBOX-2101 to decrease memory consumption
> during extraction of embedded images. This also applies the recommendation
> by [~tilman] on the PDFBox dev [list |
> http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201407.mbox/%[email protected]%3e]
> to clear resources after handling each page.