[jira] [Comment Edited] (TIKA-1375) Decrease memory consumption when extracting images from PDFs

Tim Allison (JIRA) Fri, 25 Jul 2014 04:45:08 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074306#comment-14074306
 ]


Tim Allison edited comment on TIKA-1375 at 7/25/14 11:43 AM:
-------------------------------------------------------------

I ran four versions of Tika against a random selection of 10k PDFs from 
govdocs1 to make sure that there wouldn't be any surprises if we added the 
three calls to clear().  These were all single-threaded runs on an 8GB Linux VM.

The first run was the most recent SNAPSHOT (1.6 Baseline). The second run was 
after the three calls to clear() were added; because embedded images are not 
extracted by default, the only call that was actually reached was the one 
at the end of each page.  The third run was the 1.6 SNAPSHOT/Baseline with 
image extraction turned on, and the fourth was the clear() version with image 
extraction turned on.
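
To illustrate why clearing at the end of each page can bound memory, here is a toy sketch in Python. It is not the PDFBox or Tika API: the Page class, its cache, and the byte sizes are all hypothetical stand-ins, and it models only the per-page call, not the other two calls to clear().

```python
class Page:
    """Hypothetical stand-in for a parsed page; NOT the PDFBox API."""

    def __init__(self, number, image_sizes):
        self.number = number
        self._image_sizes = image_sizes  # made-up sizes in bytes
        self._cache = {}                 # decoded images kept by the page

    def images(self):
        # Decode lazily and cache, as a parser might do for page resources.
        for i, size in enumerate(self._image_sizes):
            if i not in self._cache:
                self._cache[i] = bytearray(size)
            yield self._cache[i]

    def cached_bytes(self):
        return sum(len(b) for b in self._cache.values())

    def clear(self):
        # Drop everything cached for this page.
        self._cache.clear()


def make_pages():
    # Illustrative sizes only; not taken from any real document.
    return [Page(1, [1000, 2000]), Page(2, [3000])]


def process(pages, clear_after_page):
    """Return the peak number of cached bytes held across all pages."""
    peak = 0
    for page in pages:
        for _image in page.images():
            # A real handler would write the image out here.
            peak = max(peak, sum(p.cached_bytes() for p in pages))
        if clear_after_page:
            page.clear()
    return peak


print(process(make_pages(), clear_after_page=False))  # 6000
print(process(make_pages(), clear_after_page=True))   # 3000
```

In this model the peak without clearing grows with the whole document, while with clearing it is limited to the largest single page.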

The number of exceptions was the same across all four runs.  Within the 
"without image extraction" pair, the number of metadata elements was exactly 
the same, and likewise within the "with image extraction" pair.

Adding clear() improved speed when not extracting images (average down about 
11%, from 272.5 to 243.6 ms) and decreased speed by a much smaller amount 
percentage-wise when extracting images (average up about 3%, from 861.5 to 
888.2 ms).

||Run||Average Millis||Median Millis||
|Tika 1.6 Baseline|272.5|113.0|
|Tika 1.6 Page.clear()|243.6|85.0|
|Tika 1.6 Baseline Image Extraction|861.5|120.5|
|Tika 1.6 Image Extraction w/ 3x clear()|888.2|124.0|
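
The relative changes can be checked directly from the table. The sketch below is only arithmetic on the four rows above; the dictionary keys and the helper name percent_change are labels chosen for illustration.

```python
# Timings copied from the table above: (average ms, median ms).
timings = {
    "no image extraction": {"baseline": (272.5, 113.0), "clear": (243.6, 85.0)},
    "image extraction": {"baseline": (861.5, 120.5), "clear": (888.2, 124.0)},
}


def percent_change(before, after):
    """Relative change from before to after, in percent (negative = faster)."""
    return (after - before) / before * 100.0


for mode, runs in timings.items():
    avg = percent_change(runs["baseline"][0], runs["clear"][0])
    med = percent_change(runs["baseline"][1], runs["clear"][1])
    print(f"{mode}: average {avg:+.1f}%, median {med:+.1f}%")

# no image extraction: average -10.6%, median -24.8%
# image extraction: average +3.1%, median +2.9%
```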




> Decrease memory consumption when extracting images from PDFs
> ------------------------------------------------------------
>
>                 Key: TIKA-1375
>                 URL: https://issues.apache.org/jira/browse/TIKA-1375
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.6
>
>
> This patch applies changes made in PDFBOX-2101 to decrease memory consumption 
> during extraction of embedded images.  This also applies the recommendation 
> by [~tilman] on the PDFBox dev [list | 
> http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201407.mbox/%[email protected]%3e]
>  to clear resources after handling each page.



