[
https://issues.apache.org/jira/browse/TIKA-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055402#comment-17055402
]
Tilman Hausherr commented on TIKA-3067:
---------------------------------------
I think my test was with an old version; I'm not fully up to date because I'm
not fully familiar with git. But I saw the code before updating: the old code
went through the resources, while the new code uses the strategy of PDFBox. The
PDF file lists the image masks separately in the resource dictionary even
though the content stream does not use them. On the first page, this is Im764.
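
To illustrate why the two strategies can report different images, here is a minimal sketch in Python. It is a toy model, not Tika or PDFBox code: the page is a plain dictionary, and it assumes that "the strategy of PDFBox" means reporting only XObjects that the content stream paints with the `Do` operator. `Im764` is the name from the first page; `Im1` and the operator list are made up for illustration.

```python
# Toy model of one PDF page (hypothetical structure, not a PDFBox API):
# "resources" holds the XObjects listed in the resource dictionary,
# "content_stream" holds (operator, operand) pairs in drawing order.
page = {
    "resources": {"Im1": "image", "Im764": "image mask"},
    "content_stream": [("q", None), ("cm", None), ("Do", "Im1"), ("Q", None)],
}


def images_via_resources(page):
    """Old strategy: report every XObject listed in the resource dictionary."""
    return sorted(page["resources"])


def images_via_content_stream(page):
    """Assumed new strategy: report only XObjects painted by a Do operator."""
    drawn = set()
    for operator, operand in page["content_stream"]:
        if operator == "Do" and operand in page["resources"]:
            drawn.add(operand)
    return sorted(drawn)


by_resources = images_via_resources(page)
by_content = images_via_content_stream(page)

# Entries that only the resource walk reports.
only_in_resources = sorted(set(by_resources) - set(by_content))
print(only_in_resources)  # ['Im764']
```

In this model the resource walk reports both entries, while the content-stream walk skips `Im764` because nothing draws it, which would account for embedded files that appear with one version and not the other.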
> Different numbers of embedded inline images with PDF inline image extraction
> code
> ---------------------------------------------------------------------------------
>
> Key: TIKA-3067
> URL: https://issues.apache.org/jira/browse/TIKA-3067
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 437698_tika_1_23.tgz, 437698_tika_1_24.tgz,
> attachment_diffs_with_exceptions.xlsx
>
>
> I ran inline image extraction on a local sample of 20k files from Common
> Crawl and govdocs1.
> These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
> ||MIME_STRING||CNT||
> |image/png|175,413|
> |image/tiff|59,507|
> |image/jpeg|6,435|
> |image/x-jbig2|4,998|
> |image/jp2|4,573|
> |image/x-jp2-codestream|1|
> This makes it look like we're gaining ~175k png files with the new
> method... However, in other files, it looks like we're losing a bunch of
> embedded images as well.
> These are embedded files missing in 1.24-pre-rc1:
> ||MIME_STRING||CNT||
> |image/png|105,885|
> |image/tiff|55,636|
> |image/jpeg|3,289|
> |image/x-jbig2|291|
> |text/plain; charset=windows-1252|2|
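
To put the two tables side by side, the following sketch subtracts the counts per MIME type. The numbers are copied from the tables above; the variable and function names are made up here. Note the assumption that the two counts are comparable at all: a file missing in 1.23 is treated as gained in 1.24-pre-rc1, and the net figure says nothing about whether the gained and lost attachments are the same images.

```python
# Counts copied from the two tables in the issue description.
missing_in_1_23 = {  # present in 1.24-pre-rc1 only, i.e. gained
    "image/png": 175_413,
    "image/tiff": 59_507,
    "image/jpeg": 6_435,
    "image/x-jbig2": 4_998,
    "image/jp2": 4_573,
    "image/x-jp2-codestream": 1,
}
missing_in_1_24_pre_rc1 = {  # present in 1.23 only, i.e. lost
    "image/png": 105_885,
    "image/tiff": 55_636,
    "image/jpeg": 3_289,
    "image/x-jbig2": 291,
    "text/plain; charset=windows-1252": 2,
}


def net_change(gained, lost):
    """Return gained minus lost for every MIME type seen in either table."""
    mime_types = set(gained) | set(lost)
    return {m: gained.get(m, 0) - lost.get(m, 0) for m in mime_types}


net = net_change(missing_in_1_23, missing_in_1_24_pre_rc1)
for mime, delta in sorted(net.items(), key=lambda item: -item[1]):
    print(f"{mime}: {delta:+,}")
print(f"total: {sum(net.values()):+,}")
```

For png, for example, this gives a net gain of 69,528 rather than the ~175k suggested by the first table alone.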