[jira] [Commented] (TIKA-3067) Different numbers of embedded inline images with PDF inline image extraction code

Tilman Hausherr (Jira) Mon, 09 Mar 2020 14:48:53 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055402#comment-17055402
 ]


Tilman Hausherr commented on TIKA-3067:
---------------------------------------

I think my test was with an old version. I'm not fully up to date because I'm 
not fully familiar with git. But I saw the code before updating: the old code 
went through the resources, while the new code uses the strategy of PDFBox. The 
PDF file lists the image masks as separate entries in the resource dictionary 
even though the content stream does not use them. On the first page, this is 
Im764.
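
To illustrate why the two strategies can report different image counts, here 
is a minimal, self-contained sketch. It does not use the PDFBox or Tika APIs; 
it only models a page as a set of resource names plus a list of content stream 
tokens, and it assumes that "the strategy of PDFBox" means reporting only the 
images that a "Do" operator actually draws. The name Im1, the token list, and 
all class and method names are hypothetical; only Im764 comes from the file 
discussed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ResourceVsContentStream {

    // Resource-walking strategy (assumed model of the old behaviour):
    // report every image name listed in the page's resource dictionary.
    static List<String> imagesFromResources(Set<String> resourceImageNames) {
        return new ArrayList<>(resourceImageNames);
    }

    // Content-stream strategy (assumed model of the new behaviour):
    // report only names that appear as the operand of a "Do" operator.
    static List<String> imagesFromContentStream(List<String> tokens,
                                                Set<String> resourceImageNames) {
        List<String> drawn = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            String operand = tokens.get(i - 1);
            if ("Do".equals(tokens.get(i)) && operand.startsWith("/")) {
                String name = operand.substring(1);
                if (resourceImageNames.contains(name)) {
                    drawn.add(name);
                }
            }
        }
        return drawn;
    }

    public static void main(String[] args) {
        // Im764 stands for the mask that is listed as a resource but never drawn;
        // Im1 is a made-up name for the image that the page draws.
        Set<String> resources = new LinkedHashSet<>(Arrays.asList("Im1", "Im764"));
        List<String> tokens = Arrays.asList("q", "/Im1", "Do", "Q");

        List<String> oldStyle = imagesFromResources(resources);
        List<String> newStyle = imagesFromContentStream(tokens, resources);

        System.out.println("resources:      " + oldStyle); // [Im1, Im764]
        System.out.println("content stream: " + newStyle); // [Im1]

        // Names only the resource walk reports, e.g. separately listed masks.
        List<String> onlyInResources = new ArrayList<>(oldStyle);
        onlyInResources.removeAll(newStyle);
        System.out.println("only resources: " + onlyInResources); // [Im764]
    }
}
```

In this model the resource walk reports two images and the content stream walk 
reports one, which would show up as an embedded file "missing" in the newer 
version.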

> Different numbers of embedded inline images with PDF inline image extraction 
> code
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-3067
>                 URL: https://issues.apache.org/jira/browse/TIKA-3067
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: 437698_tika_1_23.tgz, 437698_tika_1_24.tgz, 
> attachment_diffs_with_exceptions.xlsx
>
>
> I ran extract inline images on a local sample of 20k files of common crawl 
> and govdocs1.
> These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
> ||MIME_STRING||CNT||
> |image/png|175,413|
> |image/tiff|59,507|
> |image/jpeg|6,435|
> |image/x-jbig2|4,998|
> |image/jp2|4,573|
> |image/x-jp2-codestream|1|
> This makes it look like we're gaining ~175k png files with the new method. 
> However, in other files, it looks like we're losing a bunch of embedded 
> images as well.
> These are embedded files missing in 1.24-pre-rc1:
> ||MIME_STRING||CNT||
> |image/png|105,885|
> |image/tiff|55,636|
> |image/jpeg|3,289|
> |image/x-jbig2|291|
> |text/plain; charset=windows-1252|2|
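
The two tables in the quoted description can be combined into a net change 
per MIME type. The sketch below only does that arithmetic on the quoted 
counts; the class and method names are hypothetical, and a net figure says 
nothing about whether the gained and lost files are related to each other.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NetEmbeddedImageDiff {

    // Files missing in 1.23, i.e. present only in 1.24-pre-rc1 (first table).
    static Map<String, Integer> gainedIn124() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("image/png", 175_413);
        m.put("image/tiff", 59_507);
        m.put("image/jpeg", 6_435);
        m.put("image/x-jbig2", 4_998);
        m.put("image/jp2", 4_573);
        m.put("image/x-jp2-codestream", 1);
        return m;
    }

    // Files missing in 1.24-pre-rc1, i.e. present only in 1.23 (second table).
    static Map<String, Integer> lostIn124() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("image/png", 105_885);
        m.put("image/tiff", 55_636);
        m.put("image/jpeg", 3_289);
        m.put("image/x-jbig2", 291);
        m.put("text/plain; charset=windows-1252", 2);
        return m;
    }

    // Net change per MIME type: gained minus lost.
    static Map<String, Integer> netChange(Map<String, Integer> gained,
                                          Map<String, Integer> lost) {
        Map<String, Integer> net = new LinkedHashMap<>(gained);
        for (Map.Entry<String, Integer> e : lost.entrySet()) {
            net.merge(e.getKey(), -e.getValue(), Integer::sum);
        }
        return net;
    }

    public static void main(String[] args) {
        int total = 0;
        for (Map.Entry<String, Integer> e : netChange(gainedIn124(), lostIn124()).entrySet()) {
            System.out.printf("%-34s %+d%n", e.getKey(), e.getValue());
            total += e.getValue();
        }
        System.out.printf("%-34s %+d%n", "total", total);
    }
}
```

On the quoted numbers this gives a net of +69,528 for image/png and +85,824 
embedded files overall, so the ~175k png gain is partly offset by the losses.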



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
