[jira] [Created] (TIKA-3067) Different numbers of embedded inline images with PDF inline image extraction code

Tim Allison (Jira) Mon, 09 Mar 2020 13:33:53 -0700

Tim Allison created TIKA-3067:
---------------------------------

             Summary: Different numbers of embedded inline images with PDF 
inline image extraction code
                 Key: TIKA-3067
                 URL: https://issues.apache.org/jira/browse/TIKA-3067
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison



I ran inline image extraction on a local sample of 20k files from Common Crawl 
and govdocs1.

These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
||MIME_STRING||CNT||
|image/png|175,413|
|image/tiff|59,507|
|image/jpeg|6,435|
|image/x-jbig2|4,998|
|image/jp2|4,573|
|image/x-jp2-codestream|1|

At first glance, it looks like we're gaining ~175k png files with the new 
method. However, in other files, it looks like we're losing a large number of 
embedded files as well.

These are embedded files missing in 1.24-pre-rc1 when compared with 1.23:
||MIME_STRING||CNT||
|image/png|105,885|
|image/tiff|55,636|
|image/jpeg|3,289|
|image/x-jbig2|291|
|text/plain; charset=windows-1252|2|
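
To make the comparison easier to read, the two tables above can be combined into a net change per MIME type. This is only a rough sketch: it assumes "missing in 1.23" means the file was found only by 1.24-pre-rc1 (and vice versa), and the names `missing_in_1_23`, `missing_in_1_24_pre_rc1`, and `net_change` are made up for illustration. A net figure also hides which container files are affected, so it is not a substitute for a per-file diff.

```python
# Counts copied from the two tables above.
# Assumption: each count is the number of embedded files found by one
# version and not by the other.
missing_in_1_23 = {
    "image/png": 175_413,
    "image/tiff": 59_507,
    "image/jpeg": 6_435,
    "image/x-jbig2": 4_998,
    "image/jp2": 4_573,
    "image/x-jp2-codestream": 1,
}

missing_in_1_24_pre_rc1 = {
    "image/png": 105_885,
    "image/tiff": 55_636,
    "image/jpeg": 3_289,
    "image/x-jbig2": 291,
    "text/plain; charset=windows-1252": 2,
}


def net_change(gained, lost):
    """Return gained minus lost per MIME type, over the union of both key sets."""
    mimes = set(gained) | set(lost)
    return {mime: gained.get(mime, 0) - lost.get(mime, 0) for mime in mimes}


net = net_change(missing_in_1_23, missing_in_1_24_pre_rc1)

# Print the largest net gains first; a negative value is a net loss.
for mime, delta in sorted(net.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mime}\t{delta:+,}")

print(f"total\t{sum(net.values()):+,}")
```

Under those assumptions, the net gain for png is much smaller than the ~175k the first table suggests on its own, since roughly 106k png files go missing in the other direction.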



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
