Tim Allison created TIKA-4220:
---------------------------------

             Summary: Commons-compress too lenient on headless tar detection
                 Key: TIKA-4220
                 URL: https://issues.apache.org/jira/browse/TIKA-4220
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


In recent regression tests for TIKA-4218, we noticed a fairly major change: an 
increased rate of false positives in headless tar detection from 
commons-compress.

I think for now we should copy/paste/fork the headless tar detection and then 
improve it, revert it, or possibly remove it for our 2.9.2 release.

On this ticket, I'll look into what changed recently in headless tar detection 
in commons-compress and experiment with fixing it.
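
As a starting point for that experiment, here is a minimal sketch of one stricter check a forked detector could apply: validate the checksum stored in the first 512-byte header block. The class and method names are hypothetical; this is not the commons-compress or Tika implementation. The field layout (checksum at offset 148, 8 bytes, computed with the checksum field treated as spaces) follows the standard tar header format, and accepting a signed-byte sum is an assumption made to tolerate some older tar writers.

```java
/** Hypothetical helper for illustration; not part of Tika or commons-compress. */
final class StrictTarHeaderCheck {

    private static final int HEADER_SIZE = 512;
    private static final int CHKSUM_OFFSET = 148;
    private static final int CHKSUM_LENGTH = 8;

    private StrictTarHeaderCheck() {
    }

    /** Returns true only if the first 512 bytes carry a matching header checksum. */
    static boolean looksLikeTarHeader(byte[] buf) {
        if (buf == null || buf.length < HEADER_SIZE) {
            return false;
        }
        long stored = parseOctal(buf, CHKSUM_OFFSET, CHKSUM_LENGTH);
        if (stored < 0) {
            return false; // no parsable checksum, e.g. an all-zero block
        }
        long unsignedSum = 0;
        long signedSum = 0;
        for (int i = 0; i < HEADER_SIZE; i++) {
            int b = buf[i];
            // The checksum field itself is counted as eight ASCII spaces.
            if (i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LENGTH) {
                b = ' ';
            }
            unsignedSum += b & 0xFF;
            signedSum += b;
        }
        // Assumption: accept the signed variant for older tar writers.
        return stored == unsignedSum || stored == signedSum;
    }

    /** Parses an octal field; returns -1 if it has no digits or stray bytes. */
    static long parseOctal(byte[] buf, int offset, int length) {
        int i = offset;
        int end = offset + length;
        while (i < end && buf[i] == ' ') {
            i++; // skip leading spaces
        }
        long value = 0;
        int digits = 0;
        while (i < end && buf[i] >= '0' && buf[i] <= '7') {
            value = (value << 3) + (buf[i] - '0');
            digits++;
            i++;
        }
        if (digits == 0) {
            return -1;
        }
        // Only NUL or space terminators may follow the digits.
        for (; i < end; i++) {
            if (buf[i] != 0 && buf[i] != ' ') {
                return -1;
            }
        }
        return value;
    }
}
```

A checksum match alone may still admit some false positives, so a real fix would probably combine it with other sanity checks on the header fields.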

One challenge is that our magic bytes detection happens _after_ our custom 
detectors, which means that we can't assign a low confidence to what comes out 
of our custom detectors and let the magic detection fix it. We could implement 
an x-tar special case, but I really don't like that.

Let's see what we can do...

The numbers below are the counts of files identified as A (in tika-2.9.1) 
-> B (in tika-2.9.2-pre-rc1).


application/octet-stream -> application/x-tar           826
multipart/appledouble -> application/x-tar              701
image/x-tga -> application/x-tar                        322
image/vnd.microsoft.icon -> application/x-tar           312
application/vnd.iccprofile -> application/x-tar         221
video/mp4 -> application/x-tar                          177
audio/mpeg -> application/x-tar                          59
video/x-m4v -> application/x-tar                         59
application/x-font-printer-metric -> application/x-tar   36
audio/mp4 -> application/x-tar                           25
application/x-tex-tfm -> application/x-tar               18
image/x-pict -> application/x-tar                        15
image/png -> application/x-tar                            8
text/plain; charset=ISO-8859-1 -> application/x-tar       8
application/x-endnote-style -> application/x-tar          7
application/x-font-ttf -> application/x-tar               6




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
