[
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ray Gauss II resolved TIKA-1170.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.5
Added in r1519664.
Thanks!
> Insufficiently specific magic for binary image/cgm files
> --------------------------------------------------------
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.4
> Reporter: Andrew Jackson
> Assignee: Ray Gauss II
> Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch,
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
> <match value="BEGMF" type="string" offset="0"/>
> <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
> {code}
> The issue seems to be that the second magic matcher is not very specific:
> the 0xffe0 mask ignores the low five bits, so any file starting with a value
> from 0x0020 to 0x003f matches, e.g. files that start with 0x002a. To be fair,
> this is only c. 700 false matches out of >300 million resources, but it would
> be nice if this could be tightened up.
> Looking at the PRONOM signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each
> version. Therefore, a more robust signature should be:
> {code}
> <match value="BEGMF" type="string" offset="0"/>
> <match value="0x0020" mask="0xffe0" type="string" offset="0">
>   <match value="0x10220001" type="string" offset="2:64"/>
>   <match value="0x10220002" type="string" offset="2:64"/>
>   <match value="0x10220003" type="string" offset="2:64"/>
>   <match value="0x10220004" type="string" offset="2:64"/>
> </match>
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64
> characters long.
> Could this magic be considered for inclusion?
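
For illustration, here is a rough Python sketch of how the original and the tightened magic behave on a file that starts with 0x002a. It is not Tika's implementation: the function names and sample bytes are made up for this example, and it assumes that a masked match compares each byte after ANDing with the mask and that the "2:64" offset range is inclusive at both ends.

```python
# Sketch only: mimics the two magic rules from the issue, not Tika's own code.

# The four version markers from the proposed magic (0x10220001 .. 0x10220004).
VERSION_MARKERS = [bytes.fromhex("1022000%d" % n) for n in range(1, 5)]


def masked_match(data: bytes, value: bytes, mask: bytes, offset: int = 0) -> bool:
    """True if data at `offset` equals `value` once both are ANDed with `mask`."""
    window = data[offset:offset + len(value)]
    if len(window) < len(value):
        return False
    return all((b & m) == (v & m) for b, v, m in zip(window, value, mask))


def found_in_range(data: bytes, needle: bytes, start: int, end: int) -> bool:
    """True if `needle` begins at any offset in [start, end] (assumed inclusive)."""
    return any(data[i:i + len(needle)] == needle for i in range(start, end + 1))


def old_magic(data: bytes) -> bool:
    """Original rule: 'BEGMF' at offset 0, or the masked 0x0020 at offset 0."""
    return data.startswith(b"BEGMF") or masked_match(data, b"\x00\x20", b"\xff\xe0")


def new_magic(data: bytes) -> bool:
    """Proposed rule: the masked match also needs a version marker at 2..64."""
    if data.startswith(b"BEGMF"):
        return True
    if not masked_match(data, b"\x00\x20", b"\xff\xe0"):
        return False
    return any(found_in_range(data, m, 2, 64) for m in VERSION_MARKERS)


# Synthetic samples, invented for this sketch (not taken from real files).
false_positive = bytes.fromhex("002a") + b"not a CGM file at all"
cgm_like = bytes.fromhex("0026") + b"\x05hello" + bytes.fromhex("10220001")

print(old_magic(false_positive), new_magic(false_positive))
print(old_magic(cgm_like), new_magic(cgm_like))
```

With these samples the 0x002a file passes the original rule but fails the tightened one, while the synthetic CGM-like header passes both.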