[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756051#comment-13756051
 ] 

Andrew Jackson commented on TIKA-1170:
--------------------------------------

My corpus is a chunk of the Internet Archive, so you can look at the CGMs I'm 
finding:

* [all 
copies|http://web.archive.org/web/20000401000000*/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm],
 or a [specific copy| 
http://web.archive.org/web/20000226055607/http://www.agocg.ac.uk/Graphics/CGM/RALCGM/sample.cgm].
** Those example files now seem to be at 
http://www.agocg.ac.uk/train/cgm/examples/cgmindex.htm
* or [this specific 
item|http://web.archive.org/web/20050223100939/http://wwwcms.brookes.ac.uk:80/webmsc2004/p00770/cgms/flyboat.cgm]
 from [this folder 
here|http://web.archive.org/web/20050112031156/http://wwwcms.brookes.ac.uk/webmsc2004/p00770/cgms/]
* I also found these, but have not checked whether any are binary: 
http://www.fileformat.info/format/cgm/sample/index.htm

Unfortunately, the licensing may not be clear in these cases, so these test 
files may not be suitable. If anyone knows of any software that can write 
binary CGM files, I'm willing to give it a go.
                
> Insufficiently specific magic for binary image/cgm files
> --------------------------------------------------------
>
>                 Key: TIKA-1170
>                 URL: https://issues.apache.org/jira/browse/TIKA-1170
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is:
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. it matches files that start with 0x002a. To be fair, this is only 
> c. 700 false matches out of >300 million resources, but it would be nice if 
> this could be tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable-position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0">
>         <match value="0x10220001" type="string" offset="2:64"/>
>         <match value="0x10220002" type="string" offset="2:64"/>
>         <match value="0x10220003" type="string" offset="2:64"/>
>         <match value="0x10220004" type="string" offset="2:64"/>
>       </match>
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?
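
As a rough illustration of the difference between the two rules quoted above, 
here is a minimal Python sketch. It does not use Tika at all: the function 
names loose_match and tight_match are made up for this example, and it assumes 
that offset="2:64" means the nested marker may start at any offset from 2 to 
64 inclusive. The sample bytes are hand-built, not taken from a real CGM file.

```python
# Sketch only: mimics the two magic rules in plain Python, not Tika's matcher.

def loose_match(data: bytes) -> bool:
    """Current rule: first two bytes, masked with 0xffe0, equal 0x0020."""
    if len(data) < 2:
        return False
    word = int.from_bytes(data[:2], "big")
    return (word & 0xFFE0) == 0x0020


# The four version markers proposed in the issue: 0x10220001 .. 0x10220004.
VERSION_MARKERS = [bytes([0x10, 0x22, 0x00, v]) for v in (1, 2, 3, 4)]


def tight_match(data: bytes, start: int = 2, end: int = 64) -> bool:
    """Proposed rule: loose match plus a version marker starting in start..end.

    Assumption: the range is inclusive at both ends, so a marker that
    starts at `end` still needs its 4 bytes to be inside the window.
    """
    if not loose_match(data):
        return False
    window = data[start:end + 4]
    return any(marker in window for marker in VERSION_MARKERS)


# A non-CGM file that happens to start with 0x002a (the false-positive case).
false_positive = bytes([0x00, 0x2A]) + b"arbitrary non-CGM payload"

# Hand-built header: 0x0025, a short name field, then the 0x10220001 marker.
cgm_like = bytes([0x00, 0x25, 0x04]) + b"demo" + bytes([0x00, 0x10, 0x22, 0x00, 0x01])

print(loose_match(false_positive), tight_match(false_positive))
print(loose_match(cgm_like), tight_match(cgm_like))
```

The first sample passes the masked check but fails once a version marker is 
required, while the second passes both. Whether 64 bytes is a wide enough 
window still depends on the filename-length assumption made in the issue.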

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
