[ 
https://issues.apache.org/jira/browse/TIKA-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gregory Lepore updated TIKA-4074:
---------------------------------
    Description: 
The TeX Virtual Font (VF) format occurs 6,047 times in the second most recent 
Common Crawl dataset (and over 3,000 times in the latest set). It has no known 
MIME type. The magic is:

F7CA{9}F300{4}0010 at offset 0.

This signature catches most TeX VF files, although some will be missed. There 
were no false positives, so I think it is a good compromise that catches the 
majority of sample files.

It would be nice to see the results of additional testing.
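
For anyone running additional tests, the signature can be approximated with a 
small Python check. This is only a minimal sketch, not Tika's detector: the 
names VF_MAGIC, looks_like_tex_vf, and scan are hypothetical, and it assumes 
each {n} group means "any n bytes", so the pattern reads F7 CA, 9 arbitrary 
bytes, F3 00, 4 arbitrary bytes, 00 10 (19 bytes in total).

```python
import re

# Assumed reading of the proposed magic F7CA{9}F300{4}0010 at offset 0:
# F7 CA, any 9 bytes, F3 00, any 4 bytes, 00 10.
# re.DOTALL lets "." match every byte value, including 0x0A.
VF_MAGIC = re.compile(rb"\xF7\xCA.{9}\xF3\x00.{4}\x00\x10", re.DOTALL)

# Total signature length in bytes: 2 + 9 + 2 + 4 + 2.
MAGIC_LENGTH = 19


def looks_like_tex_vf(data: bytes) -> bool:
    """Return True if data begins with the proposed TeX VF signature."""
    # Pattern.match anchors at the start of the buffer, i.e. offset 0.
    return VF_MAGIC.match(data) is not None


def scan(paths):
    """Yield (path, matched) for each file, reading only the leading bytes."""
    for path in paths:
        with open(path, "rb") as fh:
            yield path, looks_like_tex_vf(fh.read(MAGIC_LENGTH))
```

Running scan over a directory of candidate files gives a quick count of hits 
and misses. If the usual VF preamble layout applies (opcode 247, id byte 202, 
a one-byte comment length, the comment itself, then a 4-byte checksum and a 
4-byte design size), the fixed 9-byte gap only fits files whose comment is 
empty, which may account for the files that are missed.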

  was:
The TeX Virtual Font format occurs 6,047 times in the second most recent Common 
Crawl dataset. No known mime type. The magic is:

F7CA{9}F300{4}0010 at offset 0.

The above signature will catch most TeX vf files, however some will be missed. 
However, there were no false positives so I think it's a good compromise to 
catch the majority of sample files.

 

It would be nice to see the results of additional testing.


> Add magic for TeX Virtual Font format
> -------------------------------------
>
>                 Key: TIKA-4074
>                 URL: https://issues.apache.org/jira/browse/TIKA-4074
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: aebx10.vf, aebx12.vf, aebxsl10.vf
>
>
> The TeX Virtual Font (VF) format occurs 6,047 times in the second most 
> recent Common Crawl dataset (and over 3,000 times in the latest set). It has 
> no known MIME type. The magic is:
>
> F7CA{9}F300{4}0010 at offset 0.
>
> This signature catches most TeX VF files, although some will be missed. 
> There were no false positives, so I think it is a good compromise that 
> catches the majority of sample files.
>
> It would be nice to see the results of additional testing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
