[ https://issues.apache.org/jira/browse/TIKA-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820042#comment-16820042 ]

Tim Allison commented on TIKA-2853:
-----------------------------------

@beet_keeper on Twitter
(https://twitter.com/beet_keeper/status/1118492595406110725) recommended
https://www.nationalarchives.gov.uk/documents/container-signature-20180920.xml
as a useful resource for zip and OLE2 container file patterns.


> Consider applying NaiveBayes or similar simple ML to streaming zip detector
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2853
>                 URL: https://issues.apache.org/jira/browse/TIKA-2853
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Whether we use actual ML or build rules from patterns we see in the data, it
> would be useful to gather features from field names, directory names, etc. of
> zip-based file types from our regression corpus to (potentially) improve
> the efficiency of mime detection.
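
To make the idea in the description concrete, here is a minimal sketch in
Python, not Tika code. It turns zip entry names into directory and file-name
features and feeds them to a small multinomial Naive Bayes with add-one
smoothing. The names entry_features, NaiveBayes and make_zip, the "dir:" and
"file:" feature prefixes, and the labels "docx" and "xlsx" are all
hypothetical and chosen for illustration. The sketch reads the whole archive
through zipfile, so it only approximates what a streaming detector would see.

```python
import io
import math
import zipfile
from collections import Counter, defaultdict


def entry_features(data):
    """Turn the entry names of a zip container into string features."""
    features = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # One feature per directory component, e.g. "dir:word".
            features.extend("dir:" + p for p in parts[:-1] if p)
            # One feature for the file name itself (absent for directory entries).
            if parts[-1]:
                features.append("file:" + parts[-1])
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()                 # label -> training files seen
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def train(self, label, features):
        self.doc_counts[label] += 1
        self.feature_counts[label].update(features)
        self.vocab.update(features)

    def predict(self, features):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(n_docs / total_docs)
            for feature in features:
                score += math.log((counts[feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def make_zip(names):
    """Build an in-memory zip with empty entries (stand-in for corpus files)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


model = NaiveBayes()
model.train("docx", entry_features(make_zip(
    ["[Content_Types].xml", "word/document.xml", "word/styles.xml"])))
model.train("xlsx", entry_features(make_zip(
    ["[Content_Types].xml", "xl/workbook.xml", "xl/worksheets/sheet1.xml"])))

unknown = make_zip(["[Content_Types].xml", "word/document.xml"])
print(model.predict(entry_features(unknown)))  # prints "docx"
```

In practice the training data would come from the regression corpus rather
than two hand-built archives, and the same features could just as well be
used to derive hand-written rules instead of a trained model.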



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
