[ 
https://issues.apache.org/jira/browse/TIKA-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820020#comment-16820020
 ] 

Tim Allison commented on TIKA-2853:
-----------------------------------

Table of entry names and some entry metadata from our ~500k zips:
http://162.242.228.174/share/zips.txt.gz

> Consider applying NaiveBayes or similar simple ML to streaming zip detector
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2853
>                 URL: https://issues.apache.org/jira/browse/TIKA-2853
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Whether we use actual ML or build rules from patterns we see in the data, it
> would be useful to gather features from field names, directory names, etc. of
> zipfile-based file types from our regression corpus to (potentially) improve
> the efficiency of MIME detection.
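
As a rough illustration of the idea in the description, the following is a
minimal sketch of a multinomial Naive Bayes classifier over zip entry names.
It is not Tika code. The feature scheme (dir:/name:/ext: tokens), the class
and function names, the short type labels, and the four-row training set are
all assumptions made for the example; a real model would be trained on the
entry-name table from the regression corpus.

```python
import math
from collections import Counter, defaultdict


def entry_features(entry_name):
    """Split one zip entry name into directory, file-name and extension tokens."""
    parts = entry_name.lower().split("/")
    feats = ["dir:" + d for d in parts[:-1] if d]
    name = parts[-1]
    if name:  # empty for directory entries such as "word/"
        feats.append("name:" + name)
        if "." in name:
            feats.append("ext:" + name.rsplit(".", 1)[1])
    return feats


class NaiveBayesZipDetector:
    """Hypothetical detector: multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.doc_counts = Counter()              # label -> number of training zips
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def train(self, entry_names, label):
        self.doc_counts[label] += 1
        for entry in entry_names:
            for feat in entry_features(entry):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)

    def scores(self, entry_names):
        """Return a log-probability score per label for the entries seen so far."""
        feats = [f for e in entry_names for f in entry_features(e)]
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            logp = math.log(n_docs / total_docs)
            for feat in feats:
                if feat in self.vocab:  # features never seen in training are skipped
                    logp += math.log((counts[feat] + self.alpha) / denom)
            result[label] = logp
        return result

    def detect(self, entry_names):
        scored = self.scores(entry_names)
        return max(scored, key=scored.get)


# Toy training data (made up for illustration, not taken from the corpus).
TRAINING = [
    (["[Content_Types].xml", "_rels/.rels", "word/document.xml", "word/styles.xml"], "docx"),
    (["[Content_Types].xml", "_rels/.rels", "xl/workbook.xml", "xl/worksheets/sheet1.xml"], "xlsx"),
    (["META-INF/MANIFEST.MF", "com/example/Main.class"], "jar"),
    (["mimetype", "META-INF/container.xml", "OEBPS/content.opf"], "epub"),
]

detector = NaiveBayesZipDetector()
for names, label in TRAINING:
    detector.train(names, label)

# In a streaming setting, only the entries read so far would be passed in.
print(detector.detect(["[Content_Types].xml", "word/document.xml"]))
```

Because scores() accepts any prefix of the entry list, a streaming detector
could call it after each entry and stop once one label is far enough ahead;
what margin counts as "far enough" is left open here.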



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
