[ https://issues.apache.org/jira/browse/TIKA-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820020#comment-16820020 ]
Tim Allison commented on TIKA-2853:
-----------------------------------
Table of entry names and some entry metadata from our ~500k zips:
http://162.242.228.174/share/zips.txt.gz
> Consider applying NaiveBayes or similar simple ML to streaming zip detector
> ---------------------------------------------------------------------------
>
> Key: TIKA-2853
> URL: https://issues.apache.org/jira/browse/TIKA-2853
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Whether we use actual ML or build rules from patterns we see in the data, it
> would be useful to gather features from field names, directory names, etc. of
> zip-based file types from our regression corpus to (potentially) improve
> the efficiency of MIME detection.
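
To make the idea in the description concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over tokens derived from zip entry names. It is not Tika's streaming zip detector and is not taken from the issue: the names `entry_features` and `NaiveBayesZipDetector`, the `dir:`/`file:`/`ext:` feature scheme, the Laplace smoothing constant, and the three toy training examples are all assumptions made for illustration. A real experiment would train on the entry-name table from the regression corpus instead.

```python
import math
import re
from collections import Counter, defaultdict


def entry_features(entry_names):
    """Turn zip entry names into directory, file-name, and extension tokens."""
    feats = []
    for name in entry_names:
        parts = [p for p in name.lower().split("/") if p]
        if not parts:
            continue
        feats.extend("dir:" + d for d in parts[:-1])
        feats.append("file:" + parts[-1])
        m = re.search(r"\.([a-z0-9]+)$", parts[-1])
        if m:
            feats.append("ext:" + m.group(1))
    return feats


class NaiveBayesZipDetector:
    """Hypothetical multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()               # zips seen per MIME type
        self.feature_counts = defaultdict(Counter)  # token counts per MIME type
        self.vocab = set()

    def train(self, entry_names, mime):
        self.class_counts[mime] += 1
        for f in entry_features(entry_names):
            self.feature_counts[mime][f] += 1
            self.vocab.add(f)

    def predict(self, entry_names):
        # Tokens never seen in training carry no information, so skip them.
        feats = [f for f in entry_features(entry_names) if f in self.vocab]
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for mime, n in self.class_counts.items():
            counts = self.feature_counts[mime]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = mime, score
        return best


# Toy training data (made up for this sketch, not drawn from the corpus).
DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
JAR = "application/java-archive"

detector = NaiveBayesZipDetector()
detector.train(["[Content_Types].xml", "_rels/.rels", "word/document.xml"], DOCX)
detector.train(["[Content_Types].xml", "_rels/.rels", "xl/workbook.xml"], XLSX)
detector.train(["META-INF/MANIFEST.MF", "com/example/Main.class"], JAR)

print(detector.predict(["META-INF/MANIFEST.MF", "org/foo/Bar.class"]))
```

Because the features come only from entry names, a classifier like this could in principle score a zip while its entries are still being streamed, without reading entry contents. Whether that beats hand-written rules is exactly the open question in the description.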