Port mime type framework to use Tika mime detection framework
-------------------------------------------------------------
Key: NUTCH-562
URL: https://issues.apache.org/jira/browse/NUTCH-562
Project: Nutch
Issue Type: Improvement
Components: mime_type_detector
Affects Versions: 1.0.0
Environment: Mac Book Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS
X 10.4, although the improvement is independent of the environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release
candidate, I think it would be a good time to patch Nutch to use Tika's mime
detection system (an improvement over the existing Nutch one written primarily
by Jerome). Tika's mime system is based on the mime system from Freedesktop.org
and includes several improvements over the existing Nutch mime system, such as:
1. reliable XML-based content detection (a long-standing problem in Nutch),
including the ability to distinguish between RSS, Atom, generic XML, etc.
2. mime magic pattern matching, including support for multiple patterns
3. glob pattern matching, including support for more than one glob
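To make the three points above concrete, here is a rough, self-contained Java sketch of how magic matching, XML root-element detection, and glob matching could be combined. It does not use the Tika or Nutch APIs: the class name MimeSketch, the detect method, the lookup tables, and the order in which the checks run are all assumptions made for illustration, and the tables hold only a handful of sample entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch only; not the Tika or Nutch implementation. */
public class MimeSketch {
    // Sample glob table: more than one glob may map to the same type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Sample magic table: more than one byte prefix may map to the same type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Sample table mapping an XML root element's local name to a type.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();

    static {
        GLOBS.put("*.jpg", "image/jpeg");
        GLOBS.put("*.jpeg", "image/jpeg");
        GLOBS.put("*.pdf", "application/pdf");
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("GIF87a", "image/gif");
        MAGIC.put("GIF89a", "image/gif");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // First start tag, with an optional namespace prefix; group 1 is the local name.
    // "<?xml" and "<!--" do not match because '<' must be followed by a name character.
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    public static String detect(String name, byte[] data) {
        // Only the first 512 bytes are inspected (an arbitrary limit for this sketch).
        String head = new String(data, 0, Math.min(data.length, 512),
                                 StandardCharsets.ISO_8859_1);

        // 1. Magic: compare the start of the content against each known prefix.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (head.startsWith(e.getKey())) {
                return e.getValue();
            }
        }

        // 2. XML: require an XML declaration, then refine by root element.
        if (head.trim().startsWith("<?xml")) {
            Matcher m = ROOT.matcher(head);
            if (m.find()) {
                return XML_ROOTS.getOrDefault(m.group(1), "application/xml");
            }
            return "application/xml";
        }

        // 3. Glob: only globs of the form "*.ext" are handled in this sketch.
        if (name != null) {
            String lower = name.toLowerCase();
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey().substring(1))) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream";
    }
}
```

With these sample tables, a document whose root element is rss and one whose root element is feed get different types even if both are named feed.xml, which is the RSS/Atom/XML distinction from point 1. A real detector would also need to handle byte-order marks, magic patterns at offsets other than zero, and start tags that appear inside comments, none of which this sketch attempts.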
I'll get together a patch and then attach it to the list once it's relatively
stable.