Port mime type framework to use Tika mime detection framework
-------------------------------------------------------------

                 Key: NUTCH-562
                 URL: https://issues.apache.org/jira/browse/NUTCH-562
             Project: Nutch
          Issue Type: Improvement
          Components: mime_type_detector
    Affects Versions: 1.0.0
         Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
X 10.4, although the improvement is independent of the environment
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Minor


With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release 
candidate, I think this is a good time to patch Nutch to use Tika's mime 
detection system in place of the existing Nutch one (written primarily by 
Jerome). Tika's mime system is based on the mime system from Freedesktop.org 
and improves on the existing Nutch mime system in several ways, including:

1. reliable XML-based content detection (a clear issue plaguing Nutch for some 
time now), including the ability to distinguish between RSS, Atom, generic 
XML, etc.
2. mime magic pattern matching, including support for multiple patterns per 
type
3. glob pattern matching, with support for more than one glob per type
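
To make the three points above concrete, here is a minimal, self-contained 
sketch of how such detection can be layered: XML root-element sniffing first, 
then magic prefixes, then filename globs. It is an illustration only and is 
not Tika's or Nutch's API; the class name MimeSketch, the detect() method, and 
the small type tables are all hypothetical, and the glob handling is 
simplified to "*.ext" suffixes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy detector; NOT the Tika or Nutch API.
final class MimeSketch {
    // Glob -> type. Several globs may map to the same type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Magic prefix -> type. Several prefixes may map to the same type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // XML root element local name -> type.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();

    static {
        GLOBS.put("*.html", "text/html");
        GLOBS.put("*.htm", "text/html");
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.xml", "application/xml");
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("<!DOCTYPE html", "text/html");
        MAGIC.put("<html", "text/html");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // First tag that starts with a name character; this skips "<?xml ...?>",
    // comments and DOCTYPE declarations. An optional namespace prefix is
    // dropped so that "<atom:feed>" yields "feed".
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    static String detect(String name, byte[] head) {
        // ISO-8859-1 maps every byte to one char, so no input is rejected.
        String text = new String(head, StandardCharsets.ISO_8859_1).trim();

        // 1. XML content: refine by root element, else fall back to plain XML.
        if (text.startsWith("<?xml")) {
            Matcher m = ROOT.matcher(text);
            if (m.find()) {
                String type = XML_ROOTS.get(m.group(1));
                if (type != null) {
                    return type;
                }
            }
            return "application/xml";
        }

        // 2. Magic: first matching prefix wins (case-insensitive).
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            String magic = e.getKey();
            if (text.regionMatches(true, 0, magic, 0, magic.length())) {
                return e.getValue();
            }
        }

        // 3. Globs: only the "*.ext" form is handled in this sketch.
        String lower = name.toLowerCase();
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey().substring(1))) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

With tables like these, a file named feed.xml is reported as RSS or Atom 
according to its root element rather than as generic XML, which is the 
distinction point 1 is about.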

I'll put together a patch and attach it to the list once it's relatively 
stable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
