[ 
https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-562.
-------------------------------------

    Resolution: Fixed

- Applied the patch, with minor changes to use the static version of the Tika 
MimeUtils interface and to instantiate it only once per object family
- Tested on a small crawl of apache.org sites; the mime type was set appropriately

> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
>                 Key: NUTCH-562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-562
>             Project: Nutch
>          Issue Type: Improvement
>          Components: mime_type_detector
>    Affects Versions: 1.0.0
>         Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
> X 10.4, although the improvement is independent of the environment
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release 
> candidate, I think it would be a good time to patch Nutch to use Tika's mime 
> detection system (an improvement over the existing Nutch one written 
> primarily by Jerome). Tika's mime system is based on the mime system from 
> Freedesktop.org and includes several improvements over the existing Nutch 
> mime system such as:
> 1. reliable XML-based content detection (a clear issue plaguing Nutch for 
> some time now), with the ability to distinguish between RSS, XML, Atom, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matching, with support for more than one pattern
> I'll get together a patch and then attach it to the list once it's relatively 
> stable.
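
For readers unfamiliar with the three mechanisms listed in the description, 
here is a minimal, self-contained sketch of how magic patterns, glob patterns, 
and XML root-element inspection could be layered. It is an illustration only, 
not Tika's code or API: the class MimeSketch, its lookup tables, and the 
detect method are all hypothetical names, and the tables hold just a handful 
of entries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; none of these names come from Tika or Nutch.
final class MimeSketch {
    // Magic prefixes; several patterns may map to the same type.
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Glob patterns, reduced here to lower-case file-name suffixes.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // XML root element names used to refine a generic XML match.
    private static final Map<String, String> XML_ROOTS = new LinkedHashMap<>();
    // First start tag, skipping "<?xml ...?>" and "<!...>" constructs;
    // group 1 is the local name without any namespace prefix.
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("GIF87a", "image/gif");
        MAGIC.put("GIF89a", "image/gif");
        MAGIC.put("<?xml", "application/xml");
        GLOBS.put(".jpg", "image/jpeg");
        GLOBS.put(".jpeg", "image/jpeg");
        GLOBS.put(".xml", "application/xml");
        XML_ROOTS.put("rss", "application/rss+xml");
        XML_ROOTS.put("feed", "application/atom+xml");
    }

    // name: file name or URL path (may be null); head: leading bytes of the content.
    static String detect(String name, byte[] head) {
        // ISO-8859-1 maps each byte to one char, so prefixes compare byte for byte.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        String type = null;
        // 1. Magic: the content itself is the strongest evidence.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (text.startsWith(e.getKey())) {
                type = e.getValue();
                break;
            }
        }
        // 2. Glob: fall back to the file name.
        if (type == null && name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    type = e.getValue();
                    break;
                }
            }
        }
        // 3. XML refinement: use the root element to tell RSS and Atom from plain XML.
        if ("application/xml".equals(type)) {
            Matcher m = ROOT.matcher(text);
            if (m.find()) {
                type = XML_ROOTS.getOrDefault(m.group(1), type);
            }
        }
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\">"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("feed.xml", rss)); // application/rss+xml
    }
}
```

The ordering (magic first, then glob, then root-element refinement) is a 
design choice made for this sketch; a real detector may weigh the evidence 
differently, for example by pattern priority.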

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
