[
https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann updated NUTCH-562:
------------------------------------
Attachment: tika-0.1-dev.jar
Tika 0.1 unreleased jar file -- drop this in $NUTCH_SRC_HOME/lib
> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
> Key: NUTCH-562
> URL: https://issues.apache.org/jira/browse/NUTCH-562
> Project: Nutch
> Issue Type: Improvement
> Components: mime_type_detector
> Affects Versions: 1.0.0
> Environment: MacBook Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS
> X 10.4, although the improvement is independent of the environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release
> candidate, I think it would be a good time to patch Nutch to use Tika's mime
> detection system (an improvement over the existing Nutch one written
> primarily by Jerome). Tika's mime system is based on the mime system from
> Freedesktop.org and includes several improvements over the existing Nutch
> mime system such as:
> 1. reliable detection of XML-based content (a clear issue that has plagued
> Nutch for some time now), including the ability to distinguish between RSS,
> Atom, plain XML, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matching, including support for more than one glob pattern
> I'll get together a patch and then attach it to the list once it's relatively
> stable.
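
To make the three detection strategies listed above concrete, here is a
minimal, self-contained Java sketch. It is not Tika's or Nutch's API: the
class name MimeSketch, the detect() method, the two lookup tables, and the
order in which the checks run are all assumptions made for illustration.
It checks magic prefixes first, then looks at the root element of XML
content to tell RSS and Atom apart from plain XML, and finally falls back
to glob patterns on the file name.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the Tika or Nutch API.
class MimeSketch {
    // Magic prefix -> type. A type could have several entries (multiple patterns).
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // Glob -> type. Several globs may map to the same type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // First start tag whose name begins with a letter; this skips "<?xml" and "<!--".
    // Group 1 is the local name, with any namespace prefix dropped.
    private static final Pattern ROOT =
        Pattern.compile("<(?:[A-Za-z_][\\w.-]*:)?([A-Za-z_][\\w.-]*)");

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("GIF87a", "image/gif");
        MAGIC.put("GIF89a", "image/gif");
        GLOBS.put("*.rss", "application/rss+xml");
        GLOBS.put("*.atom", "application/atom+xml");
        GLOBS.put("*.xml", "application/xml");
        GLOBS.put("*.pdf", "application/pdf");
    }

    /** Guesses a type from the file name and the first bytes of the content. */
    static String detect(String fileName, byte[] head) {
        // ISO-8859-1 maps every byte to one char, so no bytes are lost here.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        // 1. Magic pattern matching on the leading bytes.
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            if (text.startsWith(e.getKey())) {
                return e.getValue();
            }
        }

        // 2. XML content: inspect the root element to separate RSS/Atom from XML.
        if (text.trim().startsWith("<?xml")) {
            Matcher m = ROOT.matcher(text);
            if (m.find()) {
                String root = m.group(1);
                if (root.equals("rss")) return "application/rss+xml";
                if (root.equals("feed")) return "application/atom+xml";
            }
            return "application/xml";
        }

        // 3. Glob patterns on the file name as a fallback.
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            boolean hit = FileSystems.getDefault()
                .getPathMatcher("glob:" + e.getKey())
                .matches(Paths.get(fileName));
            if (hit) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] rss = "<?xml version=\"1.0\"?><rss version=\"2.0\">"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("feed.xml", rss));
        System.out.println(detect("notes.atom", new byte[0]));
    }
}
```

In this sketch a file named feed.xml is still reported as RSS when its root
element is rss, because the content check runs before the glob fallback.
A real implementation would read its patterns from a Freedesktop-style
mime-types definition rather than from hard-coded tables, and would need a
real XML parser to handle comments and other edge cases before the root
element.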
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.