[ 
https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532972
 ] 

chrismattmann edited comment on NUTCH-562 at 10/7/07 8:34 AM:
------------------------------------------------------------------

Initial patch for comments:

1. This patch removes the MimeType system and its associated Java source files, 
config files and unit tests from Nutch. This functionality now lives in Tika, 
and the patch replaces it with its Tika counterparts.
2. This patch uses the unreleased 0.1-dev version of Tika. When 0.1 is 
officially released, we can convert to that, though I don't anticipate any 
MimeType API changes between now and then.
3. All unit tests for core and plugins pass; however, it's probably a good idea 
to run at least a small crawl with this patch and see if everything works fine. 
I don't really have the time for this now, so anyone want to try? (cough cough 
Dogacan cough cough ;) )
4. It's worth noting that this MimeType system from Tika changes the 
traditional Nutch mime type system (IMO for the better) in a couple of ways. 
First, whereas the old MimeType system was very happy to return null in places 
where it couldn't figure out the MimeType, this system tries to return a 
"default" MimeType (which in this case is "application/octet-stream") if it 
can't guess the mime type from those that it knows about. Second, this mime 
type system uses a different type of XML repo file -- based on the one 
available from freedesktop.org's shared MIME package. 
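
To make the first difference in item 4 concrete, here is a minimal, self-contained 
sketch of "return null" versus "return a default type". It is an illustration 
only and does not use the Tika or Nutch API: the class name, both method names 
and the small extension table are hypothetical. Only the default value, 
"application/octet-stream", comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Tika or Nutch API.
final class MimeFallbackSketch {

    /** Default returned when no known type matches, as described in item 4. */
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Hypothetical extension-to-type table standing in for a real mime repository.
    private static final Map<String, String> KNOWN = new LinkedHashMap<>();
    static {
        KNOWN.put(".html", "text/html");
        KNOWN.put(".pdf", "application/pdf");
        KNOWN.put(".xml", "application/xml");
    }

    /** Old-style lookup: returns null when the type cannot be determined. */
    static String lookupOrNull(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : KNOWN.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** New-style lookup: never returns null, falls back to the default type. */
    static String lookupOrDefault(String fileName) {
        String type = lookupOrNull(fileName);
        return type != null ? type : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(lookupOrNull("report.unknownext"));
        System.out.println(lookupOrDefault("report.unknownext"));
        System.out.println(lookupOrDefault("index.HTML"));
    }
}
```

One likely consequence, if the real code behaves like this sketch: callers that 
used to test for null to detect an unknown type would need to compare against 
the default type instead.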

Okay, so if someone gets a chance please run a small crawl with this in the 
next few days and let us know how it works. Otherwise, I'll do the same myself 
in a couple days and if there are no objections, I'd like to commit this then.

> Port mime type framework to use Tika mime detection framework
> -------------------------------------------------------------
>
>                 Key: NUTCH-562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-562
>             Project: Nutch
>          Issue Type: Improvement
>          Components: mime_type_detector
>    Affects Versions: 1.0.0
>         Environment: Mac Book Pro, Intel Core Duo 2.0 GHz, 2.0 GB RAM, Mac OS 
> X 10.4, although the improvement is independent of the environment
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: NUTCH-562.Mattmann.patch.txt, tika-0.1-dev.jar
>
>
> With Tika (http://incubator.apache.org/tika/) nearing a stable 0.1 release 
> candidate, I think it would be a good time to patch Nutch to use Tika's mime 
> detection system (an improvement over the existing Nutch one written 
> primarily by Jerome). Tika's mime system is based on the mime system from 
> Freedesktop.org and includes several improvements over the existing Nutch 
> mime system such as:
> 1. reliable XML-based content detection (a clear issue plaguing Nutch for 
> some time now), ability to delineate between RSS, XML, ATOM, etc.
> 2. mime magic pattern matching, including support for multiple patterns
> 3. glob pattern matches (ability to support more than one glob pattern)
> I'll get together a patch and then attach it to the list once it's relatively 
> stable.
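
As a rough illustration of points 2 and 3 in the description above, the sketch 
below lets each type carry several magic patterns and several glob patterns. 
It is not Tika's implementation or API: the class, the Rule type and the 
detect method are hypothetical, globs are simplified to file-name suffixes, 
magic patterns are only checked at offset 0, and trying magic before globs is 
an assumption made for the example. The JPEG and GIF signatures are the 
well-known ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is NOT the Tika API.
final class MimeRuleSketch {

    static final String DEFAULT_TYPE = "application/octet-stream";

    /** One mime type with any number of suffix globs and magic byte prefixes. */
    static final class Rule {
        final String type;
        final List<String> suffixes; // simplified globs: "*.jpg" stored as ".jpg"
        final List<byte[]> magics;   // byte sequences expected at offset 0

        Rule(String type, List<String> suffixes, List<byte[]> magics) {
            this.type = type;
            this.suffixes = suffixes;
            this.magics = magics;
        }
    }

    private static final List<Rule> RULES = Arrays.asList(
        new Rule("image/jpeg",
            Arrays.asList(".jpg", ".jpeg", ".jpe"),
            Arrays.asList(new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})),
        new Rule("image/gif",
            Arrays.asList(".gif"),
            Arrays.asList(
                "GIF87a".getBytes(StandardCharsets.US_ASCII),
                "GIF89a".getBytes(StandardCharsets.US_ASCII))));

    /** True if data begins with the given prefix bytes. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Tries every magic pattern first, then every suffix, then the default. */
    static String detect(String fileName, byte[] head) {
        for (Rule rule : RULES) {
            for (byte[] magic : rule.magics) {
                if (startsWith(head, magic)) {
                    return rule.type;
                }
            }
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Rule rule : RULES) {
            for (String suffix : rule.suffixes) {
                if (lower.endsWith(suffix)) {
                    return rule.type;
                }
            }
        }
        return DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        byte[] gifHead = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("no-extension", gifHead));
        System.out.println(detect("photo.JPEG", new byte[0]));
        System.out.println(detect("mystery.bin", new byte[] {1, 2, 3}));
    }
}
```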

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
