Folks,

Either way is fine with me. I committed the patch for the following reasons:

1. Though the patch sat for around 36 hrs, the JIRA issue had been open for nearly 2 weeks without any comment at all. I used this as a baseline for relative interest in the patch. Though a patch file is ultimately the means by which contributions are judged, I had pretty much laid out the plan in the JIRA issue: port Nutch to use the Tika mime system, the Tika mime system provides X, Y, Z that Nutch doesn't, etc. This described the ultimate intent of the code that was soon to be reified.

2. Similarity of the Tika mime API to the existing Nutch mime API. The core classes of the API in both Tika and the old mime system in Nutch are 90% the same (in some cases, like MimeTypes.java, the file is nearly identical). This is not incidental: Jerome wrote the majority of both code bases. This made it easier for me to accept that the API would work as expected.

3. My experience testing the patch on small crawls against subsets of the apache.org sites. I was primarily looking for 2 things:

   a. performance -- there wasn't a significant hit that I could notice while observing crawl time anecdotally.

   b. effectiveness -- were mime types still being set in the metadata, were the right parsers getting called, etc.? The answer here was "yes".

I'm sure that this is more of a procedural issue than anything else. Because of this I'm happy to revert the patch. My +1 for it, in fact. Then I'll happily wait for other folks to test it and provide feedback.

I can't promise I'll get to updating it and committing revised versions of it back to the sources right away, though: the rest of my week is actually very busy. That was another reason for my desire to contribute the patch and commit it over the past weekend. It was the only time in the next week or so that I would have to get it into the sources and to solve some issues that have been plaguing Nutch for a while, e.g., reliable content type detection in the case of XML/RDF/RSS files.

In any case, let me know what you decide.

Chris

On 10/9/07 1:57 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris A. Mattmann (JIRA) wrote:
>> [
>> https://issues.apache.org/jira/browse/NUTCH-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Chris A. Mattmann closed NUTCH-562.
>> -----------------------------------
>>
>>
>> - Patch applied to trunk in r583016
>
> I think this issue didn't get enough attention before it was committed.
> I agree with the direction of this patch - functionality-wise the mime
> type detector in Tika is clearly superior to the one that we have now in
> Nutch - but I feel that the use of an external framework, which is not
> yet released, should be discussed first, and the proper working of the
> patch should be confirmed by other users. There was too little time to
> do this before the commit.
>
> I vote for reverting this patch, unless there is an overall consensus
> among Nutch developers that it's ok to keep it as it is - on one hand
> considering the added functionality and simplification of Nutch code,
> and on the other hand considering the (lack of) maturity of Tika.
