Hi Doug, I just noticed this comment from your original email:
> First, the ParserFactory sometimes uses LOG.severe() which causes the
> Fetcher to exit. Is there a reason this cannot be LOG.warning()?
> LOG.severe() should only be used if you intend the application to exit.
> This configuration problem does not seem to warrant that. And I'm getting
> it with the default settings when an application/pdf is encountered.

In fact, I can't speak for Jerome and Sebastien, but I actually intended
the application to exit in this case. Here is a snippet, taken from:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

///////////////////////////////////////////////////////////////////////////
If an activated parse plugin is not listed in the parse-plugins.xml, then
it won't get called for parsing. The purpose of the parse-plugins.xml file
would be to map parsing-plugin to contentType. Therefore, if an activated
plugin is not mapped to a content type, then it is "activated", but won't
get called. This is very similar to Apache HTTPD. See below:

  //httpd.conf example

  //add handler for php
  LoadModule php4_module libexec/httpd/libphp4.so

  // map handler to mimeType
  AddType application/x-httpd-php .php
  AddType application/x-httpd-php-source .phps
  AddHandler php-script php
  AddHandler php-script phps

There are two different levels in the above example. First, the plugin is
"activated" in the LoadModule section. Then, the plugin is "mapped" to a
content type in the AddHandler section. We believe that this is the way to
go. Apache HTTPD is pervasive, and its model is well understood by many of
the same folks who would want to use Nutch. Although we realize that this
is a change from the way that Nutch currently works, and that people don't
like change, we believe that this change is entirely needful and
represents something that Nutch should adopt.
///////////////////////////////////////////////////////////////////////

The above case you mention with respect to the application/pdf documents
happens because in the parse-plugins.xml file there is a mapping of the
parse-pdf plugin to the "application/pdf" mimeType, even though the
parse-pdf plugin isn't activated by default via the plugin.includes
property (note, this is the opposite case of the snippet that I pasted
from the ImprovementProposal off the Wiki above). Therein lies the problem.

My idea was that, similar to the above case in Apache HTTPD, if you map an
"unactivated plugin" to a mimeType via parse-plugins.xml, then really,
there is a configuration error there. I think that this is a LOG.severe()
configuration error because you really need to "activate" a plugin before
you "map it" to a mime type. For example, why would you want to run a
fetch if you have plugins mapped to mimeTypes via parse-plugins.xml that
will never get called because they have never been activated?

Before I run a fetch, I want to make sure of two important things:

1. I have enabled the entire set of appropriate parse plugins for the
   content that I want to fetch
2. I've mapped the enabled parsing plugins to the mimeTypes that they can
   deal with (in order of preference)

If I ensure that I do both of these things, then we're fine in the above
case you mention with the PDF files.

Now, I know that this is a somewhat different process than what people are
used to with Nutch. Totally understandable. But I think that the
improvements that are reaped in the ParserFactory by doing it this way far
outweigh the inconvenience of ensuring consistency between the
plugin.includes property in nutch-default.xml and the parse-plugins.xml
file.

Of course, there is another issue.
The current code committed in the trunk causes the fetcher to exit right
out of the box for certain content types, because, as far as I can tell,
the only enabled parse plugins out of the box are:

  parse-(text|html|js)

I guess this is really a design issue in Nutch. Is there really any reason
that the rest of the parsing plugins aren't enabled by default? I mean, I
guess you guys want to go with the "smallest set" of parsing plugins that
makes Nutch a functional search engine out of the box, no? If so, then I
understand only having these parsing plugins enabled. But for instance, I
would say that many of the other parsing plugins, being committed to the
trunk and included in existing Nutch releases so far (e.g.,
parse-ext|mp3|mspowerpoint|msword|pdf|rss|rtf), are tested enough to be
enabled by default, right?

If the answer to that lies in a requirement similar to what I mentioned,
i.e., you want to go with the "smallest set" of parse plugins out of the
box, then there are two ways to deal with what's in trunk:

1. What you suggested: changing the LOG level to warning instead of
   SEVERE. This alleviates the out-of-the-box functionality problem, but
   also opens up a problem where a user will wonder why the PDF content
   that he tried to fetch didn't get parsed, even though it was mapped
   correctly in parse-plugins.xml (but not enabled via plugin.includes).

Or

2. Enabling the committed plugins by default in plugin.includes in
   conf/nutch-default.xml in the trunk, or at least enabling by default
   all the plugins which are currently listed in parse-plugins.xml in the
   trunk, which are:

   parse-text|msword|pdf|rss|msexcel|mspowerpoint|zip|js|rtf|html|ext

Of course, it's up to you guys what you want to do; that's just my two
cents.
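
To make the "activate first, then map" check concrete, here is a minimal
sketch of it. This is not code from the trunk: the class and method names
are hypothetical, the mappings are passed in as a plain Map rather than
read from parse-plugins.xml, and it assumes that plugin.includes is
treated as a regular expression over plugin ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): cross-checks the two configuration levels. */
final class ParsePluginConfigCheck {

    /**
     * Returns one message per plugin that is mapped to a mimeType but does
     * not match the plugin.includes expression, i.e. mapped but not activated.
     * Assumption: pluginIncludes is a regular expression over plugin ids.
     */
    static List<String> findUnactivatedMappings(String pluginIncludes,
                                                Map<String, List<String>> mimeTypeToPlugins) {
        Pattern activated = Pattern.compile(pluginIncludes);
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : mimeTypeToPlugins.entrySet()) {
            for (String pluginId : entry.getValue()) {
                // The whole plugin id must match the expression to count as activated.
                if (!activated.matcher(pluginId).matches()) {
                    problems.add(pluginId + " is mapped to " + entry.getKey()
                            + " but is not activated");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Only the parse plugins named in this email; a real plugin.includes
        // value would list other plugin types as well.
        String pluginIncludes = "parse-(text|html|js)";

        // Illustrative mappings, in the spirit of parse-plugins.xml.
        Map<String, List<String>> mappings = new LinkedHashMap<>();
        mappings.put("text/html", Arrays.asList("parse-html"));
        mappings.put("application/pdf", Arrays.asList("parse-pdf"));

        for (String problem : findUnactivatedMappings(pluginIncludes, mappings)) {
            System.out.println(problem);
        }
    }
}
```

Whether a non-empty result should stop the fetch (the LOG.severe()
behaviour) or only be reported (the LOG.warning() behaviour) is exactly
the choice between the two options above; the check itself is the same
either way.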
Take care,
Chris

> Enhance ParserFactory plugin selection policy
> ---------------------------------------------
>
>          Key: NUTCH-88
>          URL: http://issues.apache.org/jira/browse/NUTCH-88
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Versions: 0.7, 0.8-dev
>     Reporter: Jerome Charron
>     Assignee: Jerome Charron
>      Fix For: 0.8-dev
>
> The ParserFactory chooses the Parser plugin to use based on the
> content-types and path-suffix defined in the parsers' plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the
> end of the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is
> the empty string is used.
> This policy has a lot of problems when no match is found, because a
> random parser is used (and there is a good chance that this parser can't
> handle the content).
> On the other hand, the content-type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND the code of each parser itself checks whether the
> content-type is ok (it uses a hard-coded content-type value, and does not
> use the value specified in the plugin.xml => possibility of mismatches
> between the hard-coded content-type and the content-type declared in
> plugin.xml).
> A complete list of problems and discussion about this point is available
> in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
>
> --
> This message is automatically generated by JIRA.
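
For reference, the old selection policy described in the quoted issue can
be sketched roughly as follows. This is an illustration only, not the
actual ParserFactory code: the ParserDecl class and the sample
declarations are hypothetical, and it assumes that an empty contentType
means "none declared" and that an empty pathSuffix is only considered in
the last step.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the three-step policy from the quoted issue. */
final class LegacyParserSelection {

    /** Stand-in for the contentType/pathSuffix attributes declared in a parser's plugin.xml. */
    static final class ParserDecl {
        final String id;
        final String contentType;
        final String pathSuffix;

        ParserDecl(String id, String contentType, String pathSuffix) {
            this.id = id;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    /** Returns the id of the selected parser, or null if no step selects one. */
    static String select(List<ParserDecl> parsers, String contentType, String urlPath) {
        // Step 1: the first plugin whose contentType matches the beginning of the content's type.
        for (ParserDecl p : parsers) {
            if (!p.contentType.isEmpty() && contentType.startsWith(p.contentType)) {
                return p.id;
            }
        }
        // Step 2: the first plugin whose (non-empty) pathSuffix matches the end of the url's path.
        for (ParserDecl p : parsers) {
            if (!p.pathSuffix.isEmpty() && urlPath.endsWith(p.pathSuffix)) {
                return p.id;
            }
        }
        // Step 3: the first plugin whose pathSuffix is the empty string,
        // regardless of whether it can handle the content.
        for (ParserDecl p : parsers) {
            if (p.pathSuffix.isEmpty()) {
                return p.id;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative declarations; the attribute values are made up for this example.
        List<ParserDecl> parsers = Arrays.asList(
                new ParserDecl("parse-text", "text/plain", "txt"),
                new ParserDecl("parse-html", "text/html", ""),
                new ParserDecl("parse-pdf", "application/pdf", "pdf"));

        System.out.println(select(parsers, "text/html; charset=utf-8", "/index.html"));
        System.out.println(select(parsers, "application/zip", "/archive.zip"));
    }
}
```

With these sample declarations the second call falls through to step 3
and returns parse-html for a zip file, which is the "random parser"
problem the issue describes.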
