Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by ChrisMattmann:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

The comment on the change is:
added expected behavior when user enables plugins, but doesn't map them to ctype

------------------------------------------------------------------------------
  === Maintaining consistency between parse-plugins.xml and nutch-default.xml activated plugins ===
  An interesting question arises in the following two examples:
- *No plugin defined in parse-plugins for a specified content-type, but many activated plugins that can deal with this content-type.
+ 1. No plugin defined in parse-plugins for a specified content-type, but many activated plugins that can deal with this content-type.
- *Many plugins defined in the parse-plugins for a specified content-type, but with the same priority
+ 2. Many plugins defined in the parse-plugins for a specified content-type, but with the same priority
+ This unfortunately is something that as developers we cannot elegantly prevent in this case - erroneous input by the user. We propose a simple way to handle this is:
+
+ For example 1 above - if the user activates many parser plugins via the plugin.includes nutch conf property, but then fails to map those plugins to a particular content type in the parse-plugins.xml, then the plugins simply won't get called for parsing. They will be enabled, but will not be returned as viable parsers in the ordered list of parsing plugins for a content type.
+
- This unfortunately is something that as developers we cannot elegantly prevent in this case - erroneous input by the user. We propose a simple way to handle this is: if the user specifies multiple parse-plugins with the same priority, then LOG.severe(), and exit. This isn't anything outside of what other systems do with bogus user input.
For instance, in Apache HTTPD, if a user specifies that .cgi files should be handled by a text-handler, ''and'' by a perl-handler, Apache HTTPD will come back, and log an error message, and exit, which we believe is the correct thing to do in that case. The parse-plugins.xml file will need to be examined by the users of the Nutch system, and they will need to ensure that they don't set the priorities for 2 different parse plugins to be the same for a particular mimeType. We propose to note this in a comment in the parse-plugins.xml file, and then also note it as a major change in the Nutch installation process.
+ For example 2 above - if the user specifies multiple parser plugin ids for a content type in parse-plugins.xml with the same priority, then LOG.severe(), and exit. This isn't anything outside of what other systems do with bogus user input. For instance, in Apache HTTPD, if a user specifies that .cgi files should be handled by a text-handler, ''and'' by a perl-handler, Apache HTTPD will come back, and log an error message, and exit, which we believe is the correct thing to do in that case. The parse-plugins.xml file will need to be examined by the users of the Nutch system, and they will need to ensure that they don't set the priorities for 2 different parse plugins to be the same for a particular mimeType. We propose to note this in a comment in the parse-plugins.xml file, and then also note it as a major change in the Nutch installation process.

  === Path Suffix Attribute in plugin.xml files and erroneous mime types returned by web servers for files ===
  Another one of the main impacts of having a file like parse-plugins.xml is that no longer should the pathSuffix="" be part of the plugin.xml descriptor. We propose to move that out of plugin.xml and into the mime-types.xml file.
Additionally, we can also "kill two birds with one stone" here and handle an oft-occurring problem users are experiencing with Nutch in terms of erroneous mime types returned by web servers for particular files. Specifically, we propose to add a MimeType Alias mapper to the mime-types.xml file that will allow us to map the standard IANA mime types to other web server returned mime types that are non-standard. These two proposed changes to mime-types.xml would look like the following:
