Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by SebastienLeCallonnec:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal

The comment on the change is:
Corrected some typos

------------------------------------------------------------------------------
   *No plugin defined in parse-plugins for a specified content-type, but many 
activated plugins that can deal with this content-type.
   *Many plugins defined in the parse-plugins for a specified content-type, but 
with the same priority
  
- This is unfortunately is something that as developers we cannot elegantly 
prevent in this case – erroneous input by the user. We propose a simple way 
to handle this is: if the user specifies multiple parse-plugins with the same 
priority, then LOG.severe(), and exit. This isn’t anything outside of what 
other systems do with bogus user input. For instance, in Apache HTTPD, if a 
user specifies that .cgi files should be handled by a text-handler, ''and'' by 
a perl-handler, Apache HTTPD will come back, and log an error message, and 
exit, which we believe is the correct thing to do in that case. The 
parse-plugins.xml file will need to be examined by the users of the Nutch 
system, and they will need to ensure that they don’t’ set the priorities 
for 2 different parse plugins to be the same for a particular mimeType. We 
propose to note this in a comment in the parse-plugins.xml file, and then also 
note it as a major change in the Nutch installation process.
+ Unfortunately, as developers we cannot elegantly prevent this kind of 
erroneous user input. We propose a simple way to handle it: if the user 
specifies multiple parse-plugins with the same priority, then LOG.severe() and 
exit. This is no different from what other systems do with bogus user input. 
For instance, in Apache HTTPD, if a user specifies that .cgi files should be 
handled by a text-handler ''and'' by a perl-handler, Apache HTTPD will log an 
error message and exit, which we believe is the correct thing to do in that 
case. Users of the Nutch system will need to examine the parse-plugins.xml 
file and ensure that they don't set the same priority for two different parse 
plugins for a particular mimeType. We propose to note this in a comment in the 
parse-plugins.xml file, and also to note it as a major change in the Nutch 
installation process.
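
The duplicate-priority check described above could be sketched roughly as follows. This is only an illustration: the class `ParsePluginPriorityCheck`, its `Entry` type, and the in-memory map standing in for a parsed parse-plugins.xml are all assumptions, not existing Nutch code. The sketch throws an exception after logging, so that the caller decides how to exit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch: the class, method, and field names below are
// illustrative assumptions, not part of the Nutch code base.
class ParsePluginPriorityCheck {

    private static final Logger LOG =
        Logger.getLogger(ParsePluginPriorityCheck.class.getName());

    /** One (plugin id, priority) pair declared for a mime type. */
    static final class Entry {
        final String pluginId;
        final int priority;

        Entry(String pluginId, int priority) {
            this.pluginId = pluginId;
            this.priority = priority;
        }
    }

    /**
     * Returns one message for every entry whose priority was already
     * claimed by an earlier plugin for the same mime type.
     */
    static List<String> findConflicts(Map<String, List<Entry>> pluginsByMimeType) {
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, List<Entry>> e : pluginsByMimeType.entrySet()) {
            Map<Integer, String> firstPluginByPriority = new HashMap<>();
            for (Entry entry : e.getValue()) {
                String previous =
                    firstPluginByPriority.putIfAbsent(entry.priority, entry.pluginId);
                if (previous != null) {
                    conflicts.add(e.getKey() + ": " + previous + " and "
                        + entry.pluginId + " share priority " + entry.priority);
                }
            }
        }
        return conflicts;
    }

    /** Logs every conflict with LOG.severe() and fails if any exist. */
    static void validateOrFail(Map<String, List<Entry>> pluginsByMimeType) {
        List<String> conflicts = findConflicts(pluginsByMimeType);
        for (String conflict : conflicts) {
            LOG.severe(conflict);
        }
        if (!conflicts.isEmpty()) {
            // A command-line tool could call System.exit(1) here instead.
            throw new IllegalStateException(
                conflicts.size() + " conflicting priorities in parse-plugins.xml");
        }
    }
}
```

Reporting every conflict before failing lets the user fix the whole file in one pass instead of one error at a time.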
  
  === Path Suffix Attribute in plugin.xml files and erroneous mime types 
returned by web servers for files ===
- Another one of the main impacts of having a file like parse-plugins.xml is 
that no longer should the pathSuffix="" be part of the plugin.xml descriptor. 
We propose to move that out of plugin.xml and into the mime-types.xml file. 
Additionally, we can also "kill two birds with one stone" here and handle an 
oft-occuring problem users are experiencing with Nutch in terms of errorneous 
mime types returned by web servers for particular files. Specifically we 
propose to add an MimeType Alias mapper to the mime-types.xml file that will 
allow us to map the standard IANA mime types to other web server returned mime 
types that are non-standard. These two proposed changes to mime-types.xml would 
look like the following:
+ Another main impact of having a file like parse-plugins.xml is that 
pathSuffix="" should no longer be part of the plugin.xml descriptor. We 
propose to move it out of plugin.xml and into the mime-types.xml file. 
Additionally, we can "kill two birds with one stone" here and handle a 
frequently occurring problem that users experience with Nutch: erroneous mime 
types returned by web servers for particular files. Specifically, we propose 
to add a MimeType alias mapper to the mime-types.xml file that will allow us 
to associate the standard IANA mime types with the non-standard mime types 
that web servers return. These two proposed changes to mime-types.xml would 
look like the following:
  
  {{{
  
@@ -115, +115 @@

  
  ''Incompatibilities''
  
- By moving the pathSuffix out of the plugin.xml file, and into the 
mime-types.xml file, this would create an updated version of the plugin.xml 
descriptor schema for each plugin, along with an updated mime-types.xml 
descriptor schema. Additionally, storing the mime type aliases in the 
mime-types.xml file will also require an addition to the mime-types.xml schema. 
To lessen the effect on previous and near-term releases of Nutch the pathSuffix 
attribute could be left as an option in the plugin.xml schema, but marked as 
“deprecated” to let people know that this functionality isn’t part of the 
parse plugin identification process anymore, but it is left in the schema so as 
not to create incompatibilities with the plugin.xml files that people have 
already wrote. However, ultimately in future releases of Nutch, we propose that 
the pathSuffix attribute should be removed from the plugin.xml schema.
+ Moving the pathSuffix out of the plugin.xml file and into the 
mime-types.xml file would require an updated version of the plugin.xml 
descriptor schema for each plugin, along with an updated mime-types.xml 
descriptor schema. Storing the mime type aliases in the mime-types.xml file 
will also require an addition to the mime-types.xml schema. To lessen the 
effect on previous and near-term releases of Nutch, the pathSuffix attribute 
could be left as an option in the plugin.xml schema but marked as 
“deprecated”. This tells people that it is no longer part of the parse plugin 
identification process, while avoiding incompatibilities with the plugin.xml 
files that people have already written. Ultimately, however, we propose that 
future releases of Nutch remove the pathSuffix attribute from the plugin.xml 
schema.
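
As a rough illustration of what the proposed mime type alias mapping would do once loaded from mime-types.xml, here is a minimal in-memory sketch. The class `MimeTypeAliases`, its methods, and the sample alias in the usage note are hypothetical and only assume that each non-standard type maps to exactly one standard type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: not an existing Nutch class. It assumes the aliases
// have already been read from mime-types.xml by some other component.
class MimeTypeAliases {

    private final Map<String, String> canonicalByAlias = new HashMap<>();

    /** Registers a non-standard type as an alias of a standard one. */
    void addAlias(String alias, String canonical) {
        canonicalByAlias.put(normalize(alias), normalize(canonical));
    }

    /** Returns the standard type for a reported type, or the type itself. */
    String resolve(String reportedMimeType) {
        String key = normalize(reportedMimeType);
        return canonicalByAlias.getOrDefault(key, key);
    }

    /** Drops parameters such as "; charset=..." and lower-cases the type. */
    private static String normalize(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String bare = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return bare.trim().toLowerCase(Locale.ROOT);
    }
}
```

For example, after `addAlias("application/x-pdf", "application/pdf")` (an illustrative pair), a server response of `Application/X-PDF; charset=binary` would resolve to `application/pdf`, so the parse plugins only need to be registered against the standard type.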
  
  The proposed capability addition will simply control the order in which 
parsing plugins get called during fetching activities. It won’t directly 
impact the segments stored, or the webapp. It will only affect the fetcher 
component, and the mime types component.
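
The effect on call order can be pictured with a small sketch. It assumes, purely for illustration, that a lower priority number means "try this plugin first"; the class `ParsePluginOrdering` and its `Candidate` type are hypothetical and not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: names and the "lower number first" convention are
// assumptions made for this example only.
class ParsePluginOrdering {

    /** A plugin that can handle the content type, with its configured priority. */
    static final class Candidate {
        final String pluginId;
        final int priority;

        Candidate(String pluginId, int priority) {
            this.pluginId = pluginId;
            this.priority = priority;
        }
    }

    /** Returns plugin ids in the order they would be tried. */
    static List<String> callOrder(List<Candidate> candidates) {
        // Copy first so the caller's list is left untouched.
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((Candidate c) -> c.priority));
        List<String> ids = new ArrayList<>();
        for (Candidate c : sorted) {
            ids.add(c.pluginId);
        }
        return ids;
    }
}
```

Because only the order of parser invocation changes, nothing in this sketch touches stored segments or the webapp.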
  
