[ https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989 ]
Ferdy edited comment on NUTCH-1097 at 9/2/11 2:04 PM:
------------------------------------------------------

After digging into it for a while, I believe the best solution for now is to allow regexes in plugin.xml for the contentType attribute. This way, multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions (instead of simply using the wildcard 'asterisk').

To keep backwards compatibility, I decided to escape the 'plus' character in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern except the 'plus' character, which is always escaped and therefore matched literally. The wildcard 'asterisk' is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because 'asterisk' on its own is not a valid regex.)

To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches the contentType attribute of extensions using standard Java regexes with the 'plus' character escaped.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing these changes should be committed as is to the codebase, but I do believe the current situation is not flexible enough (especially the fact that many-to-one mappings in parse-plugins.xml cannot be supported by parser plugin.xml files). So if you have any suggestions or corrections, feel free to reply.

(Sorry for the edits. The plus/asterisk characters are messing up my layout.)

was (Author: ferdy.g):
After digging into it for a while, I believe the best solution for now is to allow regexes in plugin.xml for the contentType attribute. This way, multiple mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of the individual parser extensions.
(Instead of simply using the wildcard 'asterisk'.)

To keep backwards compatibility, I decided to escape '+' in the contentType attribute of extensions, because a lot of mimetypes contain this character. This will not break existing functionality. So you can use any regular expression supported by the standard Java Pattern except the '+' character. The wildcard 'asterisk' is still usable, because it is checked first in ParserFactory. (Otherwise an exception would occur, because 'asterisk' on its own is not a valid regex.)

To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches the contentType attribute of extensions using standard Java regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so it's consistent with the default provided parse-plugins.xml.

I'm not arguing these changes should be committed as is to the codebase, but I do believe the current situation is not flexible enough (especially the fact that many-to-one mappings in parse-plugins.xml cannot be supported by parser plugin.xml files). So if you have any suggestions or corrections, feel free to reply.

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to accept application/xhtml+xml; however, the plugin.xml of this plugin does not list this type. Either change the entry in parse-plugins.xml or change the parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
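
For illustration, the matching rule described in the comment above (check the 'asterisk' wildcard first, then escape 'plus' and treat the attribute as a Java regex) could look roughly like the following. This is a minimal sketch and not the code from the NUTCH-1097 patch; the class name ContentTypeMatch and the method matchesContentType are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the ParserFactory code from the patch.
class ContentTypeMatch {

    // Returns true if the contentType attribute from a plugin.xml accepts the given mimetype.
    static boolean matchesContentType(String attribute, String mimeType) {
        // The bare wildcard is checked first, because "*" alone is not a valid regex
        // (Pattern.compile("*") throws PatternSyntaxException).
        if ("*".equals(attribute)) {
            return true;
        }
        // Escape every '+' so mimetypes such as application/xhtml+xml match literally.
        String regex = attribute.replace("+", "\\+");
        // Pattern.matches requires the whole mimetype to match the expression.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String attribute = "text/html|application/xhtml+xml";
        System.out.println(matchesContentType(attribute, "text/html"));             // true
        System.out.println(matchesContentType(attribute, "application/xhtml+xml")); // true
        System.out.println(matchesContentType(attribute, "text/plain"));            // false
        System.out.println(matchesContentType("*", "image/png"));                   // true
    }
}
```

Existing attributes that name a single mimetype containing 'plus' keep matching that mimetype under this rule, which is the backwards-compatibility point made in the comment.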