[
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989
]
Ferdy edited comment on NUTCH-1097 at 9/2/11 2:04 PM:
------------------------------------------------------
After digging into it for a while, I believe the best solution for now is to
allow regexes in plugin.xml for the attribute contentType. This way multiple
mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of
the individual parser extensions (instead of simply using the 'asterisk'
wildcard).
To keep backwards compatibility, I decided to escape the 'plus' character in
the contentType attribute of extensions, because a lot of mimetypes contain
this character. This will not break existing functionality. So you can use any
regular expression supported by the standard Java Pattern, except that the
'plus' character is always treated as a literal. The wildcard 'asterisk' is
still usable, because this one is checked first in ParserFactory. (Otherwise an
exception would occur, because a lone 'asterisk' is not a valid regex.)
To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches contentType attribute of extensions using standard Java
regexes with escaped 'plus' character.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so
it's consistent with the default provided parse-plugins.xml.
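A minimal sketch of the matching described in the first change. This is not the
code from the patch: the class ContentTypeMatcher and the method matches are
hypothetical names used only for illustration, and the real ParserFactory may
structure the check differently. It assumes the bare 'asterisk' is tested
before any regex is compiled and that every 'plus' is escaped as a literal.

```java
import java.util.regex.Pattern;

public class ContentTypeMatcher {

    // Hypothetical helper, not the actual ParserFactory code from the patch.
    static boolean matches(String contentTypeAttr, String mimeType) {
        // Handle the bare wildcard first: "*" on its own is not a valid regex,
        // so Pattern.compile("*") would throw a PatternSyntaxException.
        if ("*".equals(contentTypeAttr)) {
            return true;
        }
        // Escape every '+' so that "application/xhtml+xml" matches itself
        // literally instead of '+' acting as a quantifier.
        String regex = contentTypeAttr.replace("+", "\\+");
        // Pattern.matches requires the whole mimetype to match the regex.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String attr = "text/html|application/xhtml+xml";
        System.out.println(matches(attr, "text/html"));
        System.out.println(matches(attr, "application/xhtml+xml"));
        System.out.println(matches(attr, "application/pdf"));
        System.out.println(matches("*", "application/pdf"));
    }
}
```

With this approach an existing contentType value that contains a 'plus' keeps
matching exactly the mimetype it matched before, which is the backwards
compatibility argument made above.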
I'm not arguing these changes should be committed as-is to the codebase, but I
do believe the current situation is not flexible enough (especially the fact
that many-to-one mappings in parse-plugins.xml cannot be supported by parser
plugin.xml files). So if you have any suggestions or corrections, feel free to
reply.
(Sorry for the edits. The plus/asterisk characters are messing up my layout.)
was (Author: ferdy.g):
After digging into it for a while, I believe the best solution for now is
to allow regexes in plugin.xml for the attribute contentType. This way multiple
mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of
the individual parser extensions (instead of simply using the 'asterisk'
wildcard).
To keep backwards compatibility, I decided to escape '+' in the contentType
attribute of extensions, because a lot of mimetypes contain this character.
This will not break existing functionality. So you can use any regular
expression supported by the standard Java Pattern, except that the '+'
character is always treated as a literal. The wildcard 'asterisk' is still
usable, because this one is checked first in ParserFactory. (Otherwise an
exception would occur, because a lone 'asterisk' is not a valid regex.)
To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches contentType attribute of extensions using standard Java
regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so
it's consistent with the default provided parse-plugins.xml.
I'm not arguing these changes should be committed as-is to the codebase, but I
do believe the current situation is not flexible enough (especially the fact
that many-to-one mappings in parse-plugins.xml cannot be supported by parser
plugin.xml files). So if you have any suggestions or corrections, feel free to
reply.
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow
> multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-1097
> URL: https://issues.apache.org/jira/browse/NUTCH-1097
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Ferdy
> Priority: Minor
> Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch,
> NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to
> accept application/xhtml+xml, however the plugin.xml of this plugin does not
> list this type. Either change the entry in parse-plugins.xml or change the
> parse-html plugin.xml. I suggest the latter. See patch.