[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989
 ] 

Ferdy edited comment on NUTCH-1097 at 9/2/11 2:04 PM:
------------------------------------------------------

After digging into it for a while, I believe the best solution for now is to 
allow regexes in the contentType attribute in plugin.xml. This way the multiple 
mimetypes that parse-plugins.xml maps to a plugin can be supported by the 
plugin.xml of the individual parser extensions (instead of simply falling back 
to the 'asterisk' wildcard).

To keep backwards compatibility, I decided to escape the 'plus' character in 
the contentType attribute of extensions, because a lot of mimetypes contain 
this character. This will not break existing functionality. So you can use any 
regular expression supported by the standard Java Pattern, except that the 
'plus' character is always treated as a literal. The 'asterisk' wildcard is 
still usable, because it is checked first in ParserFactory. (Otherwise an 
exception would occur, because a lone 'asterisk' is not a valid regex.)
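
A minimal sketch of the matching described above, for illustration only. The 
class ContentTypeMatcher and its matches method are hypothetical names, not 
the code from the patch, and the real ParserFactory is not reproduced here. It 
only assumes the behaviour stated in this comment: the wildcard is checked 
first, then the 'plus' character is escaped and the rest is handed to 
java.util.regex.Pattern.

```java
import java.util.regex.Pattern;

public class ContentTypeMatcher {

    // Hypothetical helper: decides whether an extension's contentType
    // attribute (from plugin.xml) accepts the given mimetype.
    public static boolean matches(String extensionContentType, String mimeType) {
        // The bare wildcard is handled first, because "*" on its own
        // is not a valid regular expression.
        if ("*".equals(extensionContentType)) {
            return true;
        }
        // Escape every literal '+' so that existing values such as
        // application/xhtml+xml keep matching exactly as before.
        String regex = extensionContentType.replace("+", "\\+");
        // Pattern.matches requires the whole mimetype to match.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String htmlTypes = "text/html|application/xhtml+xml";
        System.out.println(matches(htmlTypes, "text/html"));             // true
        System.out.println(matches(htmlTypes, "application/xhtml+xml")); // true
        System.out.println(matches(htmlTypes, "text/plain"));            // false
        System.out.println(matches("*", "image/png"));                   // true
    }
}
```

Because only the 'plus' character is escaped, other metacharacters (for 
example the alternation bar) keep their regex meaning, which is what makes the 
many-to-one mapping possible.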

To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches contentType attribute of extensions using standard Java 
regexes with escaped 'plus' character.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so 
it's consistent with the default provided parse-plugins.xml.

I'm not arguing these changes should be committed as-is to the codebase, but I 
do believe the current situation is not flexible enough (especially the fact 
that many-to-one mappings of parse-plugins.xml cannot be supported by parser 
plugin.xml files). So if you have any suggestions or corrections, feel free to 
reply.

(Sorry for the edits. The plus/asterisk characters are messing up my layout.)

      was (Author: ferdy.g):
    After digging into it for a while, I believe the best solution for now is 
to allow regexes in plugin.xml for the attribute contentType. This way multiple 
mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of 
the individual parser extensions. (Instead of plain using the wildcard 
'asterisk')

To keep backwards compatibility, I decided to escape '+' in the contentType 
attribute of extensions, because a lot of mimetypes contain this character. 
This will not break existing functionality. So you can use any regular 
expression supported by the standard Java Pattern except the '+' character. The 
wildcard 'asterisk' is still usable, because this one is checked first in 
ParserFactory. (Otherwise an exception occurs because 'asterisk' is not a 
valid regex.)

To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches contentType attribute of extensions using standard Java 
regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so 
it's consistent with the default provided parse-plugins.xml.

I'm not arguing these changes should be committed as is in the codebase, but I 
do believe the current situation is not flexible enough. (Especially the fact 
that many-to-one mappings of parse-plugins.xml cannot be supported by parser 
plugin.xml files). So if you have any suggestions or corrections feel free to 
reply.
  
> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
> multiple mimetypes for plugin.xml
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1097
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1097
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Ferdy
>            Priority: Minor
>         Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, 
> NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to 
> accept application/xhtml+xml, however the plugin.xml of this plugin does not 
> list this type. Either change the entry in parse-plugins.xml or change the 
> parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
