[ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63147 ]

Andrzej Bialecki  commented on NUTCH-34:
----------------------------------------

Stephan,

Regarding the urlfilter config file: well, my point was that it would be nice 
to have a single place to turn on/off various plugins. The alternative is to do 
it for each plugin separately... We could change the format though - instead of 
extensions we could perhaps use plugin IDs...? This file could also define the 
ordering (or priority) of plugins.

Regarding the plugin ordering: parser plugins are somewhat exceptional, because 
only one of them has to be invoked. Other plugins are used as a filtering chain 
- but even in those cases their order matters.

For the parsing plugins, the algorithm currently works as follows (copied from 
ParserFactory): 

[Parser extensions should define the attributes "contentType" and/or 
"pathSuffix".  Content type has priority: the first plugin found whose 
"contentType" attribute matches the beginning of the content's type is used.  
If none match, then the first whose "pathSuffix" attribute matches the end of 
the url's path is used.  If neither of these match, then the first plugin whose 
"pathSuffix" is the empty string is used.]

This means that if more than one parser is registered for the same content type 
and path suffix, only the first one on the list will ever be used.
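
To make the lookup order concrete, here is a minimal sketch of the three steps 
described above. It is not the actual ParserFactory code: the ParserExt class, 
the select() method and the sample plugin IDs are hypothetical names invented 
for illustration, and an undefined attribute is modelled as null.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parser extension; not Nutch's actual class.
final class ParserExt {
    final String id;          // illustrative plugin ID
    final String contentType; // null means the attribute is not defined
    final String pathSuffix;  // null means the attribute is not defined

    ParserExt(String id, String contentType, String pathSuffix) {
        this.id = id;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

final class ParserSelectionSketch {

    // Returns the first matching extension in list order, or null if none matches.
    static ParserExt select(List<ParserExt> exts, String contentType, String urlPath) {
        // Step 1: content type has priority. Take the first extension whose
        // "contentType" matches the beginning of the content's type.
        if (contentType != null) {
            for (ParserExt e : exts) {
                if (e.contentType != null && contentType.startsWith(e.contentType)) {
                    return e;
                }
            }
        }
        // Step 2: take the first extension whose "pathSuffix" matches the end of
        // the URL path. Empty suffixes are skipped here (an assumption of this
        // sketch), because endsWith("") is always true and would hide step 3.
        if (urlPath != null) {
            for (ParserExt e : exts) {
                if (e.pathSuffix != null && !e.pathSuffix.isEmpty()
                        && urlPath.endsWith(e.pathSuffix)) {
                    return e;
                }
            }
        }
        // Step 3: fall back to the first extension whose "pathSuffix" is empty.
        for (ParserExt e : exts) {
            if ("".equals(e.pathSuffix)) {
                return e;
            }
        }
        return null;
    }

    // Sample list with two parsers claiming the same content type and suffix.
    static List<ParserExt> demoExtensions() {
        return Arrays.asList(
            new ParserExt("html-a", "text/html", "html"),
            new ParserExt("html-b", "text/html", "html"),
            new ParserExt("pdf", "application/pdf", "pdf"),
            new ParserExt("text", "text/plain", ""));
    }

    public static void main(String[] args) {
        List<ParserExt> exts = demoExtensions();
        System.out.println(select(exts, "text/html; charset=UTF-8", "/index.html").id);
        System.out.println(select(exts, "application/octet-stream", "/doc.pdf").id);
        System.out.println(select(exts, "application/octet-stream", "/data.bin").id);
    }
}
```

In this sample list "html-b" can never be selected, since "html-a" precedes it 
and declares the same attributes. That is the situation where an explicit 
ordering or priority would make a difference.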

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial

>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in 
> some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is 
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole 
>    content and some not)
