[jira] Commented: (NUTCH-34) Parsing different content formats

Andrzej Bialecki (JIRA) Mon, 18 Apr 2005 09:11:51 -0700

     [ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63049 ]
     
Andrzej Bialecki  commented on NUTCH-34:
----------------------------------------


Stephan & Jerome,

Let me explain why I think a boolean is useful (though, strictly speaking, not
required, as you noticed). When the Fetcher gets a content header (be it an HTTP
header or that of another protocol), it learns the content type (if present) and
the content size. Based on the content type it can select a parse plugin. Now,
if the content size exceeds the maximum size set in the plugin, the Fetcher
currently has only one choice: fetch up to the maximum number of bytes, pass
this partial content to the plugin and pray that it works. However, if we
introduce a boolean property meaning "plugin can handle partial content", then
the Fetcher can make an informed decision whether to fetch the partial content
at all. As a result, we can save significant bandwidth, disk space and CPU.
Also, this type of information is very easy to provide. Setting the maximum
size to "0" has different semantics: it simply means that the Fetcher should
fetch all content, no matter its size.
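
To make the decision above concrete, here is a minimal Java sketch. All names
in it (FetchDecision, ParserLimits, Action, decide) are hypothetical and are
not existing Nutch classes; it also assumes the protocol header reported a
content size, which is not always the case.

```java
// Hypothetical sketch, not Nutch code: how a fetcher could use a per-plugin
// "handles partial content" flag together with a per-plugin maximum size.
class FetchDecision {

    enum Action { FETCH_ALL, FETCH_PARTIAL, SKIP }

    /** Limits a parse plugin would declare (assumed shape, for illustration). */
    static final class ParserLimits {
        final long maxContentSize;    // 0 means "fetch all content, whatever its size"
        final boolean handlesPartial; // the proposed boolean property

        ParserLimits(long maxContentSize, boolean handlesPartial) {
            this.maxContentSize = maxContentSize;
            this.handlesPartial = handlesPartial;
        }
    }

    /**
     * Decides what to do once the content size is known from the header.
     * Assumes contentLength >= 0, i.e. the header actually reported a size.
     */
    static Action decide(long contentLength, ParserLimits limits) {
        if (limits.maxContentSize == 0) {
            return Action.FETCH_ALL;                 // no limit configured
        }
        if (contentLength <= limits.maxContentSize) {
            return Action.FETCH_ALL;                 // fits within the limit
        }
        // Too large: truncate only if the plugin says it can cope with that.
        return limits.handlesPartial ? Action.FETCH_PARTIAL : Action.SKIP;
    }
}
```

With a limit of 65536 bytes and a 1000000-byte document, decide() returns
FETCH_PARTIAL for a plugin that tolerates truncation and SKIP for one that
does not, which is where the bandwidth saving comes from.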

Regarding the plugin registry: IMHO it needs a configuration file anyway. There
needs to be a mechanism in place to preserve the ordering and priority of active
plugins (something more sophisticated than the current, nearly random order),
especially if more than one plugin handles the same MIME type. I agree that it's
convenient if each plugin "registers itself" for handling given MIME types, but
I would add to that "with a certain priority if more than one plugin exists for
a given type". Again, IMHO, it is also convenient to have a single place to
quickly turn various plugins on and off - it could be a config file, or it could
be an API (perhaps both?).
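
A rough Java sketch of such a registry follows, covering self-registration
with a priority plus a single on/off switch per plugin. ParserRegistry and its
methods are hypothetical names, not an existing Nutch API, and the plugin ids
in the usage note are only illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Nutch code: plugins register per MIME type with a
// priority, and one central place turns individual plugins on or off.
class ParserRegistry {

    private static final class Entry {
        final String pluginId;
        final int priority;

        Entry(String pluginId, int priority) {
            this.pluginId = pluginId;
            this.priority = priority;
        }
    }

    private final Map<String, List<Entry>> byMimeType = new HashMap<>();
    private final Set<String> disabled = new HashSet<>();

    /** Called by a plugin to register itself; a higher priority is tried first. */
    void register(String mimeType, String pluginId, int priority) {
        List<Entry> entries =
            byMimeType.computeIfAbsent(mimeType, k -> new ArrayList<>());
        entries.add(new Entry(pluginId, priority));
        // List.sort is stable, so equal priorities keep registration order.
        entries.sort(Comparator.comparingInt((Entry e) -> e.priority).reversed());
    }

    /** The "single place" to switch a plugin on or off. */
    void setEnabled(String pluginId, boolean enabled) {
        if (enabled) {
            disabled.remove(pluginId);
        } else {
            disabled.add(pluginId);
        }
    }

    /** Enabled plugins for a MIME type, in deterministic priority order. */
    List<String> pluginsFor(String mimeType) {
        List<String> result = new ArrayList<>();
        for (Entry e : byMimeType.getOrDefault(mimeType, new ArrayList<>())) {
            if (!disabled.contains(e.pluginId)) {
                result.add(e.pluginId);
            }
        }
        return result;
    }
}
```

For example, if "parse-html" registers for text/html with priority 10 and
"parse-text" with priority 1, pluginsFor("text/html") lists parse-html first;
after setEnabled("parse-html", false) only parse-text remains. A config file
could drive the same register and setEnabled calls at startup.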

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial

>
> At the moment Nutch is set up to filter content via the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in 
> some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is 
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole 
>    content and some not)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
