[jira] Commented: (NUTCH-34) Parsing different content formats

Stephan Strittmatter (JIRA) Mon, 18 Apr 2005 03:25:58 -0700

     [ http://issues.apache.org/jira/browse/NUTCH-34?page=comments#action_63038 ]
     
Stephan Strittmatter commented on NUTCH-34:
-------------------------------------------


Yes, I agree: the supported MIME type(s) and file extension(s) should also be 
"registered" by the parser plugins in order to tell the crawler whether to 
transfer content or ignore it.

With this in place, we could avoid the many mails asking why a specific format 
is not crawled, when the cause is that it is on the ignore list in the 
crawler's config. That config would then be obsolete, because the plugins 
themselves know what has to be crawled and what does not.
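
To make the idea concrete, here is a minimal sketch of such a registry. It is
not taken from the Nutch code base: the class name ParserRegistry, its
methods, and the WHOLE_CONTENT constant are all hypothetical, and the
extension check is deliberately simplistic (it ignores query strings and
URLs without an extension).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Nutch.
final class ParserRegistry {

    /** Passed by a plugin that needs the whole document. */
    static final int WHOLE_CONTENT = -1;

    // MIME type -> bytes the registered parser wants (WHOLE_CONTENT = all).
    private final Map<String, Integer> byMimeType = new HashMap<>();
    // File extension (lower case, no dot) -> MIME type.
    private final Map<String, String> byExtension = new HashMap<>();

    /** Called by a parser plugin when it is installed and activated. */
    void register(String mimeType, int maxBytes, String... extensions) {
        String mime = mimeType.toLowerCase(Locale.ROOT);
        byMimeType.put(mime, maxBytes);
        for (String ext : extensions) {
            byExtension.put(ext.toLowerCase(Locale.ROOT), mime);
        }
    }

    /** True if some registered plugin can parse this MIME type. */
    boolean wantsMimeType(String mimeType) {
        return byMimeType.containsKey(mimeType.toLowerCase(Locale.ROOT));
    }

    /** True if the text after the last dot is a registered extension. */
    boolean wantsUrl(String url) {
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) {
            return false;
        }
        String ext = url.substring(dot + 1).toLowerCase(Locale.ROOT);
        return byExtension.containsKey(ext);
    }

    /** Bytes to fetch for this type; 0 means no plugin wants it. */
    int maxBytesFor(String mimeType) {
        Integer limit = byMimeType.get(mimeType.toLowerCase(Locale.ROOT));
        return limit == null ? 0 : limit;
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/html", 65536, "html", "htm");
        registry.register("application/pdf", WHOLE_CONTENT, "pdf");

        System.out.println(registry.wantsUrl("http://example.org/doc.pdf"));
        System.out.println(registry.wantsMimeType("image/png"));
        System.out.println(registry.maxBytesFor("text/html"));
    }
}
```

With something like this, the fetcher would ask the registry instead of
reading an ignore list, and a format becomes crawlable as soon as its plugin
is activated.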

> Parsing different content formats
> ---------------------------------
>
>          Key: NUTCH-34
>          URL: http://issues.apache.org/jira/browse/NUTCH-34
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stephan Strittmatter
>     Priority: Trivial

>
> At the moment Nutch is set up to filter content through the XML config file.
> The number of bytes to load is also set globally there.
> I think it would be better to let the parser plugins "register" themselves in 
> some registry where every plugin could tell the fetcher that:
> 1. this document type is wanted (because the parser plugin is 
>    installed and activated)
> 2. how much of the content is required (some plugins need the whole 
>    content and some not)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
