[Nutch-dev] Re: Urlfilter Patch

Ken Krugler Thu, 01 Dec 2005 14:09:45 -0800

Suggestion:
For consistency and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (all others will be excluded) in
the fetch process.


I'd asked a Nutch consultant this exact same question a few months ago.

It does seem odd that there's an implicit dependency between the file suffixes found in regex-urlfilter.txt and the enabled plug-ins found in nutch-default.xml and nutch-site.xml. What's the point of downloading a 100MB .bz2 file if there's nobody available to handle it?

It's also odd that there's a nutch-site.xml, but no equivalent for regex-urlfilter.txt.

There are cases of suffixes (like .php) that can return any kind of mime-type content, and other suffixes (like .xml) that can mean any number of things. So I think you'd still want regex-urlfilter.txt files (both a default and a site version) that provide explicit additions/deletions to the list generated from the installed and enabled parse-plugins.
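
To make the idea concrete, here is a minimal Python sketch of how the suffix list could be generated and then adjusted. It is only an illustration, not Nutch code: the two tables, the function names (allowed_suffixes, should_fetch) and the plugin-to-mime-type pairings are all assumptions made up for the example. In practice the tables would be read from parse-plugins.xml and the enabled-plugin settings, and the additions/deletions would come from the default and site filter files.

```python
# Hypothetical data: stands in for what would be read from parse-plugins.xml
# and the enabled-plugin configuration. None of this is actual Nutch API.
PLUGIN_MIME_TYPES = {
    "parse-html": {"text/html"},
    "parse-text": {"text/plain"},
    "parse-pdf": {"application/pdf"},
}

# Assumed mime-type -> file suffix table (illustrative only).
MIME_SUFFIXES = {
    "text/html": {"html", "htm"},
    "text/plain": {"txt"},
    "application/pdf": {"pdf"},
}


def allowed_suffixes(enabled_plugins, additions=(), deletions=()):
    """Build the suffix list from the enabled plugins, then apply
    the explicit additions and deletions on top of it."""
    suffixes = set()
    for plugin in enabled_plugins:
        for mime in PLUGIN_MIME_TYPES.get(plugin, ()):
            suffixes |= MIME_SUFFIXES.get(mime, set())
    return (suffixes | set(additions)) - set(deletions)


def should_fetch(url_path, suffixes):
    """Return True if the path's suffix is in the allowed set.

    Paths with no suffix are accepted, since their content type
    cannot be guessed before fetching. Expects a path without a
    query string.
    """
    name = url_path.rsplit("/", 1)[-1]
    if "." not in name:
        return True
    return name.rsplit(".", 1)[-1].lower() in suffixes


if __name__ == "__main__":
    suffixes = allowed_suffixes(
        ["parse-html", "parse-text"],
        additions=["php"],   # .php can return any mime-type, so allow it explicitly
        deletions=["txt"],   # example of a site-level exclusion
    )
    print(sorted(suffixes))
    print(should_fetch("/dump/archive.bz2", suffixes))
```

With this arrangement the 100MB .bz2 file is skipped because no enabled plugin claims it, while .php is fetched only because it was added explicitly.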


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200


