Suggestion:
For consistency and ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (all others would be excluded) in
the fetch process.
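
A rough sketch of that derivation, in Python for brevity. This is not Nutch code: the two lookup tables are hand-written stand-ins for what would be read from parse-plugins.xml and from a mime-type-to-extension registry, and the plugin ids (in particular "parse-bzip2") and the function name are purely illustrative assumptions.

```python
# Hypothetical stand-in for the mime-type -> plugin mapping in parse-plugins.xml.
MIME_TO_PLUGINS = {
    "text/html": ["parse-html"],
    "application/pdf": ["parse-pdf"],
    "application/x-bzip2": ["parse-bzip2"],  # illustrative plugin id
}

# Hypothetical stand-in for a mime-type -> file-extension registry.
MIME_TO_EXTENSIONS = {
    "text/html": ["html", "htm"],
    "application/pdf": ["pdf"],
    "application/x-bzip2": ["bz2"],
}


def allowed_extensions(enabled_plugins, mime_to_plugins, mime_to_extensions):
    """Return the extensions whose mime type has at least one enabled parser."""
    enabled = set(enabled_plugins)
    allowed = set()
    for mime, plugins in mime_to_plugins.items():
        # A mime type is parseable if any plugin mapped to it is enabled.
        if enabled.intersection(plugins):
            allowed.update(mime_to_extensions.get(mime, []))
    return allowed


# With only the HTML and PDF parsers enabled, .bz2 is left out of the fetch list.
fetchable = allowed_extensions(
    ["parse-html", "parse-pdf"], MIME_TO_PLUGINS, MIME_TO_EXTENSIONS
)
```

Anything not in the returned set would be excluded from the fetch, so a large archive with no enabled parser would never be downloaded.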

I asked a Nutch consultant this exact question a few months ago.

It does seem odd that there's an implicit dependency between the file suffixes listed in regex-urlfilter.txt and the plugins enabled in nutch-default.xml and nutch-site.xml. What's the point of downloading a 100MB .bz2 file if no enabled plugin can handle it?

It's also odd that there's a nutch-site.xml for site-specific settings, but no equivalent site-level file for regex-urlfilter.txt.

There are cases where a suffix (like .php) can return content of any mime-type, and other suffixes (like .xml) that can mean any number of things. So I think you'd still want regex-urlfilter.txt files (both a default and a site version) that provide explicit additions to and deletions from the list generated from the installed and enabled parse plugins.
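
To make the layering concrete, here is a minimal Python sketch. It is not how Nutch works today: the "+ext" / "-ext" rule strings are a simplified, hypothetical notation loosely modelled on the +/- prefixes in regex-urlfilter.txt, and the function name is made up. Default rules are applied first and site rules last, so the site version wins on conflicts.

```python
def apply_overrides(generated, default_rules, site_rules):
    """Adjust a generated extension set with explicit '+ext' / '-ext' rules."""
    result = set(generated)
    # Site rules are processed after the defaults, so they take precedence.
    for rules in (default_rules, site_rules):
        for rule in rules:
            sign, ext = rule[0], rule[1:]
            if sign == "+":
                result.add(ext)      # explicit addition, e.g. .php
            elif sign == "-":
                result.discard(ext)  # explicit deletion, e.g. .xml
    return result


# Generated list from the enabled plugins, then default and site adjustments.
final = apply_overrides(
    {"html", "pdf", "xml"},
    default_rules=["+php", "-xml"],
    site_rules=["+xml", "-pdf"],
)
```

Here the default rules add .php and drop .xml, and the site rules then restore .xml and drop .pdf.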

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers