Suggestion:
For consistency, and for ease of Nutch management, why not filter the
extensions based on the activated plugins?
By looking at the mime-types defined in the parse-plugins.xml file and the
activated plugins, we know which content-types will be parsed.
So, by getting the file extensions associated with each content-type, we can
build a list of file extensions to include (other ones will be excluded) in
the fetch process.
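
A minimal sketch of that idea, under stated assumptions: the two tables below are hypothetical in-memory stand-ins (one for the mime-type to plugin mapping in parse-plugins.xml, one for a mime-type to file-extension table), and the enabled plugins are assumed to be given as a regular expression in the style of a plugin-includes setting. Nothing here is Nutch's actual API.

```python
import re

# Hypothetical stand-in for parse-plugins.xml: mime-type -> parse plugin ids.
MIME_TO_PLUGINS = {
    "text/html": ["parse-html"],
    "text/plain": ["parse-text"],
    "application/pdf": ["parse-pdf"],
    "application/x-bzip2": [],  # no parser declared for this type
}

# Hypothetical mime-type -> file extension table (illustrative values only).
MIME_TO_EXTENSIONS = {
    "text/html": ["html", "htm"],
    "text/plain": ["txt"],
    "application/pdf": ["pdf"],
    "application/x-bzip2": ["bz2"],
}


def parseable_extensions(plugin_includes, mime_to_plugins, mime_to_exts):
    """Return the extensions whose mime-type has at least one enabled parser.

    plugin_includes is assumed to be a regex that must match a plugin id
    in full for that plugin to count as activated.
    """
    enabled = re.compile(plugin_includes)
    extensions = set()
    for mime, plugins in mime_to_plugins.items():
        if any(enabled.fullmatch(plugin) for plugin in plugins):
            extensions.update(mime_to_exts.get(mime, []))
    return extensions


# Only the text and HTML parsers are enabled in this example.
allowed = parseable_extensions(
    r"protocol-http|parse-(text|html)", MIME_TO_PLUGINS, MIME_TO_EXTENSIONS
)
print(sorted(allowed))
```

With this input, .pdf and .bz2 fall outside the list because no enabled plugin claims their mime-types, so URLs with those suffixes would be excluded from the fetch.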
I asked a Nutch consultant this exact same question a few months ago.
It does seem odd that there's an implicit dependency between the file
suffixes found in regex-urlfilter.txt and the enabled plug-ins found
in nutch-default.xml and nutch-site.xml. What's the point of
downloading a 100MB .bz2 file if there's nobody available to handle
it?
It's also odd that there's a nutch-site.xml, but no equivalent for
regex-urlfilter.txt.
There are also some suffixes (like .php) that can return content of
any mime-type, and other suffixes (like .xml) that can
mean any number of things. So I think you'd still want
regex-urlfilter.txt files (both a default and a site version) that
provide explicit additions/deletions to the list generated from the
installed and enabled parse-plugins.
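
A rough sketch of how such explicit additions/deletions could be layered, assuming a hypothetical rule syntax ("+ext" to add, "-ext" to remove) and a hypothetical default/site pair of rule lists, with the site rules applied last so they win. This is an illustration only, not the format regex-urlfilter.txt actually uses.

```python
def apply_overrides(generated, *override_layers):
    """Apply explicit '+ext' / '-ext' rules on top of a generated extension set.

    Layers are applied in order, so a later layer (e.g. a site file)
    overrides an earlier one (e.g. a default file).
    """
    result = set(generated)
    for layer in override_layers:
        for rule in layer:
            sign, ext = rule[:1], rule[1:].lower()
            if sign == "+":
                result.add(ext)
            elif sign == "-":
                result.discard(ext)
            else:
                raise ValueError(f"rule must start with '+' or '-': {rule!r}")
    return result


# Hypothetical list generated from the enabled parse-plugins.
generated = {"html", "htm", "txt", "xml"}

# Hypothetical default rules: fetch .php (any mime-type), skip ambiguous .xml.
default_rules = ["+php", "-xml"]

# Hypothetical site rules: this installation does want .xml after all.
site_rules = ["+xml"]

final = apply_overrides(generated, default_rules, site_rules)
print(sorted(final))
```

Keeping the overrides as a separate, ordered layer means the generated list can be rebuilt whenever the enabled plug-ins change without losing the hand-written exceptions.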
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200