A few questions about crawl-urlfilter.txt

Hrishikesh Agashe Tue, 14 Jul 2009 05:13:33 -0700

Here are a few questions I have about crawl-urlfilter.txt.


- Does Nutch obey crawl-urlfilter.txt properly? By default, it is set 
to not download CSS files, but when I run a crawl I see parse.ParseUtil 
exceptions in my hadoop.log (org.apache.nutch.parse.ParseException: parser not 
found for contentType=text/css).
Doesn't this mean that Nutch has actually downloaded a CSS file and is trying 
to parse it?


- Can I put a positive filter in crawl-urlfilter.txt, like

+\.(html, htm)

instead of the current one, which starts with "-"? Will it make Nutch download 
only files with the extensions htm and html?
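
To make the question concrete, here is a minimal sketch of how I assume such a rule list is evaluated: each rule is a "+" or "-" followed by a regular expression, rules are tried top to bottom, the first pattern found in the URL decides, and a URL that matches no rule is rejected. The function name accepts and the rule list are hypothetical, and Python's re module only stands in for whatever regex engine Nutch actually uses. Note that in regex syntax the alternation would be written (html|htm) rather than (html, htm).

```python
import re

def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule.

    Assumption (not confirmed for Nutch): rules are tried in order, the
    first pattern found anywhere in the URL decides, and a URL that
    matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Hypothetical positive filter: accept only URLs ending in .html or .htm.
positive_rules = [r"+\.(html|htm)$"]

for url in ("http://www.mysite.com/index.html",
            "http://www.mysite.com/style.css",
            "http://www.mysite.com/images/1"):
    print(url, accepts(url, positive_rules))
```

Under these assumptions a positive-only list would also reject URLs such as http://www.mysite.com/images/1 or a bare http://www.mysite.com/, since neither ends in .html or .htm.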



- Are the extensions in crawl-urlfilter.txt case sensitive? That is, do I 
have to add mp3, MP3, and Mp3 to tell Nutch not to download mp3 files?
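
The case question can be illustrated with a small sketch, assuming the filter patterns behave like ordinary regular expressions, which are case sensitive unless flagged otherwise. Python's re module is used here as a stand-in, and the URLs and variable names are hypothetical. Whether Nutch honors an inline (?i) flag is an assumption on my part.

```python
import re

# Hypothetical URLs that differ only in the case of the extension.
urls = [
    "http://www.mysite.com/a.mp3",
    "http://www.mysite.com/b.MP3",
    "http://www.mysite.com/c.Mp3",
]

case_sensitive = re.compile(r"\.mp3$")            # lowercase only
listed = re.compile(r"\.(mp3|MP3|Mp3)$")          # every variant spelled out
case_insensitive = re.compile(r"(?i)\.mp3$")      # inline ignore-case flag

def matched(pattern):
    """Return the URLs in which the pattern is found."""
    return [u for u in urls if pattern.search(u)]

print(len(matched(case_sensitive)))
print(len(matched(listed)))
print(len(matched(case_insensitive)))
```

If the patterns are case sensitive, listing every variant works but misses spellings such as mP3, which is why a single case-insensitive pattern would be preferable if it is supported.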



- How does Nutch handle URLs that are fetched with GET but do not end with 
an extension? For example, if a URL like http://www.mysite.com/images/1 
returns an image, will Nutch be able to identify it and avoid downloading it?

TIA,
--Hrishi


