Here are a few questions I had about crawl-urlfilter.txt.
- Does Nutch obey crawl-urlfilter.txt properly? By default it is set not to download CSS, but when I run the crawl I see parse.ParseUtil exceptions in my hadoop.log (org.apache.nutch.parse.ParseException: parser not found for contentType=text/css). Doesn't this mean that Nutch has actually downloaded a CSS file and is trying to parse it?
- Can I put a positive filter in crawl-urlfilter.txt, such as +\.(html, htm), instead of the current one, which starts with "-"? Will it make Nutch download only files with the extensions htm and html?
- Are the extensions in crawl-urlfilter.txt case-sensitive? That is, do I have to add mp3, MP3, and Mp3 to tell Nutch not to download mp3 files?
- How does Nutch handle URLs that are fetched with GET but do not end with an extension? For example, if a URL like http://www.mysite.com/images/1 returns an image, will Nutch be able to identify it and avoid downloading it?

TIA,
--Hrishi
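
P.S. To make the regex questions above concrete, here is a minimal Java sketch of how I understand the filter rules to behave. It is not Nutch's actual code: the class name UrlFilterSketch and the method accepts are hypothetical, and it assumes that the first rule whose pattern is found anywhere in the URL decides the outcome and that a URL matching no rule is ignored. It relies only on standard java.util.regex behaviour, where matching is case-sensitive unless (?i) is given and alternatives are separated by | rather than by a comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Nutch's own filter class.
class UrlFilterSketch {

    // Each rule is '+' (include) or '-' (exclude) followed by a regular expression.
    // Assumption: the first rule whose pattern is found in the URL decides,
    // and a URL that matches no rule at all is rejected.
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Negative extension rule followed by a catch-all include.
        List<String> caseSensitive = Arrays.asList("-\\.(mp3|css)$", "+.");
        // Same rule with the (?i) flag, so MP3 and Mp3 are covered as well.
        List<String> caseInsensitive = Arrays.asList("-(?i)\\.(mp3|css)$", "+.");
        // Positive-only rule: note '|' between the alternatives, not ", ".
        List<String> positiveOnly = Arrays.asList("+\\.(html|htm)$");

        String[] urls = {
            "http://www.mysite.com/song.mp3",
            "http://www.mysite.com/song.MP3",
            "http://www.mysite.com/index.html",
            "http://www.mysite.com/images/1"
        };
        for (String url : urls) {
            System.out.println(url
                + " caseSensitive=" + accepts(caseSensitive, url)
                + " caseInsensitive=" + accepts(caseInsensitive, url)
                + " positiveOnly=" + accepts(positiveOnly, url));
        }
    }
}
```

If these assumptions hold, an extension-based rule can never catch http://www.mysite.com/images/1, because the URL string itself carries no hint that the response is an image. That is really the point of my last question.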