During the crawl-index pipeline, I see the relevant URL printed by the Fetcher but not by the indexer. The most plausible explanation is that Nutch is unable to crawl such URLs, so the subsequent operations cannot be performed on those pages. It does not print any informative error message. So by "failing" I mean that Nutch cannot crawl these URLs, and everything after the fetch step never happens for them.
Vijay

On Thu, May 15, 2008 at 9:17 AM, <[EMAIL PROTECTED]> wrote:
> Vijay,
>
> When you say things "fail", what exactly happens?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Vijay Krishnan <[EMAIL PROTECTED]>
> > To: nutch-user@lucene.apache.org
> > Sent: Wednesday, May 14, 2008 7:57:01 PM
> > Subject: Handling certain URLs in Nutch possibly with appropriate normalization?
> >
> > Hi all,
> >
> > I find that typing URLs in certain ways often gets nutch to bomb
> > even though it works fine in the browser and even when I try to open a
> > HTTPURLConnection to those URLs using Java. For example:
> >
> > 1. The url
> > http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates/
> > works fine when I try to index it using nutch but writing it as:
> > http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates
> > (without the slash in the end) causes it to fail.
> >
> > 2. The url http://www.go2linux.org/fedora-centos-root-password-recovery
> > gets crawled and indexed properly whereas the url:
> > http://www.go2linux.org/fedora-centos-root-password-recovery/ fails.
> >
> > As I mentioned, all of these work fine when I try to open an
> > HTTPURLConnection to them from java. Is there a simple patch I can use
> > for cases like this?
> >
> > In addition, it appears that nutch does some simple URL
> > normalization like adding a slash to the end of a domain name. Is it
> > easy to call the URLNormalizer of Nutch independently of the crawling
> > and indexing process? A pointer to the class/method will be very
> > useful.
> >
> > Thanks,
> > Vijay
> > http:/www.cs.stanford.edu/~vijayk

--
Vijay Krishnan
Founder, Infoaxe Inc.
http://www.cs.stanford.edu/~vijayk
http://www.infoaxe.com/hiring.html
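
P.S. To make the normalization question concrete, here is a minimal sketch of the kind of "simple" normalization mentioned in the quoted message (lowercasing the scheme and host, and adding a slash after a bare domain name). It uses only java.net.URI from the JDK. It is not Nutch's URLNormalizer and does not call any Nutch code; the class name SimpleUrlNormalizer and its behavior are assumptions made up for illustration.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; NOT part of Nutch.
final class SimpleUrlNormalizer {

    // Lowercases scheme and host, resolves "." and ".." path segments,
    // and turns an empty path into "/". The fragment is dropped because
    // it is never sent to the server. Throws IllegalArgumentException
    // (from URI.create) if the input is not a syntactically valid URI.
    static String normalize(String url) {
        String trimmed = url.trim();
        URI uri = URI.create(trimmed).normalize();

        // Leave anything without a scheme and host (relative or opaque URIs) alone.
        if (uri.getScheme() == null || uri.getHost() == null) {
            return trimmed;
        }

        StringBuilder out = new StringBuilder();
        out.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        if (uri.getRawUserInfo() != null) {
            out.append(uri.getRawUserInfo()).append('@');
        }
        out.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            out.append(':').append(uri.getPort());
        }

        // Only a completely empty path gets a slash; a non-empty path is
        // kept as written, with or without its trailing slash.
        String path = uri.getRawPath();
        out.append(path == null || path.isEmpty() ? "/" : path);

        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://www.infoaxe.com"));
        System.out.println(normalize("http://www.go2linux.org/fedora-centos-root-password-recovery"));
    }
}
```

The sketch deliberately does not add or strip a trailing slash on a non-empty path. As the two examples in the quoted message show, one site only works with the slash and the other only without it, so a blanket rule in either direction would break one of them.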