Re: Handling certain URLs in Nutch possibly with appropriate normalization?

Vijay Krishnan Thu, 15 May 2008 16:32:37 -0700

During the crawl-index pipeline, I see the relevant URL printed by the
Fetcher but not by the indexer. The most plausible explanation is that
Nutch is unable to fetch such URLs, so none of the subsequent
operations can be performed on these pages. It does not print very
informative error messages. So by "Nutch failing" I mean that the URL
cannot be crawled, which makes the later steps impossible.
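
One way to narrow this down is to compare what the server returns for the
slash and no-slash forms when redirects are NOT followed. HttpURLConnection
follows redirects by default, so a 301/302 between the two forms would be
invisible in a plain Java test, while Nutch's fetcher may handle a redirect
differently (possibly depending on a setting such as http.redirect.max, if
your version has it). That a redirect is the cause here is only an
assumption. The class name RedirectProbe and both method names below are
hypothetical; only JDK classes are used.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class RedirectProbe {

    /** Returns the URL with the trailing slash of its path added or removed. */
    public static String toggleTrailingSlash(String url) {
        URI u = URI.create(url);
        String path = u.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";                                   // bare host: use the root path
        } else if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);  // drop the trailing slash
        } else if (!path.endsWith("/")) {
            path = path + "/";                            // add a trailing slash
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://").append(u.getRawAuthority()).append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        if (u.getRawFragment() != null) sb.append('#').append(u.getRawFragment());
        return sb.toString();
    }

    /** Sends a HEAD request without following redirects; returns status and Location. */
    public static String probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setInstanceFollowRedirects(false);  // show the raw 3xx instead of following it
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return status + (location != null ? " -> " + location : "");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Probe each URL given on the command line together with its slash variant.
        for (String url : args) {
            String variant = toggleTrailingSlash(url);
            System.out.println(url + " : " + probe(url));
            System.out.println(variant + " : " + probe(variant));
        }
    }
}
```

Running it with one of the URLs from the original message as the argument
prints the status line for both forms. If the failing form shows a 3xx
status with a Location header pointing at the working form, redirect
handling is the likely place to look.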



Vijay

On Thu, May 15, 2008 at 9:17 AM,  <[EMAIL PROTECTED]> wrote:
> Vijay,
>
>  When you say things "fail", what exactly happens?
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
>
>  ----- Original Message ----
>  > From: Vijay Krishnan <[EMAIL PROTECTED]>
>  > To: nutch-user@lucene.apache.org
>  > Sent: Wednesday, May 14, 2008 7:57:01 PM
>  > Subject: Handling certain URLs in Nutch possibly with appropriate normalization?
>  >
>  > Hi all,
>  >
>  >      I find that typing URLs in certain ways often gets Nutch to bomb
>  > even though they work fine in the browser and even when I try to open an
>  > HttpURLConnection to those URLs using Java. For example:
>  >
>  > 1. The URL
>  > http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates/
>  > works fine when I try to index it using Nutch, but writing it as
>  > http://www.techcrunch.com/2008/05/14/nys-amazon-tax-takes-first-casualty-overstock-affiliates
>  > (without the trailing slash) causes it to fail.
>  >
>  > 2. The URL http://www.go2linux.org/fedora-centos-root-password-recovery
>  > gets crawled and indexed properly, whereas the URL
>  > http://www.go2linux.org/fedora-centos-root-password-recovery/ fails.
>  >
>  >     As I mentioned, all of these work fine when I try to open an
>  > HttpURLConnection to them from Java. Is there a simple patch I can use
>  > for cases like this?
>  >
>  >      In addition, it appears that Nutch does some simple URL
>  > normalization, like adding a slash to the end of a domain name. Is it
>  > easy to call Nutch's URLNormalizer independently of the crawling
>  > and indexing process? A pointer to the class/method would be very
>  > useful.
>  >
>  >
>  > Thanks,
>  > Vijay
>  > http://www.cs.stanford.edu/~vijayk
>
>



-- 
Vijay Krishnan
Founder, Infoaxe Inc.
http://www.cs.stanford.edu/~vijayk
http://www.infoaxe.com/hiring.html
