Hi Anita.

I tried crawling autos.aol.com, and I could find pages similar to what you're looking at in 3 crawls. (I injected http://autos.aol.com/ and added autos.aol.com to my regex filter to allow it.)


e.g.
fetching http://autos.aol.com/bmw-650-2007:8774-photos
fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
fetching http://autos.aol.com/options_trimless?v=8544
fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
fetching http://autos.aol.com/bmw-m-2007:8905-overview
fetching http://autos.aol.com/getaquote?myid=8623
fetching http://autos.aol.com/options_trimless?v=8226
fetching http://autos.aol.com/options_trimless?v=7803
fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
fetching http://autos.aol.com/bmw-x3-2007:8770-specs
fetching http://autos.aol.com/saturn-vue-2007:8371-overview
fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
fetching http://autos.aol.com/options_trimless?v=8394
fetching http://autos.aol.com/jaguar-listings:JA---
fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
fetching http://autos.aol.com/bmw-x5-2007:8817-overview
fetching http://autos.aol.com/audi-a4-2007:8622-specs
fetching http://autos.aol.com/options_trimless?v=8416
fetching http://autos.aol.com/getaquote?myid=8774

The difference is that I am using the latest Nutch (SVN head), and am just using a local store, not Hadoop.

What I would do next, if I were you, is check your regex filters to make sure you are not blocking URLs with a colon ':' in them for some strange reason, and possibly upgrade to the latest and greatest version of Nutch (0.9.1).
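
If it helps, here is a rough way to test a set of filter rules against a URL outside of Nutch. This is only a sketch: the rules below are hypothetical (they are not copied from your config or mine), and it assumes the usual "+regex / -regex" filter file format where the first matching rule decides and an unmatched URL is rejected. Python's regex syntax is close to Java's but not identical, so treat the result as a hint rather than proof.

```python
import re

# Hypothetical rules for illustration only; paste your own filter lines here.
RULES = r"""
# reject any URL that has a colon after the host part
-^http://[^/]+/.*:
# accept anything on the host being crawled
+^http://([a-z0-9]*\.)*autos.aol.com/
# reject everything else
-.
"""

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def first_match(url, rules):
    """Return (accepted, pattern_text) for the first rule that matches the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept, pattern.pattern
    return False, None  # no rule matched: treated as rejected

if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://autos.aol.com/acura-rl-2006:8060-review",
        "http://autos.aol.com/getaquote?myid=8623",
    ):
        accepted, rule = first_match(url, rules)
        print("ACCEPT" if accepted else "REJECT", url, "by rule:", rule)
```

With these made-up rules the acura-rl URL is rejected by the first rule because of the colon, while the getaquote URL is accepted. If one of your real rules rejects the colon URLs in the same way, that rule is the place to look.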

regards
Ian.



On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:

Hi

I am a new Nutch user, and am using Nutch 0.8.1 with Hadoop. The domain I am trying to crawl is http://autos.aol.com. I am crawling
to a depth of 10.
There are certain pages that Nutch could not fetch. An example would be
http://autos.aol.com/acura-rl-2006:8060-review.

The referring URL to this page is
http://autos.aol.com/acura-rl-2007:8060-review. This URL was
in the fetch list.

I did a mini crawl pointing directly to
http://autos.aol.com/acura-rl-2007:8060-review, and then the page
http://autos.aol.com/acura-rl-2006:8060-review gets fetched.

Does anyone have any ideas on why I am seeing this behavior?


Thanks
Anita Bidari (X55746)





Ian Holsman
[EMAIL PROTECTED]
http://parent-chatter.com -- what do parents know?

