Hi Anita.
I tried crawling autos.aol.com, and I could find pages similar to
what you're looking at in 3 crawls. (I injected http://autos.aol.com/
and added autos.aol.com to my regex filter to allow it.)
e.g.
fetching http://autos.aol.com/bmw-650-2007:8774-photos
fetching http://autos.aol.com/article/general/v2/_a/auto-financing-101/20060818153509990001
fetching http://autos.aol.com/options_trimless?v=8544
fetching http://autos.aol.com/toyota-camry-hybrid-2007:8322-overviewl
fetching http://autos.aol.com/bmw-m-2007:8905-overview
fetching http://autos.aol.com/getaquote?myid=8623
fetching http://autos.aol.com/options_trimless?v=8226
fetching http://autos.aol.com/options_trimless?v=7803
fetching http://autos.aol.com/article/power/v2/_a/2006-dodge-charger-srt8/20061030193309990001
fetching http://autos.aol.com/bmw-x3-2007:8770-specs
fetching http://autos.aol.com/saturn-vue-2007:8371-overview
fetching http://autos.aol.com/aston-martin-vanquish-2006:8115-overview
fetching http://autos.aol.com/options_trimless?v=8394
fetching http://autos.aol.com/jaguar-listings:JA---
fetching http://autos.aol.com/volkswagen-rabbit-2007:8554-overview
fetching http://autos.aol.com/bmw-x5-2007:8817-overview
fetching http://autos.aol.com/audi-a4-2007:8622-specs
fetching http://autos.aol.com/options_trimless?v=8416
fetching http://autos.aol.com/getaquote?myid=8774
The difference is that I am using the latest Nutch (SVN head), and
am just using a local store, not Hadoop.
What I would do next if I were you is check your regex filters to
make sure you are not blocking URLs with a colon ':' in them for
some strange reason, and possibly upgrade to the latest and greatest
version of Nutch (0.9.1).
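To make the filter check concrete, here is a rough sketch in Python (not Nutch code) of how I would test a URL against a list of rules. It assumes the filter file is a list of regexes prefixed with '+' (accept) or '-' (reject), that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is dropped. The rule lists and the function name is_accepted are made up for illustration, so substitute the lines from your own filter file.

```python
import re

def is_accepted(rules, url):
    """Apply (sign, pattern) rules in order; the first pattern found in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

# Hypothetical rule set that allows the site.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*autos\.aol\.com/"),
    ("-", r"."),
]

# Same rules, preceded by a hypothetical rule that rejects any URL
# with a colon in its path (i.e. after the host name).
RULES_BLOCKING_COLON = [("-", r"^[a-z]+://[^/]+/.*:")] + RULES

if __name__ == "__main__":
    test_urls = [
        "http://autos.aol.com/acura-rl-2006:8060-review",
        "http://autos.aol.com/acura-rl-2007:8060-review",
        "http://autos.aol.com/getaquote?myid=8623",
    ]
    for name, rules in (("plain", RULES), ("colon-blocking", RULES_BLOCKING_COLON)):
        for url in test_urls:
            verdict = "accept" if is_accepted(rules, url) else "reject"
            print(name, verdict, url)
```

If one of your own rules rejects the acura-rl URLs here, that rule is the first thing I would look at.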
regards
Ian.
On 18/04/2007, at 5:56 AM, [EMAIL PROTECTED] wrote:
Hi
I am a new Nutch user, and am using Nutch 8.1 with Hadoop. The
domain I am trying to crawl is http://autos.aol.com . I am crawling
to a depth of 10.
There are certain pages that Nutch could not fetch. An example
would be
http://autos.aol.com/acura-rl-2006:8060-review .
The referring URL to this page is
http://autos.aol.com/acura-rl-2007:8060-review . This URL was
in the fetch list.
When I did a mini crawl pointing directly to
http://autos.aol.com/acura-rl-2007:8060-review , the page
http://autos.aol.com/acura-rl-2006:8060-review did get fetched.
Does anyone have any ideas on why I am seeing this behavior?
Thanks
Anita Bidari (X55746)
Ian Holsman
[EMAIL PROTECTED]
http://parent-chatter.com -- what do parents know?