I was trying to fetch one specific URL with a ? symbol, and Nutch was refusing to
fetch it. But if I fetch the domain itself, Nutch fetches links with the ? symbol
as well. Now I have noticed that Nutch did not fetch all files on this domain,
but if I point Nutch at an unfetched file's URL, it fetches it. I used this command:
"bin/nutch crawl urls -dir crawl -depth 6". If I specify -topN 50, Nutch does
not fetch my files at all.
So, my question is: how do I make Nutch fetch all files under a given domain?
Thanks.
A.
-----Original Message-----
From: [email protected]
To: [email protected]
Sent: Mon, 2 Mar 2009 3:36 pm
Subject: Re: urls with ? and & symbols
Hello,
I have one specific domain. I tested further, and it looks like Nutch fetches
this domain's other links, but not the ones with ?. Nutch also fetches other
domains' URLs with the ? symbol.
How can I tell whether robots.txt on this domain blocks these specific links
from being fetched?
Thanks.
A.
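One way to answer the robots.txt question is to fetch the domain's robots.txt and test the failing URLs against it. Here is a quick sketch using Python's standard urllib.robotparser; the robots.txt content and the example.com URLs are made-up placeholders, so substitute your own domain:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content for illustration. For a real check, use
# rp.set_url("http://YOUR.DOMAIN/robots.txt") followed by rp.read()
# instead of parsing a hard-coded string.
robots_txt = """\
User-agent: *
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Test the exact URLs that Nutch refuses to fetch
for url in ("http://example.com/page.html",
            "http://example.com/search?q=nutch"):
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
```

If a URL comes back "blocked" here, Nutch's robots handling (not the URL filters) is the reason it is skipped.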
-----Original Message-----
From: Bartosz Gadzimski <[email protected]>
To: [email protected]
Sent: Sun, 1 Mar 2009 11:13 am
Subject: Re: urls with ? and & symbols
[email protected] writes:
> Hello,
>
> I use nutch-0.9 and try to index URLs with ? and & symbols. I have commented out the line -[...@=] in the conf/crawl-urlfilter.txt, conf/automaton-urlfilter and conf/regex-urlfilter.txt files.
> However, Nutch still ignores these URLs.
>
> Does anyone know how this can be fixed?
>
> Thanks in advance.
> A.
Hi,

If you commented out those lines, it should be fine. That part is correct,
so the problem is somewhere else.
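For reference, the relevant part of a stock nutch-0.9 conf/crawl-urlfilter.txt looks roughly like the sketch below; the skip rule is what drops query-style URLs, and commenting it out (while keeping an accept rule for your domain) should let the ? and & URLs through. MY.DOMAIN.NAME is the placeholder from the stock file, which you must replace with your own host:

```
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
```

Note that rules are applied in order and the first match wins, so an accept rule placed after an active skip rule never gets a chance to fire.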

You must give us more information, like:
- does your Nutch crawl and index "normal" URLs (without ? and &)?
- are you crawling domains that are NOT blocked in crawl-urlfilter?
- does robots.txt on this domain block your URLs?
- are you talking about one specific domain, or many different ones?

Thanks,
Bartosz