Hi Hilkiah G. Lavinier,
Your second point, "also, set db.ignore.external.links to false which allows
nutch to fetch pages outside of initial injected list (i.e. domains)", exactly
answers my question: how to use the ./bin/nutch crawl command to fetch pages
that are not restricted to the initial domains. In other words, the spider can
go outside the initial domains listed in urls/xx.txt.
Thanks, and have a good day
Yong
2008-04-24
ywang
From: Hilkiah Lavinier
Sent: 2008-04-23 21:40:05
To: [email protected]
Cc:
Subject: Re: use crawl command to fetch arbitrary pages?
Ywang,
Not sure what you mean by arbitrary; maybe you need to be a bit more specific here.
However, if you are trying to do a web crawl, here's some advice:
- consider using urlfilter-suffix instead of one of the regex filters.
- also, set db.ignore.external.links to false which allows nutch to fetch pages
outside of initial injected list (i.e. domains)
- lastly, since you must have a starting point, create an inject list which would
allow you to fetch the pages desired
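
For reference, the second suggestion above could look roughly like the
following override. This is only a sketch: it assumes the setting goes in
conf/nutch-site.xml (the usual place for site-specific overrides), and the
comment text is illustrative rather than copied from the Nutch distribution.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumption: this file is conf/nutch-site.xml. -->
  <!-- false = do not ignore outlinks that lead to other hosts, -->
  <!-- so the crawl may leave the initially injected domains.   -->
  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
  </property>
</configuration>
```

The URL filters still apply to every candidate link, so a rule in
crawl-urlfilter.txt that restricts the crawl to one domain would likely need
to be relaxed as well for external pages to be fetched.
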
There is a crawl script available on the nutch wiki which you can use instead of
./bin/nutch crawl.
Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21
----- Original Message ----
From: ywang <[EMAIL PROTECTED]>
To: "[email protected]" <[email protected]>
Sent: Saturday, April 19, 2008 10:32:17 AM
Subject: use crawl command to fetch arbitrary pages?
Dear all,
How can I use the crawl command to fetch arbitrary pages, without being restricted
to the domain defined in crawl-urlfilter.txt?
I tried deleting or commenting out that domain entry, but the shell gives me an
error like "No urls to fetch - check your seed list and URL filters."
In addition, the crawl command works well when the domain entry is set.
Cheers
Yong
2008-04-19
ywang