Ywang,

Not sure what you mean by "arbitrary"; you may need to be a bit more specific here.

However, if you are trying to do an open-ended web crawl, here's some advice:
- consider using the urlfilter-suffix plugin instead of one of the regex filters;
- set db.ignore.external.links to false, which allows Nutch to fetch pages 
outside the initially injected list (i.e. other domains);
- lastly, since you must have a starting point, create an inject (seed) list 
which points Nutch at the pages you want to fetch (a sketch follows this list).
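
As a minimal sketch of those last two points (the example URL, paths, and 
values are illustrative, assuming a stock Nutch 0.9-style install):

    # seed list: one URL per line
    mkdir urls
    echo "http://www.example.com/" > urls/seed.txt

    # conf/crawl-urlfilter.txt: replace the per-domain accept line, e.g.
    #   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
    # with a catch-all so no domain restriction applies:
    +.

    # conf/nutch-site.xml: make sure external links are followed
    <property>
      <name>db.ignore.external.links</name>
      <value>false</value>
    </property>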

There is a crawl script available on the Nutch wiki which you can use instead of 
./bin/nutch crawl.
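
If I remember correctly, the one-shot crawl command reads conf/crawl-urlfilter.txt 
while the individual tools read conf/regex-urlfilter.txt, so apply the catch-all 
to whichever you use. Roughly, the wiki script automates the same loop you could 
run by hand (the depth/topN values below are just examples):

    # one-shot:
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

    # or the step-by-step equivalent for a single fetch round:
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment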

Regards,
 
Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica

Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487


Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21

----- Original Message ----
From: ywang <[EMAIL PROTECTED]>
To: "[email protected]" <[email protected]>
Sent: Saturday, April 19, 2008 10:32:17 AM
Subject: use crawl command to fetch arbitrary pages?


Dear all,
How can I use the crawl command to fetch arbitrary pages, without being restricted 
to the domain defined in crawl-urlfilter.txt?

I tried to delete or comment out that domain property, but the shell gives me an 
error like "No urls to fetch - check your seed list and URL filters."

Oh, in addition, the crawl command works well when the domain property is set. 

Cheers

Yong
2008-04-19 


