My (possibly naive) approach to crawling is what was suggested in the Nutch
version 0.7 tutorial:
1. admin db -create
2. inject db with my list of seed URLs
while (not enough web pages crawled)
{
    generate segments
    fetch segments/*
    updatedb db segments/*
}
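As a rough sketch, the loop above could be scripted like this. This assumes the command names and flags from the 0.7 whole-web tutorial (admin, inject -urlfile, generate, fetch, updatedb), a seeds.txt file with one URL per line, and an arbitrary fixed number of rounds standing in for the "not enough pages" test:

```shell
#!/bin/sh
# Sketch: automate the Nutch 0.7 generate/fetch/updatedb loop.
# Run from the Nutch install directory. seeds.txt and DEPTH are
# assumptions for illustration, not part of the tutorial.
DB=db
SEGMENTS=segments
DEPTH=5    # number of crawl rounds; substitute your own stopping test

bin/nutch admin $DB -create
bin/nutch inject $DB -urlfile seeds.txt

i=0
while [ $i -lt $DEPTH ]; do
    bin/nutch generate $DB $SEGMENTS
    # generate creates a new timestamped segment; pick the newest one
    segment=$(ls -d $SEGMENTS/* | tail -1)
    bin/nutch fetch $segment
    bin/nutch updatedb $DB $segment
    i=$((i + 1))
done
```

The only change from the manual procedure is selecting the newest segment after each generate step, so fetch and updatedb operate on that round's segment rather than on segments/*.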
Is there a better way to do this, or a good way to automate the process?
On Nov 6, 2007 10:06 AM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
> I am conducting web research, and I think that Nutch will be a useful
> tool to aid my quest for information. I am interested in performing a
> large crawl (100 million pages+), analyzing the contents of these pages,
> and building a link graph. I have figured out how to get a large list of
> pages with fetch, then bootstrap the to-crawl list and re-crawl as per
> http://lucene.apache.org/nutch/tutorial.html.
> If this isn't the best way to perform a large crawl, please provide
> suggestions. I don't know if Nutch has any tools for building a web
> graph, but I would have no trouble building it on my own if I knew how
> to access the pages' contents. Unfortunately I have no idea how to do
> this. Once pages are fetched, how does one view the HTML data?
>