anyone?

On Nov 8, 2007 9:04 AM, Josh Attenberg <[EMAIL PROTECTED]> wrote:

> my (possibly naive) approach to crawling is what was suggested in the
> Nutch 0.7 tutorial:
> 1 admin db -create
> 2 inject the db with my list of seed URLs
>
> while(not enough web pages crawled)
> {
> generate segments
> fetch segments/*
> updatedb db segments/*
> }
>
> Is there a better way to do this? Is there a way to automate the process?
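The loop sketched above can be scripted. Below is a minimal sketch assuming the 0.7-era command names from the tutorial (`admin`, `inject`, `generate`, `fetch`, `updatedb`); the seed file name and the fixed round count are assumptions, and the exact flags should be checked against `bin/nutch` for your Nutch version:

```shell
#!/bin/sh
# Sketch: automate the generate/fetch/updatedb loop from the message above.
# Command names follow the Nutch 0.7-era whole-web tutorial; verify flags
# against your own bin/nutch before relying on this.
DB=db
ROUNDS=10            # hypothetical stopping criterion: a fixed number of rounds

bin/nutch admin $DB -create
bin/nutch inject $DB -urlfile seeds.txt   # seeds.txt: your seed URL list (assumed name)

i=1
while [ "$i" -le "$ROUNDS" ]; do
  bin/nutch generate $DB segments
  seg=`ls -d segments/2* | tail -1`       # pick up the newest segment directory
  bin/nutch fetch $seg
  bin/nutch updatedb $DB $seg
  i=`expr $i + 1`
done
```

A fixed round count stands in for "not enough web pages crawled"; a real run might instead stop when `generate` produces no new fetch list.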
>
>
> On Nov 6, 2007 10:06 AM, Josh Attenberg <[EMAIL PROTECTED]> wrote:
>
> > I am conducting web research, and think that Nutch will be a useful tool
> > to aid my quest for information. I am interested in performing a large crawl
> > (100 million+ pages), analyzing the contents of these pages, and
> > building a link graph. I have figured out how to get a large list of pages
> > with fetch, then bootstrap the to-crawl list and re-crawl as per
> > http://lucene.apache.org/nutch/tutorial.html.
> > If this isn't the best way to perform a large crawl, please provide
> > suggestions. I don't know if Nutch has any tools for building a web graph,
> > but I would have no trouble building it on my own if I knew how to access
> > the pages' contents. Unfortunately I have no idea how to do this. Once pages
> > are fetched, how does one view the HTML data?
> >
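On viewing fetched data: Nutch ships a segment-dumping tool for exactly this. A minimal sketch, assuming the 0.7-era tool name `segread` (later versions renamed it `readseg`); run `bin/nutch` with no arguments to see which your build provides, and check the tool's usage message for the exact dump flags:

```shell
# Sketch: inspect the content of a fetched segment as plain text.
# Tool name varies by Nutch version (segread in 0.7, readseg later);
# verify the invocation with `bin/nutch` before use.
seg=`ls -d segments/2* | tail -1`   # newest fetched segment
bin/nutch segread -dump $seg        # dumps fetched/parsed page data as text
```

The dump includes the raw fetched content alongside parse data, which is one way to get at the HTML for link-graph construction.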
>
>
