Andrzej,

Thanks a lot for the answers. Sorry for being persistent in my posts... I was going on vacation for 3 weeks and needed to finish my work before leaving. I appreciate your help.

Regards,
Daniel
On 6/16/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> Daniel D. wrote:
> > Dear Nutch Developers,
> >
> > I'm trying to get answers to my questions below but nobody is
> > responding. This is why I'm trying to post my questions again.
>
> Hi Daniel,
>
> Please see my answers below. Sometimes it takes patience, people have
> busy schedules...
>
> > ----------- Question # 1 ------------------------
> >
> > As I understand it, the Nutch crawler employs crawl-and-stop with a
> > threshold, used with the -topN parameter. Please correct me if I'm
> > wrong. This also means that some sites will end up with a different
> > depth than others.
>
> Yes and no - some pages that are located deep could have a high score
> (because of many inlinks), so they would be put on the list for
> fetching, even though pages that are closer to the root URL may not
> have been fetched yet, or indeed will never be fetched because they
> score too low.
>
> > Is there a way to control the crawling depth per domain and the
> > number of URLs per domain, as well as the total number of domains
> > crawled (in this case it's -topN)?
>
> -topN controls fetching by score. What you want is to control fetching
> by depth. Currently the FetchListTool doesn't implement this, but it
> would be trivial to add.
>
> > ----------- Question # 2 ------------------------
> >
> > The whole-web crawling tutorial advises using the following command
> > sequence:
> >
> >   fetch
> >   updatedb db
> >
> > and then: generate db segments -topN 1000
> >
> > Use of the -topN parameter implies that updatedb db does some
> > analysis on the fetched data. The analyze command
> > (net.nutch.tools.LinkAnalysisTool) is not mentioned in the tutorial.
> > The DissectingTheNutchCrawler article
> > (http://wiki.apache.org/nutch/DissectingTheNutchCrawler) includes
> > this command in the sequence of commands for whole-internet crawling.
> >
> > When should I use the analyze command, and when can I skip it?
>
> With the default settings you don't need to use this command. Nutch
> approximates full web-graph scoring with a score based on the number
> of inlinks. Additionally, this command is known to be slightly
> broken...
>
> > I'm trying to get a sense of how much memory (hard drive and RAM)
> > the WebDB will require, and now I'm also concerned about how many
> > machine resources analyze will consume. Nobody has provided this
> > information yet. I would appreciate it if somebody would share their
> > knowledge and thoughts here.
>
> Don't use analyze - it will consume any disk space that you throw at
> it ;-)
>
> The WebDB normally consumes ca. 2 kB per page. This may temporarily
> increase to 3x this number during DB updating.
>
> > I'm looking for something like: for 1,000,000 documents the WebDB
> > will take approximately XX GB, and running bin/nutch updatedb on
> > 1,000,000 documents will use up to XX MB of RAM.
>
> The last figure depends on the settings of your JVM, i.e. what heap
> size you set for it. Updatedb should not consume much memory in any
> case.
>
> > ----------- Question # 3 ------------------------
> >
> > After the initial inject and the subsequent fetch and updatedb
> > command(s), can I use inject to add more URLs to the WebDB?
>
> Yes, of course.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
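On Question 1: since the FetchListTool doesn't yet limit depth or URLs per domain, one workaround (a sketch of my own, not a built-in Nutch feature; the file names and the limit of 10 are placeholders) is to cap the seed list per host before injecting it:

  # keep at most 10 URLs per host from a larger URL list
  awk -F/ '{ host = $3; if (count[host]++ < 10) print }' all_urls.txt > seeds.txt

This only bounds the injected seeds, not URLs discovered later during fetching, so it is a partial substitute for real per-domain limits in the fetchlist generator.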
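On Questions 2 and 3, the whole-web cycle the tutorial describes looks roughly like this (a minimal sketch based on the 0.7-era commands quoted above; the seed file and the -topN value are placeholders, and exact flags may differ between Nutch versions, so run bin/nutch with no arguments to check usage):

  # one-time setup: create an empty WebDB
  bin/nutch admin db -create

  # inject seed URLs; per the answer to Question 3, more URLs can be
  # injected the same way at any later point
  bin/nutch inject db -urlfile seeds.txt

  # then repeat the generate/fetch/updatedb cycle:
  bin/nutch generate db segments -topN 1000   # fetchlist of the 1000 top-scoring URLs
  s=`ls -d segments/2* | tail -1`             # pick the newest segment
  bin/nutch fetch $s                          # fetch the listed pages
  bin/nutch updatedb db $s                    # fold the results back into the WebDB

Each pass extends the crawl one hop, and -topN bounds each pass by score rather than by depth, which is exactly the behaviour discussed under Question 1.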
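Plugging Daniel's 1,000,000-document example into the ~2 kB/page figure quoted above gives a rough answer to the sizing part of Question 2 (an estimate only; actual size varies with URL length and link density):

  1,000,000 pages x ~2 kB/page          =  ~2 GB of disk for the WebDB
  peak during updatedb:  ~3 x 2 GB      =  ~6 GB of free disk space

RAM for updatedb is bounded by the JVM heap (the -Xmx setting) rather than by collection size, as noted above.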
