Re: (NUTCH-1071) Crawldb update to total counts per status

Markus Jelsma Fri, 29 Jul 2011 07:12:14 -0700

Another cycle is complete; the results are incremental but similar. The
TOTAL_URLS of stats is reduce_output_records + 1, which seems alright.

I still can't find any order in the numbers.
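
As a sanity check on the earlier cycle quoted below, here is a small sketch
that only re-adds the figures from the two listings. The variable names are
made up for illustration, and the update values are taken from the last
column of the counter table. Both listings sum to the same total, so for that
cycle only the distribution over the statuses differs, not the number of
entries.

```python
# Last column of the crawldb update counters quoted below.
update = {
    "db_unfetched": 16_909_397,
    "db_fetched": 19_545_591,
    "db_gone": 955_701,
    "db_redir_temp": 1_036_840,
    "db_redir_perm": 1_195_539,
    "db_notmodified": 1_264_001,
    "unknown": 2_315,
}

# Per-status figures from the readdb -stats output quoted below.
stats = {
    "db_unfetched": 26_788_643,
    "db_fetched": 12_345_476,
    "db_gone": 763_463,
    "db_redir_temp": 461_511,
    "db_redir_perm": 431_595,
    "db_notmodified": 118_696,
}
stats_total = 40_909_384  # "TOTAL urls" line of readdb -stats

update_sum = sum(update.values())
stats_sum = sum(stats.values())

# Difference per status (update minus stats); a status missing from stats counts as 0.
diffs = {status: count - stats.get(status, 0) for status, count in update.items()}

print("update sum:", update_sum)
print("stats sum: ", stats_sum, "(reported total:", stats_total, ")")
for status, diff in diffs.items():
    print(f"{status:16s}{diff:+,d}")
```

The per-status differences cancel out exactly, with the 2,315 'unknown'
entries included in the balance.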

On Friday 29 July 2011 11:43:19 Julien Nioche wrote:
> Hi Markus,
> 
> Can't really think of a reason why they could differ. You called 'readdb
> -stats' right after the crawldb update?
> Could it be a problem with readdb -stats?  And why are we seeing an
> 'unknown' status in the crawldb update?
> 
> That's definitely interesting
> 
> Julien
> 
> On 29 July 2011 10:23, Markus Jelsma <[email protected]> wrote:
> > Hi Julien,
> > 
> > Can you explain the following? I've got here some output from a readdb -stats
> > job and the output of the most recent crawldb update job. They differ a lot!
> > 
> > update:
> > db_redir_temp   0       1,036,840       1,036,840
> > db_redir_perm   0       1,195,539       1,195,539
> > unknown         0       2,315   2,315
> > db_unfetched    0       16,909,397      16,909,397
> > db_notmodified  0       1,264,001       1,264,001
> > db_gone         0       955,701         955,701
> > db_fetched      0       19,545,591      19,545,591
> > 
> > stats:
> > TOTAL urls: 40909384
> > status 1 (db_unfetched):    26788643
> > status 2 (db_fetched):      12345476
> > status 3 (db_gone): 763463
> > status 4 (db_redir_temp):   461511
> > status 5 (db_redir_perm):   431595
> > status 6 (db_notmodified):  118696
> > 
> > Thanks
> > 
> > > Crawldb update to total counts per status
> > > -----------------------------------------
> > > 
> > >                  Key: NUTCH-1071
> > >                  URL: https://issues.apache.org/jira/browse/NUTCH-1071
> > >              Project: Nutch
> > >           Issue Type: Improvement
> > >     Affects Versions: 1.4
> > >             Reporter: Julien Nioche
> > >             Assignee: Julien Nioche
> > >             Priority: Trivial
> > >              Fix For: 1.4
> > > 
> > > The reduce phase of the crawldb update outputs all the entries that will be
> > > found in the updated crawldb. We can use the counters to summarise the
> > > number of URLs per status, which is a bit like the readdb -stats
> > > functionality except that it does not require an additional step. This
> > > is a useful way of monitoring the progress of a crawl using the Hadoop
> > > JobTracker UI.
> > > 
> > > --
> > > This message is automatically generated by JIRA.
> > > For more information on JIRA, see: http://www.atlassian.com/software/jira
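
The idea in the issue description quoted above can be illustrated with a
minimal sketch. This is a plain Python stand-in, not the Nutch code or the
Hadoop counter API; the function name and the sample URLs are hypothetical.
The reduce step passes every entry through and bumps one counter per status
as it goes, so the per-status totals come for free with the update.

```python
from collections import Counter


def reduce_crawldb(entries):
    """Pass every (url, status) pair through and tally one counter per status."""
    status_counts = Counter()
    output = []
    for url, status in entries:
        output.append((url, status))   # entry that ends up in the updated crawldb
        status_counts[status] += 1     # stand-in for incrementing a job counter
    return output, status_counts


# Hypothetical sample entries, for illustration only.
sample = [
    ("http://example.org/a", "db_fetched"),
    ("http://example.org/b", "db_unfetched"),
    ("http://example.org/c", "db_fetched"),
    ("http://example.org/d", "db_gone"),
]

output, status_counts = reduce_crawldb(sample)
for status, count in sorted(status_counts.items()):
    print(status, count)
```

By construction the counters here always add up to the number of entries
written, which is the same property checked against the real figures above.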

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
