Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch domainstats" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch%20domainstats?action=diff&rev1=1&rev2=2

- There is a tool which may tell you some more about which domains have been 
fetched. You can try it through something like this
+ domainstats is an alias for org.apache.nutch.util.domain.DomainStatistics
  
+ In short, it's a tool which provides information about which domains have been 
fetched. 
  
  == usage ==
  
  {{{
- $ bin/nutch org.apache.nutch.util.domain.DomainStatistics inputDirs outDir 
host|domain|suffix [numOfReducer]
+ $ bin/nutch domainstats inputDirs outDir host|domain|suffix 
[numOfReducer]
  }}}
  
  == example ==
  
  {{{
- $ bin/nutch org.apache.nutch.util.domain.DomainStatistics 
hdfs://nn:9000/user/otis/crawl/crawldb/current hdfs://nn:9000/user/otis/ds-host 
host 8
+ $ bin/nutch domainstats hdfs://nn:9000/user/otis/crawl/crawldb/current 
hdfs://nn:9000/user/otis/ds-host host 8
  }}}
  
  You can then -cat the ds-host output from DFS and pipe it to sort -nrk1 to sort 
by count, highest count first.
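
The sorting step can be sketched as below. This is only an illustration: the sample rows are made up, and the "count, tab, host" layout is an assumption inferred from the sort -nrk1 hint above rather than taken from the tool's documentation. Against a real cluster the input would presumably come from something like "hadoop fs -cat" on the part files under ds-host (the exact part file names are also an assumption).

```shell
# Hypothetical sample rows in an assumed "count<TAB>host" layout.
# On a real cluster this data would come from DFS instead, e.g.
#   hadoop fs -cat hdfs://nn:9000/user/otis/ds-host/part-*
sample="$(printf '12\texample.org\n340\tapache.org\n7\texample.com\n')"

# -n: compare numerically, -r: reverse (largest first), -k1: key starts at field 1.
sorted="$(printf '%s\n' "$sample" | sort -nrk1)"

printf '%s\n' "$sorted"
```

With the sample rows above, the host with the largest count ends up on the first line.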
