Is this exposed through the REST API? I might be able to plot this in Memex Explorer.

On Sunday, November 1, 2015, Sebastian Nagel (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Sebastian Nagel reopened NUTCH-2155:
> ------------------------------------
>
> When running the completion statistics on a CrawlDb, an exception is thrown
> {noformat}
> % nutch crawlcomplete
> usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
> % nutch crawlcomplete ./crawl/crawldb completion_stats domain
> Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> {noformat}
> I had to take a look into the code to figure out that the parameter
> <inputdirs> is expected as comma-separated list of CrawlDb sequence files.
> The following command works:
> {noformat}
> % nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
> {noformat}
> All Nutch tools and utils operating on CrawlDb take just the bare path
> without the current/ subdirectory. Shouldn't the crawlcomplete command
> behave the same?
> To pass more than one CrawlDb may be useful sometimes. However, usually
> crawls (and their dbs) are disjoint. If they are not the completeness
> statistics are probably not correct due to duplicates.
>
>
> > Create a "crawl completeness" utility
> > -------------------------------------
> >
> >                 Key: NUTCH-2155
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: util
> >    Affects Versions: 1.10
> >            Reporter: Michael Joyce
> >            Assignee: Chris A. Mattmann
> >              Labels: memex
> >             Fix For: 1.11
> >
> >
> > I've found it useful to have a tool for dumping some "completeness"
> > information from a crawl similar to how domainstats does but including
> > fetched and unfetched counts per domain/host. This is especially nice when
> > doing vertical crawls over a few domains or just to see how much of a
> > host/domain you've covered with your crawl so far.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

--
_______________________________
Aron Ahmadia
Computational and Data Scientist
Continuum Analytics <http://continuum.io>

