[ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel reopened NUTCH-2155:
------------------------------------
When running the completion statistics on a CrawlDb, an exception is thrown:
{noformat}
% nutch crawlcomplete
usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
% nutch crawlcomplete ./crawl/crawldb completion_stats domain
Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
{noformat}
I had to look into the code to figure out that the inputDirs parameter is
expected to be a comma-separated list of CrawlDb sequence files. The following
command works:
{noformat}
% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
{noformat}
All Nutch tools and utilities that operate on a CrawlDb take just the bare path,
without the current/ subdirectory. Shouldn't the crawlcomplete command behave
the same? Passing more than one CrawlDb may be useful sometimes. However, crawls
(and their dbs) are usually disjoint. If they are not, the completeness
statistics are probably incorrect due to duplicates.
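
A minimal sketch of how the bare-path behaviour could look, assuming the tool
keeps accepting a comma-separated list and simply appends current/ to every
entry that does not already name it. The class CrawlDbInputDirs and its
resolve method are hypothetical names used for illustration only; this is not
the code in CrawlCompletionStats, and it works on plain strings rather than
Hadoop paths.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): maps CrawlDb paths to their current/ subdirectory. */
public class CrawlDbInputDirs {

    // Subdirectory holding the live CrawlDb data, as used in the working command above.
    private static final String CURRENT = "current";

    /**
     * Splits a comma-separated list of CrawlDb paths and appends "current"
     * to each entry that does not already end with it.
     */
    public static List<String> resolve(String inputDirs) {
        List<String> resolved = new ArrayList<>();
        for (String dir : inputDirs.split(",")) {
            String trimmed = dir.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            // Drop trailing slashes so "crawldb/" and "crawldb" are treated alike.
            while (trimmed.length() > 1 && trimmed.endsWith("/")) {
                trimmed = trimmed.substring(0, trimmed.length() - 1);
            }
            if (trimmed.equals(CURRENT) || trimmed.endsWith("/" + CURRENT)) {
                resolved.add(trimmed); // already points at current/
            } else {
                resolved.add(trimmed + "/" + CURRENT);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Both invocations shown above resolve to the same input directory.
        System.out.println(resolve("./crawl/crawldb"));
        System.out.println(resolve("./crawl/crawldb/current"));
    }
}
```

With this normalization both the bare path and the explicit current/ path are
accepted, so existing invocations would keep working.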
> Create a "crawl completeness" utility
> -------------------------------------
>
> Key: NUTCH-2155
> URL: https://issues.apache.org/jira/browse/NUTCH-2155
> Project: Nutch
> Issue Type: Improvement
> Components: util
> Affects Versions: 1.10
> Reporter: Michael Joyce
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness"
> information from a crawl, similar to what domainstats does but including
> fetched and unfetched counts per domain/host. This is especially nice when
> doing vertical crawls over a few domains, or just to see how much of a
> host/domain you've covered with your crawl so far.