[jira] [Reopened] (NUTCH-2155) Create a "crawl completeness" utility

Sebastian Nagel (JIRA) Sun, 01 Nov 2015 04:19:51 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sebastian Nagel reopened NUTCH-2155:
------------------------------------

When running the completion statistics on a CrawlDb, an exception is thrown:
{noformat}
% nutch crawlcomplete
usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
% nutch crawlcomplete ./crawl/crawldb completion_stats domain
Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
{noformat}
I had to look into the code to figure out that the inputDirs parameter is 
expected to be a comma-separated list of CrawlDb sequence files. The following 
command works:
{noformat}
% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
{noformat}
All Nutch tools and utilities operating on a CrawlDb take just the bare path 
without the current/ subdirectory. Shouldn't the crawlcomplete command behave 
the same? Passing more than one CrawlDb may sometimes be useful. However, 
crawls (and their dbs) are usually disjoint. If they are not, the completeness 
statistics are probably incorrect due to duplicates.
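
A minimal sketch of the behaviour suggested above, assuming the tool keeps its 
comma-separated inputDirs argument: each entry that has a current/ 
subdirectory is treated as a bare CrawlDb path and resolved to that 
subdirectory, while any other entry is passed through unchanged, so the 
command that works today keeps working. The class and method names are 
hypothetical and not part of Nutch, and the sketch checks the local file 
system only, whereas the real tool would presumably go through the Hadoop 
FileSystem API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates one way to accept
// bare CrawlDb paths as well as explicit .../current paths.
final class CrawlDbInputResolver {

    // Subdirectory holding the active CrawlDb data, as used in the
    // working command above (./crawl/crawldb/current).
    static final String CURRENT_NAME = "current";

    private CrawlDbInputResolver() {
    }

    // Splits the comma-separated inputDirs argument and resolves each entry.
    static List<Path> resolveInputDirs(String inputDirs) {
        List<Path> resolved = new ArrayList<>();
        for (String dir : inputDirs.split(",")) {
            String trimmed = dir.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            Path path = Paths.get(trimmed);
            Path current = path.resolve(CURRENT_NAME);
            // Assumption: a directory containing current/ is a bare CrawlDb
            // path; anything else is used exactly as given.
            resolved.add(Files.isDirectory(current) ? current : path);
        }
        return resolved;
    }
}
```

With such a helper, both ./crawl/crawldb and ./crawl/crawldb/current would 
resolve to the same input directory. It does nothing about the duplicate 
problem when several overlapping CrawlDbs are passed.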

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" 
> information from a crawl, similar to what domainstats does but including 
> fetched and unfetched counts per domain/host. This is especially nice when 
> doing vertical crawls over a few domains, or just to see how much of a 
> host/domain you've covered with your crawl so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
