Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Aron Ahmadia Sun, 01 Nov 2015 06:01:33 -0800

Is this exposed through the REST API?  I might be able to plot this in
Memex Explorer.


On Sunday, November 1, 2015, Sebastian Nagel (JIRA) <[email protected]> wrote:

>
>      [
> https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Sebastian Nagel reopened NUTCH-2155:
> ------------------------------------
>
> When running the completion statistics on a CrawlDb, an exception is thrown:
> {noformat}
> % nutch crawlcomplete
> usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
> % nutch crawlcomplete ./crawl/crawldb completion_stats domain
> Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> {noformat}
> I had to look into the code to figure out that the parameter
> <inputdirs> is expected to be a comma-separated list of CrawlDb sequence
> files. The following command works:
> {noformat}
> % nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
> {noformat}
> All Nutch tools and utils operating on CrawlDb take just the bare path
> without the current/ subdirectory. Shouldn't the crawlcomplete command
> behave the same?
> Passing more than one CrawlDb may be useful sometimes. However, crawls
> (and their dbs) are usually disjoint. If they are not, the completeness
> statistics are probably incorrect due to duplicates.
>
> > Create a "crawl completeness" utility
> > -------------------------------------
> >
> >                 Key: NUTCH-2155
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: util
> >    Affects Versions: 1.10
> >            Reporter: Michael Joyce
> >            Assignee: Chris A. Mattmann
> >              Labels: memex
> >             Fix For: 1.11
> >
> >
> > I've found it useful to have a tool for dumping some "completeness"
> > information from a crawl, similar to what domainstats does but including
> > fetched and unfetched counts per domain/host. This is especially nice when
> > doing vertical crawls over a few domains, or just to see how much of a
> > host/domain you've covered with your crawl so far.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
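
To make the path question in the quoted comment concrete, here is a minimal
sketch of how a comma-separated inputDirs argument could be mapped onto the
current/ subdirectory of each CrawlDb. It is an illustration only, not Nutch
code: resolve_crawldb_inputs is a hypothetical helper, and the assumption
that only current/ should be read comes from the working command quoted
above.

```python
import posixpath


def resolve_crawldb_inputs(input_dirs, subdir="current"):
    """Split a comma-separated inputDirs value and point each entry at
    its current/ subdirectory (hypothetical helper, not part of Nutch)."""
    resolved = []
    for raw in input_dirs.split(","):
        # Tolerate whitespace around commas and trailing slashes.
        path = raw.strip().rstrip("/")
        if not path:
            continue
        # Accept both the bare CrawlDb path and a path already ending in current/.
        if posixpath.basename(path) != subdir:
            path = posixpath.join(path, subdir)
        resolved.append(path)
    return resolved


if __name__ == "__main__":
    print(resolve_crawldb_inputs("./crawl/crawldb"))
    print(resolve_crawldb_inputs("./crawl/crawldb/current"))
```

With something like this, both the bare path and the current/ path would
resolve to the same input, and old/ would never be picked up. It does not
address the duplicate-counting concern when several overlapping CrawlDbs are
passed.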


-- 
_______________________________

Aron Ahmadia
Computational and Data Scientist

[image: Continuum Analytics] <http://continuum.io>
