[Nutch Wiki] Update of "bin/nutch readdb" by kiranchitturi

Apache Wiki Wed, 20 Mar 2013 11:06:10 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "bin/nutch readdb" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/bin/nutch%20readdb

New page:
Readdb is an alias for org.apache.nutch.crawl.CrawlDbReader.

The CrawlDbReader implements all the read-only parts of accessing the crawldb, 
our web database. It serves as the read utility for the crawldb.

Usage: 

{{{
bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
}}} 

'''<crawldb>''': The location of the crawldb directory we wish to read and 
obtain information from.

'''-stats''': This prints the overall statistics to System.out.

'''-dump <out_dir>''': Enables us to dump the whole crawldb to a text file in 
any <out_dir> we wish to specify.

'''[-regex <expr>]''': Optional with -dump. Only records matching the regular 
expression <expr> are dumped.

'''[-status <status>]''': Optional with -dump. Only records with the given 
CrawlDatum status are dumped.

'''-topN <nnnn> <out_dir> [<min>]''': This dumps the top <nnnn> urls, sorted by 
score, to any <out_dir> we wish to specify. If the optional <min> parameter is 
passed in the command, the reader skips records with scores below this 
particular value. This can significantly improve performance, because fewer 
records have to be sorted and written.

'''-url <url>''': This prints the crawldb information stored for the given 
<url> to System.out.
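
The four modes described above might be invoked roughly as follows. This is only a sketch: the crawldb location (crawl/crawldb), the output directory names, the minimum score of 0.5, the example URL, and the readdb wrapper function are all assumptions made for illustration and are not part of Nutch. By default the wrapper only prints each command line; set RUN=1 to actually execute it from the Nutch installation directory.

```shell
#!/bin/sh
# Assumed locations; adjust to your own installation and crawl directory.
NUTCH="bin/nutch"
CRAWLDB="crawl/crawldb"

# Hypothetical helper: print a readdb command line, and execute it
# only when RUN=1 is set (dry run by default).
readdb() {
  echo "$NUTCH readdb $CRAWLDB $*"
  if [ "${RUN:-0}" = "1" ]; then
    "$NUTCH" readdb "$CRAWLDB" "$@"
  fi
}

readdb -stats                        # overall statistics
readdb -dump crawldb_dump            # dump the whole crawldb into crawldb_dump
readdb -topN 10 crawldb_top10 0.5    # top 10 urls, skipping scores below 0.5
readdb -url http://example.com/      # information for a single url
```

Running the script without RUN=1 is a convenient way to check the argument order before touching a real crawldb.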



CommandLineOptions
