Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch_readdb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_readdb?action=diff&rev1=8&rev2=9

Comment:
Update to reflect Nutch 1.3 API

  The CrawlDbReader implements all the read-only parts of accessing our web database. It provides us with a read utility for the CrawlDB.

  Usage:
+ {{{
- bin/nutch org.apache.nutch.crawl.CrawlDbReader (-local | -ndfs <namenode:port>) <db> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] | [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] | [-stats]
+ bin/nutch org.apache.nutch.crawl.CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
- }}}
+ }}}

- '''(-local | -ndfs <namenode:port>)''':
+ '''<crawldb>''': The location of the crawldb directory we wish to read and obtain information from.

- '''<db>''':
+ '''-stats''': This prints the overall statistics to System.out.

- '''[-pageurl url]''':
+ '''-dump <out_dir>''': Enables us to dump the whole crawldb to a text file in any <out_dir> we wish to specify.

- '''[-pagemd5 md5]''':
+ '''-topN <nnnn> <out_dir> [<min>]''': This dumps the top <nnnn> urls sorted by score relevance to any <out_dir> we wish to specify. If the [<min>] parameter is passed in the command, the reader will skip records with scores below this particular value. This can significantly improve retrieval performance of statistics or crawldb dump results.

- '''[-dumppageurl]''':
+ '''-url <url>''': This simply prints information about any particular <url> to System.out.

- '''[-dumppagemd5]''':
-
- '''[-toppages <k>]''':
-
- '''[-linkurl url]''':
-
- '''[-linkmd5 md5]''':
-
- '''[-dumplinks]''':
-
- '''[-stats]''':

  CommandLineOptions
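
As a rough illustration of the new usage string, the sketch below assembles one command line per option. The paths `crawl/crawldb` and `crawldb_dump`, the values `10` and `1.0`, the URL, and the helper name `readdb_cmd` are all assumptions made for the example; none of them come from the wiki page. The helper only prints each command so it can be checked against a real installation before being run.

```shell
# Hypothetical locations; adjust to your own crawl layout.
CRAWLDB="crawl/crawldb"
OUT_DIR="crawldb_dump"

# Hypothetical helper: print a CrawlDbReader command line built from the
# usage string above (it does not execute bin/nutch).
readdb_cmd() {
  printf 'bin/nutch org.apache.nutch.crawl.CrawlDbReader %s' "$CRAWLDB"
  printf ' %s' "$@"
  printf '\n'
}

readdb_cmd -stats                          # overall statistics
readdb_cmd -dump "$OUT_DIR"                # dump the whole crawldb as text
readdb_cmd -topN 10 "$OUT_DIR" 1.0         # top 10 urls, skipping scores below 1.0
readdb_cmd -url http://example.org/        # information for a single url
```

Printing instead of executing is a deliberate choice here: the commands can be reviewed, then pasted into a shell from the Nutch installation directory once the crawldb path is known to exist.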

