> How can I see all the webpages nutch crawled?  In other words, I want to
> know which urls nutch has crawled.
> 
> Are all the urls ever crawled stored in crawlDB?  

Run /usr/local/nutch/bin/nutch readdb with the -dump
option and it will dump all the urls out into a new
directory, which you can then browse at your leisure.

Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
        <crawldb>       directory name where crawldb is located
        -stats          print overall statistics to System.out
        -dump <out_dir> dump the whole db to a text file in <out_dir>
        -url <url>      print information on <url> to System.out
        -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
                [<min>] skip records with scores below this value.
                        This can significantly improve performance.
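For example, assuming Nutch is installed under /usr/local/nutch and your
crawl db lives at crawl/crawldb (adjust both paths to your setup), a dump
run might look like:

```shell
# Dump the whole crawl db as plain text into crawl/crawldb_dump
# (the output directory must not already exist)
/usr/local/nutch/bin/nutch readdb crawl/crawldb -dump crawl/crawldb_dump

# The dump comes out as one or more part-* text files; page through them:
cat crawl/crawldb_dump/part-* | less
```

If you only want quick counts rather than the full list, -stats is much
cheaper, and -url <url> prints the record for a single url.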

Or, you can write your own class that outputs
whatever you want from the database...
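Short of writing your own class, the plain-text dump can often be
post-processed directly. A sketch, assuming the usual dump layout where
each record's first line starts with the url followed by a tab (check
your own dump and adjust the pattern if it differs):

```shell
# Print just the urls from a readdb -dump output:
# keep only lines that begin with http:// or https://,
# and emit the first tab-separated field (the url itself).
awk -F'\t' '/^https?:\/\// {print $1}' crawl/crawldb_dump/part-*
```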

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
