Wojciech,

1. list of crawled pages
There's the 'nutch admin' command:

  java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs <namenode:port>)
      db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]

Using '-textdump' will dump the contents of the WebDB to a text file.

Then there is the 'nutch readdb' command that you can use to dump all pages
or links:

  java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db>
      [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] |
      [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] |
      [-dumplink url] | [-showlinks url] | [-showlinksdeep url] | [-stats] |
      [-detailstats]

2. crawled content by URL (of course if page is crawled successfully)

> How can we achieve this? I would appreciate if someone more proficient
> would point us where to look.
>
> We are using Nutch 0.8x with hadoop dfs on multiple machines
>
> Thanks in advance,
> Wojtek

In 0.8 this should be possible. Did you check the methods in NutchBean or
FetcherOutput?

Rgrds,
Thomas
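PS: for example, assuming your WebDB lives under crawl/db (adjust the path
to wherever your crawl put it), the two dumps would look like:

  bin/nutch admin crawl/db -textdump dump
  bin/nutch readdb crawl/db -dumppageurl

The first writes the WebDB contents to text files with the prefix 'dump';
the second dumps every page record in the db.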

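PPS: for 2., here is a rough, untested sketch of pulling the cached content
for a URL through NutchBean under 0.8. The class name DumpContent is just
for illustration, and I'm assuming 'searcher.dir' in your config points at
the crawl directory and that url: queries are enabled in your query plugins:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Illustrative sketch: look a page up by URL and print its cached content.
public class DumpContent {
  public static void main(String[] args) throws Exception {
    String url = args[0];
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);  // opens index + segments under searcher.dir

    // Find the page by its url field (assumes the url query filter is enabled).
    Hits hits = bean.search(Query.parse("url:" + url, conf), 1);
    if (hits.getTotal() == 0) {
      System.err.println("no hit for " + url);
      return;
    }
    HitDetails details = bean.getDetails(hits.getHit(0));

    // Raw fetched bytes, same call the cached-page view uses.
    byte[] content = bean.getContent(details);
    System.out.write(content);
    System.out.flush();
  }
}

If you haven't built an index yet, the raw Content records also live in each
segment's content/ directory, but the bean is the quickest route once the
pages are indexed.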