Wojtek, those commands apply to 0.7.1 (the version I am still working with).
For 0.8 I think you can use 'nutch readdb' and 'nutch readlinkdb'. How to
get the Content by URL, I don't know, but it should be possible somehow in
0.8.

Rgrds, Thomas

On 3/27/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> Wojciech,
>
> > 1. list of crawled pages
>
> There's the 'nutch admin' command:
>
> java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs
> <namenode:port>) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
>
> Using '-textdump' will dump the contents of the WebDB to a text file.
>
> Then there is the 'nutch readdb' command that you can use to dump all
> pages or links:
>
> java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db>
> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] |
> [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] |
> [-dumplink url] | [-showlinks url] | [-showlinksdeep url] | [-stats] |
> [-detailstats]
>
> > 2. crawled content by URL (of course if page is crawled successfully)
> > How can we achieve this? I would appreciate if someone more proficient
> > would point us where to look.
>
> In 0.8 this should be possible. Did you check the methods in NutchBean or
> FetcherOutput?
>
> Rgrds, Thomas
>
> > We are using Nutch 0.8x with hadoop dfs on multiple machines
> >
> > Thanks in advance,
> > Wojtek
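P.S. For 0.7.1, a couple of concrete invocations based on the usage strings
quoted above; the 'db' and 'dump' paths are only placeholders and I haven't
run these exact lines, so double-check them against your install:

  # dump the whole WebDB to text files with the prefix 'dump'
  java org.apache.nutch.tools.WebDBAdminTool -local db -textdump dump

  # list all page URLs in the WebDB, and print overall stats
  java org.apache.nutch.db.WebDBReader -local db -dumppageurl
  java org.apache.nutch.db.WebDBReader -local db -stats

Through the wrapper script these should correspond to 'bin/nutch admin' and
'bin/nutch readdb' with the same arguments.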
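For the 0.8 layout (crawl/crawldb, crawl/linkdb, crawl/segments/*), the rough
equivalents would look something like the sketch below. The paths, segment
name and URL are only examples, and the option names are from memory, so run
each command without arguments to see its real usage:

  # list of crawled pages: overall counts, a full dump, or a single entry
  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  bin/nutch readdb crawl/crawldb -url http://www.example.com/

  # dump the link structure
  bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump

  # fetched content for one URL, read straight from a segment
  bin/nutch readseg -get crawl/segments/20060327123456 http://www.example.com/

Programmatically, NutchBean (or the SegmentReader class behind 'readseg')
would be the place to start for getting the Content by URL.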
