Wojciech,

1. list of crawled pages
There's the 'nutch admin' command:

  java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs <namenode:port>)
      db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]

Using '-textdump' will dump the contents of the WebDB to a text file.

Then there is the 'nutch readdb' command that you can use to dump all pages
or links:

  java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db>
      [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] |
      [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] |
      [-dumplink url] | [-showlinks url] | [-showlinksdeep url] | [-stats] |
      [-detailstats]

2. crawled content by URL (of course if page is crawled successfully)

> How can we achieve this? I would appreciate if someone more proficient
> would point us where to look.
>
> We are using Nutch 0.8x with hadoop dfs on multiple machines
>
> Thanks in advance,
> Wojtek

In 0.8 this should be possible. Did you check the methods in NutchBean or
FetcherOutput?

Rgrds,
Thomas
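PS: for example, assuming your WebDB lives under crawl/db (adjust the path
to wherever your crawl put it), the two dumps would look like:

  bin/nutch admin crawl/db -textdump dump
  bin/nutch readdb crawl/db -dumppageurl

The first writes the WebDB contents to text files with the prefix 'dump';
the second dumps every page record in the db.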

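PPS: for 2., here is a rough, untested sketch of pulling the cached content
for a URL through NutchBean under 0.8. The class name DumpContent is just
for illustration, and I'm assuming 'searcher.dir' in your config points at
the crawl directory and that url: queries are enabled in your query plugins:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

// Illustrative sketch: look a page up by URL and print its cached content.
public class DumpContent {
  public static void main(String[] args) throws Exception {
    String url = args[0];
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);  // opens index + segments under searcher.dir

    // Find the page by its url field (assumes the url query filter is enabled).
    Hits hits = bean.search(Query.parse("url:" + url, conf), 1);
    if (hits.getTotal() == 0) {
      System.err.println("no hit for " + url);
      return;
    }
    HitDetails details = bean.getDetails(hits.getHit(0));

    // Raw fetched bytes, same call the cached-page view uses.
    byte[] content = bean.getContent(details);
    System.out.write(content);
    System.out.flush();
  }
}

If you haven't built an index yet, the raw Content records also live in each
segment's content/ directory, but the bean is the quickest route once the
pages are indexed.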