Wojtek,

those commands apply to 0.7.1 (the version I am still working with).

For 0.8 I think you can use 'nutch readdb' and 'nutch readlinkdb'.
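
If that is right, something like this should give you the list (I have
not double-checked the exact options on 0.8, and the paths are only
examples):

   bin/nutch readdb crawl/crawldb -stats
   bin/nutch readdb crawl/crawldb -dump crawldb-dump
   bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump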

I don't know how to get the content by URL, but it should be possible
somehow in 0.8.
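
If I had to guess, the segment reader might do it (completely untested
on my side, and the segment name below is made up):

   bin/nutch readseg -get crawl/segments/20060327121212 http://www.example.com/

That should print whatever was stored for that URL in the segment,
content included. Otherwise see the NutchBean hint in the quoted mail
below.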

Rgrds, Thomas



On 3/27/06, TDLN <[EMAIL PROTECTED]> wrote:
>
> Wojciech,
>
> 1. list of crawled pages
>
>
> There's the 'nutch admin' command:
>
> java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs
> <namenode:port>) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
>
> Using '-textdump' will dump the contents of the WebDB to a text file.
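>
> For example, to dump a local WebDB in ./db to a text dump prefixed
> 'dump' (from memory, so double-check the options against the usage
> above):
>
>    bin/nutch admin db -textdump dump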
>
> Then there is the 'nutch readdb' command that you can use to dump all
> pages or links:
>
> java org.apache.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db>
> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] |
> [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] |
> [-dumplink url] | [-showlinks url] | [-showlinksdeep url] | [-stats] |
> [-detailstats]
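>
> For example, to dump every page URL in the WebDB, or to look at one
> page record (the options are straight from the usage above, though I
> have only used a couple of them myself):
>
>    bin/nutch readdb db -dumppageurl
>    bin/nutch readdb db -pageurl http://www.example.com/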
>
> 2. crawled content by URL (of course if page is crawled successfully)
> > How can we achieve this? I would appreciate if someone more proficient
> > would point us where to look.
>
>
> In 0.8 this should be possible. Did you check the methods in NutchBean or
> FetcherOutput?
>
> Rgrds, Thomas
>
> > We are using Nutch 0.8x with hadoop dfs on multiple machines
> >
> > Thanks in advance,
> > Wojtek
> >
> >
>
