Thanks for the information. I tried ./bin/nutch readlinkdb, but I was not
able to get all the links. I think I am missing something about the proper
usage pattern of the readlinkdb option.
I tried:
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
-dump <out_dir> dump whole link db to a text file in <out_dir>
-url <url> print information about <url> to System.out
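From the usage message it looks like I should be running something like the
following (linkdb-dump is just a name I made up for the output directory):
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/ -dump ./linkdb-dump
$ cat ./linkdb-dump/part-00000   # I assume the dump is written as plain-text part-* files
or, for a single URL:
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/ -url http://www.example.com/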
Just so you know, nutch-index is the location of the Nutch index, and it has
the following directories:
--crawldb
--index
--indexes
--linkdb
--segments
Can you tell me what usage pattern I should use to view all the links?
- RB
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Cool Coder wrote:
> Hello, I am just wondering how I can read the crawldb and get the content
> of each stored URL. I am not sure whether this is possible or not.
In Nutch 0.8 and later the page information and link information are
stored separately, in CrawlDb and LinkDb. You need to have the linkdb
(see the bin/nutch invertlinks command), and then you can use the
LinkDbReader class to retrieve this information. From the command line
this is bin/nutch readlinkdb.
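For example, something along these lines (the crawl/ paths are just
placeholders for your own directories, and the exact readdb options may
vary slightly between 0.8.x versions):
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # build the LinkDb from the fetched segments
$ bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump      # dump all inlink information as text
$ bin/nutch readdb crawl/crawldb -dump crawldb-dump        # dump the per-URL status records from CrawlDb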
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com