Thanks for the information. I tried ./bin/nutch readlinkdb, but I was not
able to get all the links. I think I am missing something about the proper
usage pattern of the readlinkdb option.
I tried:
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
-dump <out_dir> dump whole link db to a text file in <out_dir>
-url <url> print information about <url> to System.out
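From the usage message it looks like I should be running something like the
following (linkdb-dump is just a name I made up for the output directory):
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/ -dump ./linkdb-dump
$ cat ./linkdb-dump/part-00000   # I assume the dump is written as plain-text part-* files
or, for a single URL:
$ ./bin/nutch readlinkdb ./nutch-index/linkdb/ -url http://www.example.com/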
Just so you know, nutch-index is the location of the Nutch index, and it has
the following directories:
--crawldb
--index
--indexes
--linkdb
--segments
Can you tell me what usage pattern I should use to view all the links?
- RB
Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
Cool Coder wrote:
> Hello, I am just wondering how I can read the crawldb and get the content
> of each stored URL. I am not sure whether this is possible or not.
In Nutch 0.8 and later the page information and link information are
stored separately, in CrawlDb and LinkDb. You need to have the linkdb
(see the bin/nutch invertlinks command), and then you can use the
LinkDbReader class to retrieve this information. From the command line
this is bin/nutch readlinkdb.
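For example, something along these lines (the crawl/ paths are just
placeholders for your own directories, and the exact readdb options may
vary slightly between 0.8.x versions):
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # build the LinkDb from the fetched segments
$ bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump      # dump all inlink information as text
$ bin/nutch readdb crawl/crawldb -dump crawldb-dump        # dump the per-URL status records from CrawlDb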
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com