Cool Coder wrote:
Hello, I am just wondering how I can read the CrawlDb and get the content of each stored URL. I am not sure whether this is possible or not.
In Nutch 0.8 and later, page information and link information are stored separately, in the CrawlDb and the LinkDb. You first need to build the linkdb (see the bin/nutch invertlinks command); then you can use the LinkDbReader class to retrieve this information. From the command line this is bin/nutch readlinkdb.
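As a rough sketch of the steps above, assuming a Nutch 0.8+ crawl directory laid out as crawl/crawldb, crawl/linkdb, and crawl/segments (the paths are illustrative, not required names):

```shell
# Build the LinkDb by inverting the links found in all fetched segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Look up the inlinks recorded for a single URL (uses LinkDbReader)
bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/

# Or dump the whole LinkDb as plain text for inspection
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

# The CrawlDb itself (per-URL fetch status, score, metadata) can be
# dumped the same way with the readdb command
bin/nutch readdb crawl/crawldb -dump crawldb_dump
```

Note that the CrawlDb holds per-URL status and metadata rather than the fetched page content; the content itself lives in the segments.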
-- Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com | Contact: info at sigram dot com
