Hey,

Well, one approach would be to merge all the segments into a single segment, and run the lookup against that. However, I'm not sure what the pros and cons of this method are! Any insights would be welcome.
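A rough sketch of that merge idea with the stock Nutch command-line tools (the crawl/ paths and directory names below are illustrative assumptions, not from the thread):

```shell
# Merge all existing segments into one output directory.
# mergesegs writes a single new segment (with a fresh
# timestamp name) under crawl/MERGEDsegments.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
```

With only one segment left, there is no longer any ambiguity about which segment directory to hand to the per-segment reader for a URL lookup. The trade-off to weigh is that merging can be expensive for large crawls and has to be redone after each new fetch cycle.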

Cheers
Viksit

Otis Gospodnetic wrote:
I don't think that's doable, as I *think* CrawlDb doesn't know which segment 
the URL is in (or does it?  Not looking at the code now, sorry).


But, knowing the segment you should be able to pull the web page data out.
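For a known segment, the per-URL lookup Otis describes can be done with readseg's -get mode (the segment name and URL here are made-up placeholders):

```shell
# Print everything stored in the segment for one URL key:
# Content, CrawlDatum, ParseData and ParseText records.
bin/nutch readseg -get crawl/segments/20080612010203 \
    http://www.example.com/page.html
```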

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Viksit Gaur <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, June 12, 2008 2:22:09 AM
Subject: Retrieving data for a particular URL from crawldb?

Hi all,

Is there a way to retrieve a particular page from a Nutch crawl, using the URL as a key? Since I don't know which segment directory this page was put into, I can't use nutch readseg directly. Reading the crawldb only gives stats about the URL, not its contents.

Any ideas on the best way to do this?
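One brute-force option, sketched with the standard readseg tool (the crawl/segments path is an assumption): enumerate the segments, then probe each candidate for the URL until it turns up.

```shell
# List every segment with its record counts and timestamps;
# the page must live in one of them, so in the worst case
# each segment can be queried in turn with readseg -get.
bin/nutch readseg -list -dir crawl/segments
```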

Thanks,
Viksit

