Hey,
Well, one approach would be to merge all the segments into a single
segment and then do the lookup against that. However, I'm not sure what
the pros and cons of this method are! Any insights would be welcome.
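Roughly, the merge-then-lookup idea could look like this (a sketch only; it assumes a standard Nutch install with crawl data under crawl/, and all paths and the example URL are placeholders):

```shell
# Merge every segment into one output directory
# (mergesegs creates a new timestamped segment inside it).
bin/nutch mergesegs crawl/merged -dir crawl/segments

# Then pull the stored data for a single URL out of the merged segment.
# Replace <merged-segment-dir> with the directory mergesegs created.
bin/nutch readseg -get crawl/merged/<merged-segment-dir> http://example.com/page.html
```

The obvious trade-off is that merging copies all segment data, which can be slow and disk-hungry for a large crawl.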
Cheers
Viksit
Otis Gospodnetic wrote:
I don't think that's doable, as I *think* CrawlDb doesn't know which segment
the URL is in (or does it? Not looking at the code now, sorry).
But, knowing the segment, you should be able to pull the web page data out.
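If the segment isn't known, one brute-force workaround is simply to try each segment in turn (a hedged sketch; paths and the URL are placeholders, and the loop's break condition assumes readseg exits non-zero when the URL isn't found, which may need adjusting):

```shell
URL="http://example.com/page.html"

# Try each segment directory until one contains the URL.
for seg in crawl/segments/*; do
  bin/nutch readseg -get "$seg" "$URL" && break
done
```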
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Viksit Gaur <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, June 12, 2008 2:22:09 AM
Subject: Retrieving data for a particular URL from crawldb?
Hi all,
Is there a way to retrieve a particular page from the Nutch crawl using
its URL as a key? Since I don't know which segment directory this page
was put into, I can't use nutch readseg. I can read the crawldb for the
URL, but that only gives stats about it, not its contents.
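(For reference, a crawldb lookup for a single URL looks roughly like this; the crawldb path and URL are placeholders:)

```shell
# Prints the CrawlDatum (fetch status, score, metadata) for one URL --
# but not the page content itself, which lives in the segments.
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```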
Any ideas on the best way to do this?
Thanks,
Viksit