Hey,
Well, one approach would be to merge all the segments into a single
segment and then do the lookup against that. However, I'm not sure what
the pros and cons of this method are! Any insights would be welcome.
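Roughly, the merge-then-lookup idea could look like this (a sketch only; it assumes a standard Nutch install with crawl data under crawl/, and all paths and the example URL are placeholders):

```shell
# Merge every segment into one output directory
# (mergesegs creates a new timestamped segment inside it).
bin/nutch mergesegs crawl/merged -dir crawl/segments

# Then pull the stored data for a single URL out of the merged segment.
# Replace <merged-segment-dir> with the directory mergesegs created.
bin/nutch readseg -get crawl/merged/<merged-segment-dir> http://example.com/page.html
```

The obvious trade-off is that merging copies all segment data, which can be slow and disk-hungry for a large crawl.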
Cheers
Viksit
Otis Gospodnetic wrote:
I don't think that's doable, as I *think* CrawlDb doesn't know which segment
the URL is in (or does it? Not looking at the code now, sorry).
But, knowing the segment, you should be able to pull the web page data out.
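If the segment isn't known, one brute-force workaround is simply to try each segment in turn (a hedged sketch; paths and the URL are placeholders, and the loop's break condition assumes readseg exits non-zero when the URL isn't found, which may need adjusting):

```shell
URL="http://example.com/page.html"

# Try each segment directory until one contains the URL.
for seg in crawl/segments/*; do
  bin/nutch readseg -get "$seg" "$URL" && break
done
```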
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Viksit Gaur <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, June 12, 2008 2:22:09 AM
Subject: Retrieving data for a particular URL from crawldb?
Hi all,
Is there a way to retrieve a particular page from the Nutch crawl using
its URL as a key? Since I don't know which segment directory this page
was put into, I can't use nutch readseg. I can read the crawldb for the
URL, but that only gives stats about it, not its contents.
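(For reference, a crawldb lookup for a single URL looks roughly like this; the crawldb path and URL are placeholders:)

```shell
# Prints the CrawlDatum (fetch status, score, metadata) for one URL --
# but not the page content itself, which lives in the segments.
bin/nutch readdb crawl/crawldb -url http://example.com/page.html
```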
Any ideas on the best way to do this?
Thanks,
Viksit