That's a good point. I think the Lucene index may be the only place that information is stored. If you really needed it, you could build your own mapping of URL to segment. However, I am not that familiar with Nutch, so I will let someone with more experience answer this.
On 10/15/06, shjiang <[EMAIL PROTECTED]> wrote:
Since version 0.8, "bin/nutch segread" has been replaced by "bin/nutch readseg". I tried the command:

bin/nutch readseg -get ./crawl/segments/20061013144233/ http://www.nokia.com.cn/

and I can get the entire content of the URL. The problem is that there are several segment directories under ./crawl/segments/, so how can I know which segment holds the content of a specified URL?

> You could do it from the command line using bin/nutch segread, or you
> could do it in Java by opening map file readers on the directories
> called "content" found in each segment.
>
> On 10/15/06, shjiang <[EMAIL PROTECTED]> wrote:
>> I cannot find any API that supports reading the content of
>> a specified URL from the crawldb.
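One brute-force way to find which segment holds a URL is simply to ask each segment in turn and see which one returns content. This is only a sketch under the crawl layout shown above; it assumes that "readseg -get" prints nothing for a URL the segment does not contain, which may vary by Nutch version:

```shell
# Hypothetical sketch: probe every segment directory for the URL.
# Assumes ./crawl/segments/* layout and that readseg -get produces
# empty output when the URL is absent from a segment.
URL=http://www.nokia.com.cn/
for seg in ./crawl/segments/*/; do
  out=$(bin/nutch readseg -get "$seg" "$URL" 2>/dev/null)
  if [ -n "$out" ]; then
    echo "URL found in segment: $seg"
  fi
done
```

For repeated lookups this is slow, since it scans every segment each time; building a one-time URL-to-segment mapping (as suggested above) would amortize the cost.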
