Re: Nutch dataset dirstructure

Enis Soztutar Fri, 30 Mar 2007 00:30:51 -0800

pike wrote:

Hi


I'm new to nutch.
Can anyone point me to some documentation about
the directory structure Nutch creates and maintains
when crawling, indexing etc ? We're doing "whole-web"
crawls step by step. Since I have no reference, it's
hard to see whether crawling, merging, indexing, etc
went ok.


thanks!
*-pike

Well, unfortunately there is not much documentation out there. But you should start by reading the articles at the Nutch wiki first. For the index structure you should seek help in the Lucene wiki, since Nutch uses Lucene as an inverted index. To look at the generated indexes you can use the Luke or lucli (command line) tools. lucli can be found in the contrib directory of Lucene.

Nutch stores the crawl state of the urls in the crawldb. The crawldb is an instance of Hadoop's MapFile, which is a sequence of <key,value> pairs. The keys in crawldb are urls and values are CrawlDatum objects. MapFile uses two SequenceFiles, one for storing the data, the other for indexing the data. You should check the javadocs of these classes for further info.
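
To make the data-plus-index layout a bit more concrete, here is a minimal in-memory sketch in Java. It is only a toy model of the idea, not Hadoop's MapFile API: the class ToyMapFile, the Row type (a stand-in for CrawlDatum that holds just a status string), and the indexInterval parameter are all made up for illustration. It assumes that keys are appended in sorted order and that the index records only every Nth key, so a lookup first finds the nearest indexed key and then scans forward through the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a sorted data file plus a sparse index. Not Hadoop's API. */
class ToyMapFile {
    /** Hypothetical stand-in for a <url, CrawlDatum> pair. */
    static final class Row {
        final String url;
        final String status;
        Row(String url, String status) { this.url = url; this.status = status; }
    }

    // Plays the role of the "data" file: all entries, in key order.
    private final List<Row> data = new ArrayList<>();
    // Plays the role of the "index" file: every Nth key -> position in data.
    private final TreeMap<String, Integer> index = new TreeMap<>();
    private final int indexInterval;

    ToyMapFile(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("indexInterval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    /** Appends an entry; keys must arrive in strictly ascending order. */
    void append(String url, String status) {
        if (!data.isEmpty() && data.get(data.size() - 1).url.compareTo(url) >= 0) {
            throw new IllegalArgumentException("keys must be appended in sorted order: " + url);
        }
        if (data.size() % indexInterval == 0) {
            index.put(url, data.size()); // remember where this key lives
        }
        data.add(new Row(url, status));
    }

    /** Returns the status stored for url, or null if the url is absent. */
    String get(String url) {
        Map.Entry<String, Integer> floor = index.floorEntry(url);
        if (floor == null) {
            return null; // url sorts before the first key
        }
        // Scan forward from the nearest indexed key at or before url.
        for (int i = floor.getValue(); i < data.size(); i++) {
            int cmp = data.get(i).url.compareTo(url);
            if (cmp == 0) return data.get(i).status;
            if (cmp > 0) break; // passed the spot where url would be
        }
        return null;
    }

    int indexSize() { return index.size(); }

    public static void main(String[] args) {
        ToyMapFile db = new ToyMapFile(2);
        db.append("http://a.example/", "fetched");
        db.append("http://b.example/", "unfetched");
        db.append("http://c.example/", "fetched");
        System.out.println(db.get("http://b.example/")); // unfetched
        System.out.println(db.get("http://z.example/")); // null
    }
}
```

The real classes read and write files on disk and work with Writable keys and values, so treat this purely as a picture of why the index file can stay small while lookups stay cheap.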


The linkdb is also stored as map files, mapping each url to an Inlinks object (the set of Inlink entries pointing at that url).

For further info, you should really browse the javadocs, and skim through the code to get a deeper understanding of the system.
