I'm working on a project that uses pieces of Nutch to store a Lucene index
in Hadoop (basically I'm using FsDirectory and its related classes).  When
I tried to write to an index I got an UnsupportedOperationException, because
FsDirectory doesn't support seek(), which Lucene uses when closing an
IndexWriter; the underlying file system is write-once.
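
For reference, here's roughly what I was doing when it blew up.  I'm
reconstructing this from memory, so treat the FsDirectory package and
constructor arguments (fs, path, create, conf) as approximate:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.indexer.FsDirectory;

public class FsDirectoryWriteTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Lucene Directory backed by the DFS (create = true).
    FsDirectory dir = new FsDirectory(fs, new Path("/index"), true, conf);

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("content", "hello world",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();  // fails here: closing the writer triggers a seek()
                     // on the output, which the write-once DFS can't do
  }
}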
After looking through the Nutch code, I saw that an index is worked on
locally, whether it's being written or merged, and then transferred into
the DFS when finished.  I just wanted to check that I understood this
correctly.
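
In other words, the flow I think I'm seeing is something like the sketch
below.  The paths are made up; the copy step is just Hadoop's
FileSystem.copyFromLocalFile:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class LocalThenCopy {
  public static void main(String[] args) throws IOException {
    // 1. Build (or merge) the index on the local disk, where seek() works.
    String local = "/tmp/index-build";
    IndexWriter writer = new IndexWriter(local, new StandardAnalyzer(), true);
    // ... addDocument() / addIndexes() calls go here ...
    writer.optimize();
    writer.close();

    // 2. Copy the finished index into the DFS in one shot
    //    (destination path invented).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(local), new Path("/indexes/part-00000"));
  }
}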
If I were to work on a multi-gigabyte index, I would need that much free
space on my local drive to hold it, and copying it each way would take a
while.  How does this work for the really huge indexes people want to build
with Nutch?  Would there be many smaller Lucene indexes in the DFS, since
one huge terabyte index obviously couldn't be downloaded?  I'm just trying
to get a better understanding of how Nutch works.
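
If the many-small-indexes guess is right, I'd picture the search side
opening each shard read-only through FsDirectory (reads, including
read-side seeks, seem fine against the DFS) and fanning out with Lucene's
MultiSearcher.  Something like this, with the shard paths invented:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.nutch.indexer.FsDirectory;

public class ShardedSearch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One searcher per small index living in the DFS.
    Path[] shards = { new Path("/indexes/part-00000"),
                      new Path("/indexes/part-00001") };
    Searchable[] searchers = new Searchable[shards.length];
    for (int i = 0; i < shards.length; i++) {
      searchers[i] = new IndexSearcher(
          new FsDirectory(fs, shards[i], false, conf));
    }
    MultiSearcher searcher = new MultiSearcher(searchers);
    // searcher.search(query) would then fan out across all the shards
  }
}

Is that roughly the idea?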

Thanks,

Tim
