Thanks, that's exactly what I was thinking.  Do you have any recommendations
on maximum index size (obviously we'd be testing ourselves, but it's good to
get an idea)?

Tim

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 02, 2006 7:34 PM
To: [email protected]
Subject: Re: Question about Index Writing/Merging


Tim Patton wrote:
> I'm working on a project that uses pieces of Nutch to store a Lucene index
> in Hadoop (basically I am using the FsDirectory and related classes). When
> trying to write to an index I got an unsupported exception since FsDirectory
> doesn't support "seek" which Lucene uses on closing an IndexWriter, the file
> system is write-once.  After looking through the Nutch code I saw that an
> index is worked on locally, either with writing or merging, then transferred
> into the dfs when finished.  I just was checking to make sure I understood
> this correctly.

Yes, this is correct.
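
A minimal sketch of that workflow, under stated assumptions: the index is a flat directory of files that has already been written and closed on local disk, and the write-once store is stood in for by an ordinary directory. The class and method names (LocalThenPublish, publish) are hypothetical and are not Nutch or Hadoop APIs. Files.copy is called without REPLACE_EXISTING, so a second copy over the same files fails, which roughly mimics write-once behavior.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: not part of Nutch, Lucene, or Hadoop.
class LocalThenPublish {

    /**
     * Copies every regular file of a finished local index into storeDir.
     * Existing files are never overwritten, so each file is written once.
     * Returns the number of files copied.
     */
    static int publish(Path localIndexDir, Path storeDir) {
        try {
            Files.createDirectories(storeDir);
            int copied = 0;
            try (DirectoryStream<Path> files = Files.newDirectoryStream(localIndexDir)) {
                for (Path file : files) {
                    if (Files.isRegularFile(file)) {
                        // No REPLACE_EXISTING: throws if the target already exists.
                        Files.copy(file, storeDir.resolve(file.getFileName().toString()));
                        copied++;
                    }
                }
            }
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

All seeking and rewriting happens on the local copy; the store only ever sees sequential writes of complete files.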

> If I was to work on a multi-gigabyte index I would need
> that much free space on my local drive to transfer the index to and it would
> take a while to copy each way.  How does this work for the really huge
> indexes people want to build with Nutch?  Would there be many smaller Lucene
> indexes in the dfs, since obviously one huge terabyte index couldn't be
> downloaded?  I'm just trying to have a better understanding of how Nutch
> works.

Terabyte indexes aren't actually very useful, since they take too long 
to search.  So with big collections (>100M pages) one will keep multiple 
indexes and use distributed search to search them all in parallel.
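
As a rough illustration of searching several indexes in parallel, here is a scatter-gather sketch: each smaller index is queried for its own top hits, and the partial results are merged into one ranked list. Everything here is an assumption for illustration. Shard, Hit, and searchAll are hypothetical names, the shards run as threads in one JVM rather than on separate machines, and this is not Nutch's distributed search code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical names throughout; a sketch, not Nutch's implementation.
class ShardedSearch {

    /** One scored hit; docId is only meaningful together with its shard. */
    static final class Hit {
        final int shard;
        final int docId;
        final double score;

        Hit(int shard, int docId, double score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Stand-in for a searcher over one of the smaller indexes. */
    interface Shard {
        List<Hit> search(String query, int topK);
    }

    /** Queries every shard in parallel and keeps the topK best hits overall. */
    static List<Hit> searchAll(List<Shard> shards, String query, int topK) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard s : shards) {
                tasks.add(() -> s.search(query, topK));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            // Highest score first, then cut to the requested size.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topK, merged.size())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a shard search failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each shard only needs to return its own topK hits, since no hit outside a shard's local top K can appear in the global top K. The merge assumes scores are comparable across shards, which a real deployment would have to arrange.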

Doug



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general