Tim Patton wrote:
I'm working on a project that uses pieces of Nutch to store a Lucene index in Hadoop (basically I'm using FsDirectory and related classes). When trying to write to an index I got an unsupported-operation exception, since FsDirectory doesn't support "seek" (which Lucene uses when closing an IndexWriter); the file system is write-once. After looking through the Nutch code I saw that an index is built or merged locally and then transferred into the DFS when finished. I just wanted to check that I understood this correctly.
Yes, this is correct.
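A rough sketch of the build-locally-then-copy pattern described above, using only plain `java.nio` as a stand-in for the real Lucene and Hadoop APIs (the class name, directory layout, and `copyIndex` helper are all made up for illustration):

```java
import java.io.IOException;
import java.nio.file.*;

public class LocalThenCopy {
    // Hypothetical sketch: the index is built in a local directory,
    // then each finished file is copied into the (write-once) DFS
    // in a single sequential pass -- no seeks are ever needed.
    static void copyIndex(Path localIndex, Path dfsIndex) throws IOException {
        Files.createDirectories(dfsIndex);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(localIndex)) {
            for (Path f : files) {
                // Sequential write of a complete file: the only operation
                // a write-once filesystem supports.
                Files.copy(f, dfsIndex.resolve(f.getFileName()));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Local temp dirs stand in for the local disk and the DFS.
        Path local = Files.createTempDirectory("local-index");
        Path dfs = Files.createTempDirectory("dfs").resolve("index");
        Files.writeString(local.resolve("segments"), "seg data");
        copyIndex(local, dfs);
        System.out.println(Files.readString(dfs.resolve("segments")));
    }
}
```

The point of the pattern is that Lucene gets the random-access local filesystem it needs while writing, and the DFS only ever sees complete, immutable files.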
If I were to work on a multi-gigabyte index I would need that much free space on my local drive to stage the index, and it would take a while to copy each way. How does this work for the really huge indexes people want to build with Nutch? Would there be many smaller Lucene indexes in the DFS, since obviously one huge terabyte index couldn't be downloaded? I'm just trying to get a better understanding of how Nutch works.
Terabyte indexes aren't actually very useful, since they take too long to search. So with big collections (>100M pages) one will keep multiple indexes and use distributed search to search them all in parallel.
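The distributed-search idea above can be sketched in a few lines: each shard (one of the smaller per-machine indexes) answers the query independently, and a frontend merges the per-shard hits by score. This is a simplified illustration, not Nutch's actual search code; the `Hit` record and the in-memory "shards" are invented for the example:

```java
import java.util.*;
import java.util.concurrent.*;

public class ShardSearch {
    // Hypothetical hit type: a document id plus a relevance score.
    record Hit(String doc, double score) {}

    // Stand-in for a per-shard Lucene search; here each shard just
    // returns its precomputed hit list.
    static List<Hit> searchShard(List<Hit> shard, String query) {
        return shard;
    }

    // Query every shard in parallel, then merge and keep the top k.
    static List<Hit> distributedSearch(List<List<Hit>> shards, String query, int k)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<List<Hit>>> futures = new ArrayList<>();
        for (List<Hit> s : shards)
            futures.add(pool.submit(() -> searchShard(s, query)));
        List<Hit> all = new ArrayList<>();
        for (Future<List<Hit>> f : futures)
            all.addAll(f.get());
        pool.shutdown();
        // Merge by descending score, as a frontend would.
        all.sort((a, b) -> Double.compare(b.score(), a.score()));
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) throws Exception {
        List<List<Hit>> shards = List.of(
            List.of(new Hit("a", 0.9), new Hit("b", 0.4)),
            List.of(new Hit("c", 0.7)));
        for (Hit h : distributedSearch(shards, "query", 2))
            System.out.println(h.doc() + " " + h.score());
    }
}
```

Because each shard is searched in parallel, query latency is roughly that of the largest shard rather than of the whole collection, which is why many small indexes beat one terabyte index.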
Doug

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
