I would like to see lucene operate with hadoop As you rightly pointed out, writing using FSDirectory to DFS would be a performance issue.
I am interested in the idea. But i do not know how much time i can contribute to this because of the little time which i can spare. If anyone else is interested, can they join ? We can work on this together Rgds Prabhu On 3/26/06, Igor Bolotin <[EMAIL PROTECTED]> wrote: > > In my current project we needed a way to create very large Lucene indexes > on > Hadoop distributed file system. When we tried to do it directly on DFS > using > Nutch FsDirectory class - we immediately found that indexing fails because > DfsIndexOutput.seek() method throws UnsupportedOperationException. The > reason for this behavior is clear - DFS does not support random updates > and > so seek() method can't be supported (at least not easily). > > Well, if we can't support random updates - the question is: do we really > need them? Search in the Lucene code revealed 2 places which call > IndexOutput.seek() method: one is in TermInfosWriter and another one in > CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the > only place that concerned us was in TermInfosWriter. > > TermInfosWriter uses IndexOutput.seek() in its close() method to write > total > number of terms in the file back into the beginning of the file. It was > very > simple to change file format a little bit and write number of terms into > last 8 bytes of the file instead of writing them into beginning of file. > The > only other place that should be fixed in order for this to work is in > SegmentTermEnum constructor - to read this piece of information at > position > = file length - 8. > > With this format hack - we were able to use FsDirectory to write index > directly to DFS without any problems. Well - we still don't index directly > to DFS for performance reasons, but at least we can build small local > indexes and merge them into the main index on DFS without copying big main > index back and forth. > > If somebody is interested - I can post our changes in TermInfosWriter and > SegmentTermEnum code, although they are pretty trivial. > > Best regards! > Igor > >