[ http://issues.apache.org/jira/browse/LUCENE-532?page=all ]
Otis Gospodnetic updated LUCENE-532:
------------------------------------
Attachment: (was: TermInfosWriter.java)
> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>
> Key: LUCENE-532
> URL: http://issues.apache.org/jira/browse/LUCENE-532
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 1.9
> Reporter: Igor Bolotin
> Priority: Minor
> Attachments: indexOnDFS.patch, SegmentTermEnum.patch,
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on
> the Hadoop distributed file system (DFS). When we tried to do it directly on
> DFS using the Nutch FsDirectory class, indexing failed immediately because the
> DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason
> for this behavior is clear: DFS does not support random updates, so seek()
> can't be supported (at least not easily).
>
> Well, if we can't support random updates, the question is: do we really need
> them? A search of the Lucene code revealed two places that call
> IndexOutput.seek(): one in TermInfosWriter and another in
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the
> only place that concerned us was TermInfosWriter.
>
> TermInfosWriter uses IndexOutput.seek() in its close() method to write the
> total number of terms back into the beginning of the file. It was very
> simple to change the file format slightly and write the number of terms into
> the last 8 bytes of the file instead. The only other place that has to change
> for this to work is the SegmentTermEnum constructor, which must read this
> value at position = file length - 8.
>
> With this format hack we were able to use FsDirectory to write an index
> directly to DFS without any problems. We still don't index directly to DFS,
> for performance reasons, but at least we can build small local indexes and
> merge them into the main index on DFS without copying the big main index
> back and forth.
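
The format change described in the quoted report (term count in the last 8 bytes instead of at the start of the file) can be illustrated with a small standalone sketch. This is not the attached patch and does not use Lucene's IndexOutput or the real term-infos file layout; the class name TrailerCountSketch, the length-prefixed record encoding, and the in-memory byte array standing in for a file are all assumptions made for illustration. The point it shows is that the writer only ever appends, so no seek() is needed, while the reader finds the count at position = file length - 8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: an append-only "file" whose record count is stored
// in an 8-byte trailer rather than patched into a header via seek().
public class TrailerCountSketch {
    static final int TRAILER_BYTES = 8; // size of a long

    /** Appends each term sequentially, then appends the term count last. */
    public static byte[] writeTerms(String[] terms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long count = 0;
        for (String term : terms) {
            byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
            // Assumed record layout: 4-byte length followed by UTF-8 bytes.
            append(out, ByteBuffer.allocate(4).putInt(utf8.length).array());
            append(out, utf8);
            count++;
        }
        // The count is only known at close time, so it is written at the end.
        append(out, ByteBuffer.allocate(TRAILER_BYTES).putLong(count).array());
        return out.toByteArray();
    }

    private static void append(ByteArrayOutputStream out, byte[] data) {
        out.write(data, 0, data.length);
    }

    /** Reads the term count from position = file length - 8. */
    public static long readTermCount(byte[] file) {
        if (file.length < TRAILER_BYTES) {
            throw new IllegalArgumentException("file too short to hold a trailer");
        }
        int trailerStart = file.length - TRAILER_BYTES;
        return ByteBuffer.wrap(file, trailerStart, TRAILER_BYTES).getLong();
    }

    public static void main(String[] args) {
        byte[] file = writeTerms(new String[] {"apache", "hadoop", "lucene"});
        System.out.println(readTermCount(file)); // prints 3
    }
}
```

The trade-off is that a reader must know the file length before it can learn the term count, which is cheap on a filesystem that reports lengths but does change the on-disk format for existing readers.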