: This actually looks like a good patch that doesn't break any tests.
: I'll commit it in the coming days, as it looks like it should be
: backwards compatible... except CFS won't be supported unless somebody
: patches that, too (I tried quickly and soon got unit tests to fail :( ).

The one thing the unit tests can't verify is that the version checking
is working (i.e., that the new code will still read old indexes
properly).  The patch looks fairly straightforward to me, so I don't
think that will be a problem -- but you may want to try searching an
index built with 2.0 after you apply the patch, just as a sanity
check.
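
Something as simple as this would do it -- a quick sketch against the
stock 2.0 search API (the path and the field/term names are made up,
obviously):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SanityCheck {
      public static void main(String[] args) throws Exception {
        // Open an index that was built with stock Lucene 2.0,
        // *before* the patch was applied.
        IndexSearcher searcher = new IndexSearcher("/path/to/2.0-index");
        // Any term you know is in the old index will do here.
        Hits hits = searcher.search(
            new TermQuery(new Term("contents", "hadoop")));
        System.out.println("hits: " + hits.length());
        searcher.close();
      }
    }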

This will necessitate "Lucene 2.1" being the next version released
from the trunk because of the file format change, correct? ... I have
no objection to that; I just want to verify my understanding of the
file format versioning.  (I've sketched my reading of the new
term-count layout below, after the quoted report.)



:
: > [PATCH] Indexing on Hadoop distributed file system
: > --------------------------------------------------
: >
: >                 Key: LUCENE-532
: >                 URL: http://issues.apache.org/jira/browse/LUCENE-532
: >             Project: Lucene - Java
: >          Issue Type: Improvement
: >          Components: Index
: >    Affects Versions: 1.9
: >            Reporter: Igor Bolotin
: >            Priority: Minor
: >         Attachments: indexOnDFS.patch, SegmentTermEnum.patch,
: >                      TermInfosWriter.patch
: >
: >
: > In my current project we needed a way to create very large Lucene
: > indexes on the Hadoop distributed file system. When we tried to do
: > it directly on DFS using the Nutch FsDirectory class, we
: > immediately found that indexing fails because the
: > DfsIndexOutput.seek() method throws UnsupportedOperationException.
: > The reason for this behavior is clear: DFS does not support random
: > updates, so the seek() method can't be supported (at least not
: > easily).
: >
: > Well, if we can't support random updates, the question is: do we
: > really need them? A search of the Lucene code revealed two places
: > that call the IndexOutput.seek() method: one in TermInfosWriter and
: > another in CompoundFileWriter. As we weren't planning to use
: > CompoundFileWriter, the only place that concerned us was
: > TermInfosWriter.
: >
: > TermInfosWriter uses IndexOutput.seek() in its close() method to
: > write the total number of terms back into the beginning of the
: > file. It was very simple to change the file format a little and
: > write the number of terms into the last 8 bytes of the file
: > instead. The only other place that has to change for this to work
: > is the SegmentTermEnum constructor, which must read this piece of
: > information at position = file length - 8.
: >
: > With this format hack we were able to use FsDirectory to write an
: > index directly to DFS without any problems. Well, we still don't
: > index directly to DFS for performance reasons, but at least we can
: > build small local indexes and merge them into the main index on DFS
: > without copying the big main index back and forth.
:
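For anyone skimming the patch: my reading of the trailing-count trick
described above is roughly the following.  This is just a sketch
against the generic IndexOutput/IndexInput API to show the idea -- the
class and method names are made up, it is not the actual patch:

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    class TrailingCountSketch {
      // Old TermInfosWriter.close() behavior (needs random access):
      //   output.seek(...);       // jump back to the header
      //   output.writeLong(size);

      // New behavior: append the count as the last 8 bytes, so the
      // file is written strictly front-to-back -- no output seek.
      static void writeCount(IndexOutput output, long size)
          throws IOException {
        output.writeLong(size);  // lands at the very end of the file
        output.close();
      }

      // Reader side (what SegmentTermEnum's constructor would do).
      // Seeking on *input* is still fine on DFS; only seeks on
      // output streams are unsupported.
      static long readCount(IndexInput input) throws IOException {
        input.seek(input.length() - 8);
        return input.readLong();
      }
    }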
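And the build-locally-then-merge workflow from the last quoted
paragraph presumably boils down to IndexWriter.addIndexes() with the
DFS directory as the destination.  Another rough sketch -- how you get
a Directory for DFS depends on your Nutch/Hadoop version (Igor used
Nutch's FsDirectory), and the local paths are made up:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    class MergeToDfs {
      // dfsDir: a Directory backed by DFS (e.g. Nutch's FsDirectory,
      // wrapping the main index); construction is left out here.
      static void mergeLocalIndexes(Directory dfsDir, String[] localPaths)
          throws IOException {
        IndexWriter writer =
            new IndexWriter(dfsDir, new StandardAnalyzer(), false);
        Directory[] locals = new Directory[localPaths.length];
        for (int i = 0; i < localPaths.length; i++) {
          locals[i] = FSDirectory.getDirectory(localPaths[i], false);
        }
        // Merges the small local indexes into the main index in
        // place on DFS, without copying the big index locally.
        writer.addIndexes(locals);
        writer.close();
      }
    }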



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
