[ http://issues.apache.org/jira/browse/LUCENE-532?page=comments#action_12448989 ]
Michael McCandless commented on LUCENE-532:
-------------------------------------------

Alas, in trying to change the CFS format so that the file offsets are stored
at the end of the file, I discovered while implementing the corresponding
changes to CompoundFileReader that this approach isn't viable.  I had been
thinking the reader would take the file length, subtract
numEntries*sizeof(long), seek to that point, and then read the offsets
(longs).  The problem is that we can't know sizeof(long), since it depends on
the actual storage implementation, for the same reasons given above; that is,
we can't assume that one byte always equals one file position.
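For illustration, here is a minimal sketch of the reader-side logic I had in
mind, written against a plain RandomAccessFile (hypothetical code, not the
actual CompoundFileReader); the fixed 8-bytes-per-long stride in the seek
computation is exactly the assumption that does not hold for an arbitrary
Directory implementation:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical sketch only: read back a table of numEntries file offsets
    // that the writer appended at the very end of the compound file.
    class OffsetTableAtEndSketch {
      static long[] readOffsets(RandomAccessFile file, int numEntries) throws IOException {
        // ASSUMPTION: each stored long occupies exactly 8 bytes of file
        // position.  That only holds when one byte == one file position; a
        // Directory implementation is free to encode values differently, so
        // this seek target cannot be computed in general.
        long tableStart = file.length() - (long) numEntries * 8;
        file.seek(tableStart);
        long[] offsets = new long[numEntries];
        for (int i = 0; i < numEntries; i++) {
          offsets[i] = file.readLong();
        }
        return offsets;
      }
    }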

So the only solution I can think of (to avoid seeking during write) would be
to write, for each *.cfs file, a separate file that contains the file offsets
for that cfs file.  E.g., if we have _1.cfs we would also have _1.cfsx, which
holds the file offsets.  This is somewhat costly if we care about the number
of files: it doubles the file count in the simple case of a bunch of segments
with no deletions or separate norms.
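Roughly, the writer and reader side of that sidecar idea might look like the
sketch below (hypothetical names, not a patch against CompoundFileWriter or
CompoundFileReader).  Both files are written strictly sequentially, and the
offsets file is read sequentially from the start, so neither side needs to
know how many file positions a long occupies:

    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Hypothetical sketch of the sidecar idea: the data file (e.g. "_1.cfs")
    // is written purely sequentially, and the table of file offsets goes into
    // a second file (e.g. "_1.cfsx"), also written sequentially, so no seek()
    // is ever needed on the write path.
    class CompoundSidecarSketch {

      static void writeOffsets(Directory dir, String cfsName, long[] offsets)
          throws IOException {
        IndexOutput out = dir.createOutput(cfsName + "x");   // "_1.cfs" -> "_1.cfsx"
        try {
          out.writeInt(offsets.length);          // entry count first
          for (int i = 0; i < offsets.length; i++) {
            out.writeLong(offsets[i]);           // offset of each sub-file in the .cfs
          }
        } finally {
          out.close();
        }
      }

      static long[] readOffsets(Directory dir, String cfsName) throws IOException {
        IndexInput in = dir.openInput(cfsName + "x");
        try {
          int count = in.readInt();              // read sequentially from the start,
          long[] offsets = new long[count];      // so no assumption about how many
          for (int i = 0; i < count; i++) {      // file positions each value occupies
            offsets[i] = in.readLong();
          }
          return offsets;
        } finally {
          in.close();
        }
      }
    }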

Yonik had actually mentioned in LUCENE-704 that fixing CFS writing to not use
seek was not very important, i.e., it would be OK to not use compound files
when HDFS is the store.

Does anyone see a better approach?

> [PATCH] Indexing on Hadoop distributed file system
> --------------------------------------------------
>
>                 Key: LUCENE-532
>                 URL: http://issues.apache.org/jira/browse/LUCENE-532
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 1.9
>            Reporter: Igor Bolotin
>            Priority: Minor
>         Attachments: indexOnDFS.patch, SegmentTermEnum.patch, 
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on
> the Hadoop distributed file system. When we tried to do it directly on DFS
> using the Nutch FsDirectory class, we immediately found that indexing fails
> because the DfsIndexOutput.seek() method throws UnsupportedOperationException.
> The reason for this behavior is clear: DFS does not support random updates,
> so the seek() method can't be supported (at least not easily).
>  
> Well, if we can't support random updates, the question is: do we really need
> them? A search of the Lucene code revealed two places that call the
> IndexOutput.seek() method: one in TermInfosWriter and another in
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter, the
> only place that concerned us was TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to write the
> total number of terms back into the beginning of the file. It was very simple
> to change the file format a little and write the number of terms into the
> last 8 bytes of the file instead of into the beginning. The only other place
> that needs to be fixed for this to work is the SegmentTermEnum constructor,
> which must read this piece of information at position = file length - 8.
>  
> With this format hack, we were able to use FsDirectory to write the index
> directly to DFS without any problems. We still don't index directly to DFS
> for performance reasons, but at least we can build small local indexes and
> merge them into the main index on DFS without copying the big main index
> back and forth.
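For readers who want the shape of the change, here is a simplified,
hypothetical sketch of the format hack described in the quoted report above
(not the attached patch itself): the term count is appended as the final 8
bytes instead of being seeked back into the header, and the reader pulls it
from position (length - 8):

    import java.io.IOException;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Simplified illustration of the seek-free variant described above.
    class TermCountAtEndSketch {

      // Writer side: called once at close(), after all terms have been
      // written.  The count is simply appended, so no seek() back to the
      // file header is needed.
      static void writeTermCount(IndexOutput out, long termCount) throws IOException {
        out.writeLong(termCount);
        out.close();
      }

      // Reader side: the equivalent of the SegmentTermEnum constructor change.
      // Seeking on an IndexInput remains fine here, since only random writes
      // (not random reads) are the problem on DFS.
      static long readTermCount(IndexInput in) throws IOException {
        in.seek(in.length() - 8);
        return in.readLong();
      }
    }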
