[jira] Commented: (LUCENE-2373) Change StandardTermsDictWriter to work with streaming and append-only filesystems

Michael McCandless (JIRA) Wed, 07 Apr 2010 03:10:58 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854409#action_12854409
 ]


Michael McCandless commented on LUCENE-2373:
--------------------------------------------

I would love to make Lucene truly write once (and remove IndexOutput.seek), 
but... this approach makes me a little nervous...

In some environments, relying on the length of the file to be accurate might be 
risky: it's metadata, which can be subject to different client-side caching than 
the file's contents.  E.g., on NFS I've seen issues where the file length was 
stale yet the file contents were not.

Maybe we could offer a separate codec that takes this approach, for use on 
filesystems like HDFS that can't seek during write?  We should refactor 
standard codec so that "where this long gets stored" can be easily overridden 
by a subclass.

Or, alternatively, we could write this "index of the index" to a separate file?
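
To make the trade-off concrete, here is a minimal sketch of the "pointer at the 
end" layout using only java.io, not Lucene's IndexOutput/IndexInput. The class 
and method names (TrailerPointerSketch, write, readDirStart) are hypothetical, 
and the byte arrays stand in for the real terms data and terms index. The 
reader takes the file length as a parameter to show that the result is only as 
good as the length it is given.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration only; these are not Lucene classes.
public class TrailerPointerSketch {

    // Append-only write: terms data, then the directory, then the 8-byte
    // pointer to the directory as the very last thing. No seek is needed.
    static byte[] write(byte[] terms, byte[] dir) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(terms);
        long dirStart = out.size();   // bytes written so far
        out.write(dir);
        out.writeLong(dirStart);      // trailer: always 8 bytes
        out.flush();
        return bytes.toByteArray();
    }

    // Read the pointer from the last 8 bytes, as located by the reported
    // length. If that length is stale, the wrong 8 bytes are decoded.
    static long readDirStart(byte[] file, long reportedLength) throws IOException {
        if (reportedLength < 8 || reportedLength > file.length) {
            throw new IOException("implausible file length: " + reportedLength);
        }
        int offset = (int) (reportedLength - 8);
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(file, offset, 8));
        return in.readLong();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = write(new byte[] {1, 2, 3, 4, 5}, new byte[] {9, 9});
        System.out.println("accurate length: " + readDirStart(file, file.length));
        System.out.println("stale length:    " + readDirStart(file, file.length - 1));
    }
}
```

With an accurate length the reader recovers the pointer that was written; with 
a length that is off by even one byte it silently decodes a different value, 
which is the failure mode that worries me on NFS.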

> Change StandardTermsDictWriter to work with streaming and append-only 
> filesystems
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-2373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2373
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Andrzej Bialecki 
>             Fix For: 3.1
>
>
> Since early 2.x times Lucene used a skip/seek/write trick to patch the length 
> of the terms dict into a place near the start of the output data file. This 
> however made it impossible to use Lucene with append-only filesystems such as 
> HDFS.
> In the post-flex trunk the following code in StandardTermsDictWriter 
> initiates this:
> {code}
>     // Count indexed fields up front
>     CodecUtil.writeHeader(out, CODEC_NAME, VERSION_CURRENT); 
>     out.writeLong(0);   // leave space for end index pointer
> {code}
> and completes this in close():
> {code}
>       out.seek(CodecUtil.headerLength(CODEC_NAME));
>       out.writeLong(dirStart);
> {code}
> I propose to change this layout so that this pointer is stored simply at the 
> end of the file. It's always 8 bytes long, and we know the final length of 
> the file from Directory, so it's a single additional seek(length - 8) to read 
> it, which is not much considering the benefits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
