[ 
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274271#comment-13274271
 ] 

Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

After discussing this further with Robert, it looks like this is a (smaller) 
part of a larger issue: SegmentInfo+FieldInfo should be made extensible, and 
the process of reading/writing this information should be *completely 
codec-specific*. Let's make a separate issue for that part.

And the smaller issue discussed here is to record only the information about a 
commit point in a *completely codec-independent, versioned format*, whatever 
that format is. Let's call it CommitInfo or whatever other name fits. This part 
would be written to a file that is separate from the codec-dependent parts.

Regarding two-phase commit and checksums - one reason we have 
SegmentInfosWriter/Reader is the AppendingCodec: the default implementation 
couldn't be made to work on append-only filesystems. However, we could change 
the two-phase commit implementation to the following:

* write the data to the CommitInfo file
* write a marker indicating "end of data, checksum follows"
* finally, write the checksum

Then the reading code knows that:

* if the marker is missing then the file is invalid
* if the marker is present then the checksum must be present too
* and the checksum must be correct.

This implementation doesn't require seeking back or overwriting, so it works 
on any filesystem, including append-only ones.
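
To make the steps above concrete, here is a minimal sketch in plain java.io, 
not tied to Lucene's Directory API. The class name CommitInfoFile, the 
END_MARKER value, the length prefix and the choice of CRC32 are all 
assumptions made for illustration; none of them is part of the proposal.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical helper: writes and validates a commit file using only appends.
final class CommitInfoFile {
    // Assumed marker value meaning "end of data, checksum follows".
    static final int END_MARKER = 0xC0FFEE01;

    // Appends: length, payload, marker, then the CRC32 of everything before it.
    static void write(OutputStream raw, byte[] payload) throws IOException {
        CheckedOutputStream checked = new CheckedOutputStream(raw, new CRC32());
        DataOutputStream out = new DataOutputStream(checked);
        out.writeInt(payload.length);   // step 1: the data (length-prefixed)
        out.write(payload);
        out.writeInt(END_MARKER);       // step 2: the marker
        out.flush();
        // Capture the checksum before the checksum bytes themselves are written.
        long crc = checked.getChecksum().getValue();
        out.writeLong(crc);             // step 3: the checksum, written last
        out.flush();
    }

    // Returns the payload, or throws IOException if the file is incomplete
    // or corrupt. A real reader would also bound the length field.
    static byte[] read(InputStream raw) throws IOException {
        CheckedInputStream checked = new CheckedInputStream(raw, new CRC32());
        DataInputStream in = new DataInputStream(checked);
        try {
            int length = in.readInt();
            if (length < 0) {
                throw new IOException("corrupt length: " + length);
            }
            byte[] payload = new byte[length];
            in.readFully(payload);
            if (in.readInt() != END_MARKER) {
                throw new IOException("end marker missing, file is invalid");
            }
            // Checksum of all bytes read so far, i.e. data plus marker.
            long expected = checked.getChecksum().getValue();
            long stored = in.readLong();
            if (stored != expected) {
                throw new IOException("checksum mismatch");
            }
            return payload;
        } catch (EOFException e) {
            // Marker or checksum was never written: the commit did not finish.
            throw new IOException("truncated commit file", e);
        }
    }
}
```

The writer only ever appends, so the "second phase" of the commit is simply 
writing the final checksum; a reader that hits end-of-file before that point 
treats the commit as not having happened.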
                
> Change SegmentInfos format to plain text
> ----------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>             Fix For: 4.0
>
>
> I propose to change the format of SegmentInfos file (segments_NN) to use 
> plain text instead of the current binary format.
> SegmentInfos file represents a commit point, and it also declares what codecs 
> were used for writing each of the segments that the commit point consists of. 
> However, this is a chicken and egg situation - in theory the format of this 
> file is customizable via Codec.getSegmentInfosFormat, but in practice we have 
> to first discover what is the codec implementation that wrote this file - so 
> the SegmentCoreReaders assumes a certain fixed binary layout of a preamble of 
> this file that contains the codec name... and then the file is read again, 
> only this time using the right Codec.
> This is ugly. Instead I propose to use a simple plain text format, either 
> line oriented properties or JSON, in such a way that newer versions could 
> easily extend it, and which wouldn't require any special Codec to read and 
> parse. Consequently we could remove SegmentInfosFormat altogether, and 
> instead add SegmentInfoFormat (notice the singular) to Codec to read single 
> per-segment SegmentInfo-s in a codec-specific way. E.g. for Lucene40 codec we 
> could either add another file or we could extend the .fnm file (FieldInfos) 
> to contain also this information. 
> Then the plain text SegmentInfos would contain just the following information:
> * list of global files for this commit point (if any)
> * list of segments for this commit point, and their corresponding codec class 
> names
> * user data map

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

