[jira] [Commented] (LUCENE-4050) Make segments_NN file codec-independent

Andrzej Bialecki (JIRA) Tue, 15 May 2012 16:15:32 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276325#comment-13276325
 ]


Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. The only small thing we lose is if a disk full is going to strike... 
I thought about this too. If it's really a big concern, we could use the 
following trick: more than 99% of filesystems allocate data in blocks that are 
multiples of 512 bytes. We could add filler bytes at the end of the file so 
that its length comes out to a round multiple of 512 bytes, and only then 
append the marker and the checksum. This way we know that writing the marker 
required allocating a new block, and if that succeeded then writing the 
checksum should also succeed.
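
The padding trick can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, the END_MARKER value and the use of CRC32 are hypothetical choices, not anything taken from Lucene, and a real implementation would write through Lucene's own output classes rather than a ByteBuffer.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: pad the file body to a 512-byte boundary, then
// append a marker and a checksum so both land in a freshly allocated block.
class PaddedCommitSketch {
    static final int BLOCK_SIZE = 512;          // assumed filesystem block granularity
    static final int END_MARKER = 0x454E4421;   // hypothetical marker value

    /** Number of filler bytes needed to reach the next multiple of BLOCK_SIZE. */
    static int paddingNeeded(long length) {
        return (int) ((BLOCK_SIZE - (length % BLOCK_SIZE)) % BLOCK_SIZE);
    }

    /** Returns payload + zero filler + marker (4 bytes) + CRC32 of the padded body (8 bytes). */
    static byte[] buildCommitBytes(byte[] payload) {
        int padded = payload.length + paddingNeeded(payload.length);
        ByteBuffer buf = ByteBuffer.allocate(padded + Integer.BYTES + Long.BYTES);
        buf.put(payload);
        buf.put(new byte[padded - payload.length]); // filler up to the block boundary

        // The checksum covers the payload and the filler, but not the marker.
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, padded);

        // The marker is the first write past the boundary, so it is the write
        // that needs a new block; the checksum then goes into that same block.
        buf.putInt(END_MARKER);
        buf.putLong(crc.getValue());
        return buf.array();
    }
}
```

Whether the marker and checksum really end up in one block still depends on the filesystem, so this narrows the disk-full window rather than closing it.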
                
> Make segments_NN file codec-independent
> ---------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>
> I propose to change the format of SegmentInfos file (segments_NN) to use 
> plain text instead of the current binary format.
> SegmentInfos file represents a commit point, and it also declares what codecs 
> were used for writing each of the segments that the commit point consists of. 
> However, this is a chicken-and-egg situation: in theory the format of this 
> file is customizable via Codec.getSegmentInfosFormat, but in practice we 
> first have to discover which codec implementation wrote this file. So 
> SegmentCoreReaders assumes a certain fixed binary layout for a preamble of 
> this file that contains the codec name... and then the file is read again, 
> only this time using the right Codec.
> This is ugly. Instead I propose to use a simple plain text format, either 
> line oriented properties or JSON, in such a way that newer versions could 
> easily extend it, and which wouldn't require any special Codec to read and 
> parse. Consequently we could remove SegmentInfosFormat altogether, and 
> instead add SegmentInfoFormat (notice the singular) to Codec to read single 
> per-segment SegmentInfo-s in a codec-specific way. E.g. for the Lucene40 
> codec we could either add another file or we could extend the .fnm file 
> (FieldInfos) to also contain this information. 
> Then the plain text SegmentInfos would contain just the following information:
> * list of global files for this commit point (if any)
> * list of segments for this commit point, and their corresponding codec class 
> names
> * user data map
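
To make the line-oriented variant concrete, here is a minimal sketch of how such a file could be read without any Codec. The key layout ("segment.<name>.codec=<class>", "userdata.<key>=<value>") and the class name are assumptions made up for this illustration; the proposal does not define a concrete syntax.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch: read a properties-style commit file and extract the
// codec class name declared for each segment.
class PlainTextSegmentsSketch {
    private static final String PREFIX = "segment.";  // assumed key prefix
    private static final String SUFFIX = ".codec";    // assumed key suffix

    /** Maps segment name to codec class name; all other keys are ignored. */
    static Map<String, String> readSegmentCodecs(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> codecs = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            boolean longEnough = key.length() > PREFIX.length() + SUFFIX.length();
            if (longEnough && key.startsWith(PREFIX) && key.endsWith(SUFFIX)) {
                String name = key.substring(PREFIX.length(), key.length() - SUFFIX.length());
                codecs.put(name, props.getProperty(key));
            }
        }
        return codecs;
    }
}
```

Unknown keys are simply skipped, which is what would let newer versions extend the file without breaking older readers.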
