[jira] [Commented] (LUCENE-4050) Change SegmentInfos format to plain text

Andrzej Bialecki (JIRA) Sat, 12 May 2012 13:15:12 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-4050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274064#comment-13274064
 ]


Andrzej Bialecki  commented on LUCENE-4050:
-------------------------------------------

bq. Plain text encoding of these files would be really nice but isn't as 
important, I think...

Yeah, it could be sufficient if we agreed to strictly separate the plain 
list of "segments:codec" entries from the segmentInfo/fieldInfo parts and 
push those parts down to the codec-specific formats.

Then we could just use a version number as the first element of this file to 
allow for future extensions, e.g. switching to JSON or to some other format 
du jour.
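
To make this concrete, here is a rough sketch of how such a file could be 
read without any Codec. Everything in it is an assumption for illustration 
only: the layout (first line is the version, then one "segment:codec" line 
per segment), the class name PlainSegmentsSketch and its parse method are 
hypothetical and not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for a versioned, line-oriented segments file; not Lucene API. */
final class PlainSegmentsSketch {

    /** Parsed result: the format version plus segment name -> codec name, in file order. */
    static final class Commit {
        public final int version;
        public final Map<String, String> segmentToCodec;

        Commit(int version, Map<String, String> segmentToCodec) {
            this.version = version;
            this.segmentToCodec = segmentToCodec;
        }
    }

    /**
     * Assumed layout: line 1 holds the format version, each following
     * non-empty line holds "segmentName:codecName".
     */
    static Commit parse(String text) {
        String[] lines = text.split("\\R");
        int version = Integer.parseInt(lines[0].trim());
        if (version != 1) {
            // A newer reader would dispatch on the version here (e.g. to a JSON parser).
            throw new IllegalArgumentException("unsupported format version: " + version);
        }
        Map<String, String> segmentToCodec = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0 || colon == line.length() - 1) {
                throw new IllegalArgumentException("malformed segment line: " + line);
            }
            segmentToCodec.put(line.substring(0, colon), line.substring(colon + 1));
        }
        return new Commit(version, segmentToCodec);
    }
}
```

With this layout, a commit point with two segments would be the three lines 
"1", "_0:Lucene40" and "_1:SimpleText". The reader only needs the version 
line to decide how to interpret the rest of the file.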

bq. Surely this is just some problem only on windows 3.1 and java 1.2 or 
something and now fixed, since this is how every other linux/cygwin program 
(e.g. vi) works.

I'm not so sure. I know for a fact that Windows doesn't allow renames or 
deletes of open files, whether the file is held open by your own process or 
by some other one (e.g. a user examining the file in Notepad.exe), and IIRC 
the issue was that the JVM doesn't release OS file handles quickly enough.
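
For illustration, one common way to cope with a rename that fails while a 
handle is still open is to retry it a few times. This is only a sketch under 
my own assumptions, not what Lucene does: RenameRetrySketch and 
renameWithRetry are hypothetical names, and it uses the java.nio.file API 
from Java 7.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: retries a rename that may fail while the file is still open. */
final class RenameRetrySketch {

    /**
     * Tries to move source to target up to 'attempts' times, pausing between
     * failed tries. Returns true once the move succeeds, false if every
     * attempt failed or the thread was interrupted while waiting.
     */
    static boolean renameWithRetry(Path source, Path target, int attempts, long pauseMillis) {
        for (int i = 0; i < attempts; i++) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
                return true;
            } catch (IOException e) {
                // The move failed, e.g. because some process still holds the file open.
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A retry loop only hides the problem when the handle is released shortly 
afterwards; it cannot help if another program keeps the file open.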
                
> Change SegmentInfos format to plain text
> ----------------------------------------
>
>                 Key: LUCENE-4050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4050
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Andrzej Bialecki 
>             Fix For: 4.0
>
>
> I propose to change the format of SegmentInfos file (segments_NN) to use 
> plain text instead of the current binary format.
> SegmentInfos file represents a commit point, and it also declares what codecs 
> were used for writing each of the segments that the commit point consists of. 
> However, this is a chicken and egg situation - in theory the format of this 
> file is customizable via Codec.getSegmentInfosFormat, but in practice we have 
> to first discover what is the codec implementation that wrote this file - so 
> the SegmentCoreReaders assumes a certain fixed binary layout of a preamble of 
> this file that contains the codec name... and then the file is read again, 
> only this time using the right Codec.
> This is ugly. Instead I propose to use a simple plain text format, either 
> line oriented properties or JSON, in such a way that newer versions could 
> easily extend it, and which wouldn't require any special Codec to read and 
> parse. Consequently we could remove SegmentInfosFormat altogether, and 
> instead add SegmentInfoFormat (notice the singular) to Codec to read single 
> per-segment SegmentInfo-s in a codec-specific way. E.g. for Lucene40 codec we 
> could either add another file or we could extend the .fnm file (FieldInfos) 
> to contain also this information. 
> Then the plain text SegmentInfos would contain just the following information:
> * list of global files for this commit point (if any)
> * list of segments for this commit point, and their corresponding codec class 
> names
> * user data map

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
