[jira] [Updated] (LUCENE-5969) Add Lucene50Codec

Robert Muir (JIRA) Sat, 04 Oct 2014 09:58:47 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-5969:
--------------------------------
    Attachment: LUCENE-5969_part2.patch

I think the branch is currently in a good state to do an intermediate merge. 
Then we can tackle postings and docvalues.
This patch can be applied, but it's large because of lots of svn moves.

* All per-segment files are moved to write/checkSegmentHeader, and they also 
verify segment suffix/generation to fully detect mismatched files. I fixed all 
5.0 (except dv/postings, still TODO) and all of codecs/ to do this.
* All 5.0 init methods (except dv/postings, and a couple guys in codecs/: still 
TODO) use the new checkFooter(in, Throwable) to append suppressed checksum 
status if they hit corruption on open.
* CFS is moved to the codec API, with a write method that handles all files at 
once, and a read method that returns read-only directory view. Added a new 
simpler impl for 5.0, and a simpletext impl. Moved all CFS tests to 
BaseCompoundFormatTestCase which they all use. SegmentReader no longer opens 
the CFS file twice.
* Merging uses codec producer APIs instead of readers. This leads to more 
optimized merging: checksum computation is per-segment/per-producer, and norms 
and docvalues don't pile up unused fields into RAM during merge. If the fields 
are already loaded, they use them, but otherwise they load the field, but don't 
cache it. This is important not just for "abuse" cases, but should really 
improve use cases like offline indexing. I fixed all codecs (5.0, codecs/, 
backwards/) to not waste RAM like this.
* 5.0 norms have a new indirect encoding for sparse fields. Currently this is 
very conservative as 1/31 to ensure it's more efficient in terms of both space 
(maximum possible packedints bloat) and time (v log v < maxdoc).
* Backwards codecs are more contained: I tried to reduce visibility, make them 
as read-only as possible, ensure all files are deprecated, etc.
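
To illustrate the checkFooter(in, Throwable) item above, here is a rough, 
standalone sketch of the pattern: if init hits corruption, the checksum status 
is appended to the original exception as a suppressed exception instead of 
masking it. This is not the code in the patch. FooterCheckSketch, 
CorruptFileException, the "HDR:" header check and the byte[]-based signatures 
are all made up for illustration, and plain java.util.zip.CRC32 stands in for 
the real checksum input.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: a plain-JDK stand-in for the "check footer with prior exception" idea.
public class FooterCheckSketch {

    /** Hypothetical exception type used to report corruption. */
    static class CorruptFileException extends IOException {
        CorruptFileException(String msg) { super(msg); }
    }

    /** CRC32 over the whole body (assumption: the footer stores this value). */
    static long crc(byte[] body) {
        CRC32 c = new CRC32();
        c.update(body, 0, body.length);
        return c.getValue();
    }

    /**
     * With no prior exception, throws only on checksum mismatch.
     * With a prior exception, records the checksum status on it as a
     * suppressed exception and rethrows the prior exception.
     */
    static void checkFooter(byte[] body, long storedCrc, Throwable prior) throws IOException {
        boolean ok = crc(body) == storedCrc;
        if (prior == null) {
            if (!ok) throw new CorruptFileException("checksum failed");
            return;
        }
        prior.addSuppressed(new CorruptFileException(ok ? "checksum passed" : "checksum failed"));
        if (prior instanceof IOException) throw (IOException) prior;
        if (prior instanceof RuntimeException) throw (RuntimeException) prior;
        if (prior instanceof Error) throw (Error) prior;
        throw new IOException(prior);
    }

    /** Hypothetical init method: parse the file, then always verify the footer. */
    static String open(byte[] body, long storedCrc) throws IOException {
        Throwable prior = null;
        String parsed = null;
        try {
            parsed = new String(body, StandardCharsets.UTF_8);
            if (!parsed.startsWith("HDR:")) {
                throw new CorruptFileException("bad header");
            }
        } catch (Throwable t) {
            prior = t; // remember the original failure
        } finally {
            checkFooter(body, storedCrc, prior); // rethrows prior, annotated
        }
        return parsed;
    }
}
```

With this shape, a caller that sees "bad header" can look at getSuppressed() 
to tell whether the checksum also failed (likely real corruption) or passed 
(more likely a bug or a mismatched file).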


> Add Lucene50Codec
> -----------------
>
>                 Key: LUCENE-5969
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5969
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: 5.0, Trunk
>
>         Attachments: LUCENE-5969.patch, LUCENE-5969.patch, 
> LUCENE-5969_part2.patch
>
>
> Spinoff from LUCENE-5952:
>   * Fix .si to write Version as 3 ints, not a String that requires parsing at 
> read time.
>   * Lucene42TermVectorsFormat should not use the same codecName as 
> Lucene41StoredFieldsFormat
> It would also be nice if we had a "bumpCodecVersion" script so rolling a new 
> codec is not so daunting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

