[
https://issues.apache.org/jira/browse/LUCENE-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-5969:
--------------------------------
Attachment: LUCENE-5969_part2.patch
I think the branch is currently in a good state to do an intermediate merge.
Then we can tackle postings and docvalues.
This patch can be applied, but it's large because of lots of svn moves.
* All per-segment files are moved to write/checkSegmentHeader, and they also
verify segment suffix/generation to fully detect mismatched files. I fixed all
5.0 (except dv/postings, still TODO) and all of codecs/ to do this.
* All 5.0 init methods (except dv/postings, and a couple guys in codecs/: still
TODO) use the new checkFooter(in, Throwable) to append suppressed checksum
status if they hit corruption on open.
* CFS is moved to the codec API, with a write method that handles all files at
once, and a read method that returns a read-only directory view. Added a new
simpler impl for 5.0, and a simpletext impl. Moved all CFS tests to
BaseCompoundFormatTestCase which they all use. SegmentReader no longer opens
the CFS file twice.
* Merging uses codec producer APIs instead of readers. This leads to more
optimized merging: checksum computation is per-segment/per-producer, and norms
and docvalues don't pile up unused fields into RAM during merge. If the fields
are already loaded, they use them; otherwise they load the field but don't
cache it. This is important not just for "abuse" cases, but should really
improve use cases like offline indexing. I fixed all codecs (5.0, codecs/,
backwards/) to not waste RAM like this.
* 5.0 norms have a new indirect encoding for sparse fields. Currently this is
very conservative as 1/31 to ensure it's more efficient in terms of both space
(maximum possible packedints bloat) and time (v log v < maxdoc).
* Backwards codecs are more contained: I tried to reduce visibility, make them
as read-only as possible, ensure all files are deprecated, etc.
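For readers who have not opened the patch, the checkFooter(in, Throwable) idea from the list above can be sketched roughly as follows. This is only an illustration, not the code in the patch: it works on a byte[] instead of an index input, it assumes a CRC32 footer, and the class and helper names (FooterCheckSketch, withFooter, checksumMatches) are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: names and the byte[]-based API are assumptions, not the patch's code.
final class FooterCheckSketch {

  /** Returns payload followed by an 8-byte footer holding the CRC32 of the payload. */
  static byte[] withFooter(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(payload.length + 8)
        .put(payload)
        .putLong(crc.getValue())
        .array();
  }

  /** Recomputes the CRC32 over everything before the footer and compares it to the stored value. */
  static boolean checksumMatches(byte[] file) {
    if (file.length < 8) {
      return false; // too short to even contain a footer
    }
    int payloadLength = file.length - 8;
    CRC32 crc = new CRC32();
    crc.update(file, 0, payloadLength);
    long stored = ByteBuffer.wrap(file, payloadLength, 8).getLong();
    return stored == crc.getValue();
  }

  /**
   * If priorException is null, simply verifies the footer. Otherwise attaches the
   * checksum status to priorException as a suppressed exception and rethrows it,
   * so the original failure stays the primary error.
   */
  static void checkFooter(byte[] file, Throwable priorException) throws IOException {
    boolean ok = checksumMatches(file);
    if (priorException == null) {
      if (!ok) {
        throw new IOException("checksum failed");
      }
      return;
    }
    priorException.addSuppressed(
        new IOException(ok ? "checksum passed" : "checksum failed"));
    if (priorException instanceof IOException) {
      throw (IOException) priorException;
    }
    if (priorException instanceof RuntimeException) {
      throw (RuntimeException) priorException;
    }
    if (priorException instanceof Error) {
      throw (Error) priorException;
    }
    throw new IOException(priorException);
  }
}
```

The point of the suppressed exception is diagnostic: whoever reads the stack trace sees the original parse failure and, alongside it, whether the file's checksum was intact, which helps tell a corrupt file from a reader bug.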
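The "use it if loaded, otherwise load but don't cache" behavior described for merging above can be illustrated with a small sketch. Everything here is hypothetical (MergeLoadSketch, get, getForMerge, and the Function standing in for a disk read); it only shows the caching policy, not the actual producer APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the caching policy only; not thread-safe and not the patch's API.
final class MergeLoadSketch {
  private final Map<String, long[]> cache = new HashMap<>();
  private final Function<String, long[]> loader; // stands in for reading a field from disk
  int loads = 0; // counts how many times the loader was invoked

  MergeLoadSketch(Function<String, long[]> loader) {
    this.loader = loader;
  }

  /** Search-time access: load on first use and keep the field in RAM. */
  long[] get(String field) {
    long[] values = cache.get(field);
    if (values == null) {
      values = load(field);
      cache.put(field, values);
    }
    return values;
  }

  /** Merge-time access: reuse a cached field if present, otherwise load it without caching. */
  long[] getForMerge(String field) {
    long[] cached = cache.get(field);
    return cached != null ? cached : load(field);
  }

  int cachedFieldCount() {
    return cache.size();
  }

  private long[] load(String field) {
    loads++;
    return loader.apply(field);
  }
}
```

With this policy a merge that touches many fields never grows the cache; only fields that searches have already asked for stay resident.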
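As a rough picture of what an indirect encoding for sparse norms can look like, here is a minimal sketch. It assumes the 1/31 figure above is the maximum fraction of documents that may carry a non-default value; the exact rule in the patch may differ. The class name, the sorted-array layout, and the binary-search lookup are all illustrative choices, not the actual on-disk format.

```java
import java.util.Arrays;

// Hypothetical sketch: layout, names, and the reading of the 1/31 threshold are assumptions.
final class IndirectNormsSketch {

  /** Assumed rule: go indirect only when fewer than 1/31 of the documents have a value. */
  static boolean useIndirect(int docsWithValue, int maxDoc) {
    return (long) docsWithValue * 31 < maxDoc;
  }

  private final int[] docIds;     // sorted IDs of the documents that have an explicit value
  private final long[] values;    // values[i] belongs to docIds[i]
  private final long commonValue; // returned for every other document

  IndirectNormsSketch(int[] sortedDocIds, long[] values, long commonValue) {
    if (sortedDocIds.length != values.length) {
      throw new IllegalArgumentException("docIds and values must have the same length");
    }
    this.docIds = sortedDocIds;
    this.values = values;
    this.commonValue = commonValue;
  }

  /** Looks up a document with a binary search over the sparse doc ID list. */
  long get(int docId) {
    int index = Arrays.binarySearch(docIds, docId);
    return index >= 0 ? values[index] : commonValue;
  }
}
```

Storage here grows with the number of documents that have a value rather than with maxDoc, and each lookup costs a binary search over that smaller set, which is why such an encoding only pays off for sufficiently sparse fields.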
> Add Lucene50Codec
> -----------------
>
> Key: LUCENE-5969
> URL: https://issues.apache.org/jira/browse/LUCENE-5969
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 5.0, Trunk
>
> Attachments: LUCENE-5969.patch, LUCENE-5969.patch,
> LUCENE-5969_part2.patch
>
>
> Spinoff from LUCENE-5952:
> * Fix .si to write Version as 3 ints, not a String that requires parsing at
> read time.
> * Lucene42TermVectorsFormat should not use the same codecName as
> Lucene41StoredFieldsFormat
> It would also be nice if we had a "bumpCodecVersion" script so rolling a new
> codec is not so daunting.