Github user tokee commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/525#discussion_r244328885

    --- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java ---
    @@ -0,0 +1,542 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.lucene.codecs.lucene80;
    +
    +import java.io.DataInput;
    +import java.io.IOException;
    +
    +import org.apache.lucene.search.DocIdSetIterator;
    +import org.apache.lucene.store.IndexInput;
    +import org.apache.lucene.store.IndexOutput;
    +import org.apache.lucene.store.RandomAccessInput;
    +import org.apache.lucene.util.ArrayUtil;
    +import org.apache.lucene.util.BitSetIterator;
    +import org.apache.lucene.util.FixedBitSet;
    +import org.apache.lucene.util.RoaringDocIdSet;
    +
    +/**
    + * Disk-based implementation of a {@link DocIdSetIterator} which can return
    + * the index of the current document, i.e. the ordinal of the current document
    + * among the list of documents that this iterator can return. This is useful
    + * to implement sparse doc values by only having to encode values for documents
    + * that actually have a value.
    + * <p>Implementation-wise, this {@link DocIdSetIterator} is inspired of
    + * {@link RoaringDocIdSet roaring bitmaps} and encodes ranges of {@code 65536}
    + * documents independently and picks between 3 encodings depending on the
    + * density of the range:<ul>
    + * <li>{@code ALL} if the range contains 65536 documents exactly,
    + * <li>{@code DENSE} if the range contains 4096 documents or more; in that
    + * case documents are stored in a bit set,
    + * <li>{@code SPARSE} otherwise, and the lower 16 bits of the doc IDs are
    + * stored in a {@link DataInput#readShort() short}.
    + * </ul>
    + * <p>Only ranges that contain at least one value are encoded.
    + * <p>This implementation uses 6 bytes per document in the worst-case, which happens
    + * in the case that all ranges contain exactly one document.
    + *
    + *
    + * To avoid O(n) lookup time complexity, with n being the number of documents, two lookup
    + * tables are used: A lookup table for block blockCache and index, and a rank structure
    + * for DENSE block lookups.
    + *
    + * The lookup table is an array of {@code long}s with an entry for each block. It allows for
    + * direct jumping to the block, as opposed to iteration from the current position and forward
    + * one block at a time.
    + *
    + * Each long entry consists of 2 logical parts:
    + *
    + * The first 31 bits hold the index (number of set bits in the blocks) up to just before the
    + * wanted block. The next 33 bits holds the offset in bytes into the underlying slice.
    + * As there is a maximum of 2^16 blocks, it follows that the maximum size of any block must
    + * not exceed 2^17 bits to avoid overflow. This is currently the case, with the largest
    + * block being DENSE and using 2^16 + 288 bits, and is likely to continue to hold as using
    + * more than double the amount of bits is unlikely to be an efficient representation.
    + * The cache overhead is numDocs/1024 bytes.
    --- End diff --

Nice catch.
That was me mixing bits & bytes and arriving at too harsh a requirement, which made the representation needlessly complicated. All block types are < 2^17 *bits*, but the offset is in *bytes*, so the largest offset is not 2^16 * 2^17 (33 bits) but 2^16 * 2^17 / 2^3 = 2^30, which fits in 30 bits. With the offset requirement lowered from 33 to 30 bits, it is much more natural to represent offset & index as two ints. This will be in the next commit.
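
To make the arithmetic concrete, here is a minimal Java sketch of the bound and of a two-int entry layout. The class name `JumpEntrySketch`, its helpers, and the interleaved `int[]` layout are hypothetical illustrations and are not taken from the pull request; the upcoming commit may well lay the data out differently.

```java
// Hypothetical sketch; none of these names come from IndexedDISI.
final class JumpEntrySketch {
  static final long MAX_BLOCKS = 1L << 16;      // at most 2^16 blocks
  static final long MAX_BLOCK_BITS = 1L << 17;  // every block is smaller than 2^17 bits

  // Exclusive upper bound on the byte offset of any block start.
  static long offsetBoundBytes() {
    return MAX_BLOCKS * MAX_BLOCK_BITS / 8;
  }

  // Number of bits needed to represent every value in [0, exclusiveBound).
  static int bitsFor(long exclusiveBound) {
    return 64 - Long.numberOfLeadingZeros(exclusiveBound - 1);
  }

  // Assumed layout: two ints per block, index first, byte offset second.
  static void put(int[] table, int block, int index, int offset) {
    table[2 * block] = index;
    table[2 * block + 1] = offset;
  }

  static int index(int[] table, int block) {
    return table[2 * block];
  }

  static int offset(int[] table, int block) {
    return table[2 * block + 1];
  }

  public static void main(String[] args) {
    // Counting the offset in bits (the original mix-up) needs 33 bits.
    System.out.println(bitsFor(MAX_BLOCKS * MAX_BLOCK_BITS)); // prints 33
    // Counting it in bytes needs only 30, so it fits in a non-negative int.
    System.out.println(bitsFor(offsetBoundBytes()));          // prints 30

    int[] table = new int[2 * 4];
    put(table, 3, 70_000, (1 << 30) - 1);
    System.out.println(index(table, 3) + " " + offset(table, 3)); // prints 70000 1073741823
  }
}
```

Since both the 31-bit index and the 30-bit offset fit in a non-negative `int`, no shifting or masking is needed to read either field back.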