gianm commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r992149440


##########
core/src/main/java/org/apache/druid/java/util/common/StringUtils.java:
##########
@@ -108,6 +108,15 @@ public static String fromUtf8(final ByteBuffer buffer)
     return StringUtils.fromUtf8(buffer, buffer.remaining());
   }
 
+  @Nullable
+  public static String fromUtf8Nullable(@Nullable final ByteBuffer buffer)

Review Comment:
   Javadoc should specify whether the buffer position is advanced, and by how 
much.



##########
core/src/main/java/org/apache/druid/segment/data/VByte.java:
##########
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import java.nio.ByteBuffer;
+
+public class VByte
+{
+  /**
+   * Read a variable byte (vbyte) encoded integer from a {@link ByteBuffer} at 
the current position.

Review Comment:
   Javadoc should specify whether the buffer position is advanced, and by how 
much.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexed.java:
##########
@@ -0,0 +1,491 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.base.Preconditions;
+import com.google.common.base.Supplier;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+/**
+ * {@link Indexed} specialized for storing variable-width binary values (such 
as utf8 encoded strings), which must be
+ * sorted and unique, using 'front coding'. Front coding is a type of delta 
encoding for byte arrays, where sorted
+ * values are grouped into buckets. The first value of the bucket is written 
entirely, and remaining values are stored
+ * as a pair of an integer which indicates how much of the first byte array of 
the bucket to use as a prefix, followed
+ * by the remaining bytes after the prefix to complete the value.
+ *
+ * Getting a value first picks the appropriate bucket, finds its offset in the 
underlying buffer, then scans the bucket
+ * values to seek to the correct position of the value within the bucket in 
order to reconstruct it using the prefix
+ * length.
+ *
+ * Finding the index of a value involves binary searching the first values of 
each bucket to find the correct bucket,
+ * then a linear scan within the bucket to find the matching value (or 
negative insertion point -1 for values that
+ * are not present).
+ *
+ * The value iterator reads an entire bucket at a time, reconstructing the 
values into an array to iterate within the
+ * bucket before moving onto the next bucket as the iterator is consumed.
+ */
+public final class FrontCodedIndexed implements Indexed<ByteBuffer>
+{
+  public static Supplier<FrontCodedIndexed> read(ByteBuffer buffer, ByteOrder 
ordering)
+  {
+    final ByteBuffer orderedBuffer = buffer.asReadOnlyBuffer().order(ordering);
+    final byte version = orderedBuffer.get();
+    Preconditions.checkArgument(version == 0, "only V0 exists, encountered " + 
version);
+    final int bucketSize = orderedBuffer.get();
+    final boolean hasNull = NullHandling.IS_NULL_BYTE == orderedBuffer.get();
+    final int numValues = VByte.readInt(orderedBuffer);
+    // size of offsets + values
+    final int size = VByte.readInt(orderedBuffer);
+    final int offsetsPosition = orderedBuffer.position();
+    // move position to end of buffer
+    buffer.position(offsetsPosition + size);
+
+    final int numBuckets = (int) Math.ceil((double) numValues / (double) 
bucketSize);
+    final int adjustIndex = hasNull ? 1 : 0;
+    final int div = Integer.numberOfTrailingZeros(bucketSize);
+    final int rem = bucketSize - 1;
+    return () -> new FrontCodedIndexed(
+        orderedBuffer,
+        bucketSize,
+        numBuckets,
+        (numValues & rem) == 0 ? bucketSize : numValues & rem,
+        hasNull,
+        numValues + adjustIndex,
+        adjustIndex,
+        div,
+        rem,
+        offsetsPosition,
+        offsetsPosition + ((numBuckets - 1) * Integer.BYTES)
+    );
+  }
+
+  private final ByteBuffer buffer;
+  private final int adjustedNumValues;
+  private final int adjustIndex;
+  private final int bucketSize;
+  private final int numBuckets;
+  private final int div;
+  private final int rem;
+  private final int offsetsPosition;
+  private final int bucketsPosition;
+  private final boolean hasNull;
+  private final int lastBucketNumValues;
+
+  private FrontCodedIndexed(
+      ByteBuffer buffer,
+      int bucketSize,
+      int numBuckets,
+      int lastBucketNumValues,
+      boolean hasNull,
+      int adjustedNumValues,
+      int adjustIndex,
+      int div,
+      int rem,
+      int offsetsPosition,
+      int bucketsPosition
+  )
+  {
+    if (Integer.bitCount(bucketSize) != 1) {
+      throw new ISE("bucketSize must be a power of two but was[%,d]", 
bucketSize);
+    }
+    this.buffer = buffer.asReadOnlyBuffer().order(buffer.order());
+    this.bucketSize = bucketSize;
+    this.hasNull = hasNull;
+    this.adjustedNumValues = adjustedNumValues;
+    this.adjustIndex = adjustIndex;
+    this.div = div;
+    this.rem = rem;
+    this.numBuckets = numBuckets;
+    this.offsetsPosition = offsetsPosition;
+    this.bucketsPosition = bucketsPosition;
+    this.lastBucketNumValues = lastBucketNumValues;
+  }
+
+  @Override
+  public int size()
+  {
+    return adjustedNumValues;
+  }
+
+  @Nullable
+  @Override
+  public ByteBuffer get(int index)
+  {
+    if (hasNull && index == 0) {
+      return null;
+    }
+
+    // due to vbyte encoding, the null value is not actually stored in the 
bucket (no negative values), so we adjust
+    // the index
+    final int adjustedIndex = index - adjustIndex;
+    // find the bucket which contains the value with maths
+    final int bucket = adjustedIndex >> div;
+    final int bucketIndex = adjustedIndex & rem;
+    final int offset = getBucketOffset(bucket);
+    buffer.position(offset);
+    return getFromBucket(buffer, bucketIndex);
+  }
+
+  @Override
+  public int indexOf(@Nullable ByteBuffer value)

Review Comment:
   Javadoc should explicitly tighten the contract such that this returns 
`(-(insertion point) - 1)` on no match.



##########
processing/src/main/java/org/apache/druid/segment/DictionaryEncodedColumnMerger.java:
##########
@@ -384,7 +387,10 @@ public void writeIndexes(@Nullable List<IntBuffer> 
segmentRowNumConversions) thr
     }
   }
 
-
+  protected DictionaryWriter<T> getWriter(String fileName)

Review Comment:
   IMO, a good compromise here would be to add javadocs to this method 
explaining the contract for overriding it, at least noting when it gets called.
   
   Also: `makeDictionaryWriter` seems like a better name than `getWriter`. It 
emphasizes that the method creates something, not fetches something. And that 
it's about the dictionary. (There's other writers beyond the dictionary writer.)



##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends 
Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to swtich the merging algorithm from 
doing binary-search per element in the value

Review Comment:
   switch (spelling)



##########
processing/src/test/java/org/apache/druid/segment/IndexSpecTest.java:
##########
@@ -40,7 +40,7 @@ public void testSerde() throws Exception
     final ObjectMapper objectMapper = new DefaultObjectMapper();
     final String json =
         "{ \"bitmap\" : { \"type\" : \"roaring\" }, \"dimensionCompression\" : 
\"lz4\", \"metricCompression\" : \"lzf\""
-        + ", \"longEncoding\" : \"auto\" }";
+        + ", \"longEncoding\" : \"auto\", 
\"stringDictionaryEncoding\":{\"type\":\"front-coded\", \"bucketSize\":16}}";

Review Comment:
   `frontCoded` seems more consistent with how we usually spell parameters like 
this.



##########
processing/src/test/java/org/apache/druid/query/QueryRunnerTestHelper.java:
##########
@@ -371,7 +372,8 @@ public static <T, QueryType extends Query<T>> 
List<QueryRunner<T>> makeQueryRunn
             new QueryableIndexSegment(noRollupMMappedTestIndex, SEGMENT_ID),
             "noRollupMMappedTestIndex"
         ),
-        makeQueryRunner(factory, new 
QueryableIndexSegment(mergedRealtimeIndex, SEGMENT_ID), "mergedRealtimeIndex")
+        makeQueryRunner(factory, new 
QueryableIndexSegment(mergedRealtimeIndex, SEGMENT_ID), "mergedRealtimeIndex"),
+        makeQueryRunner(factory, new 
QueryableIndexSegment(frontCodedMappedTestIndex, SEGMENT_ID), 
"frontCodedMappedTestIndex")

Review Comment:
   `frontCodedMMappedTestIndex` (spelling)



##########
processing/src/test/java/org/apache/druid/segment/filter/BaseFilterTest.java:
##########
@@ -347,27 +348,44 @@ public static Collection<Object[]> makeConstructors()
             })
             .build();
 
+    StringEncodingStrategy[] stringEncoding = new StringEncodingStrategy[]{
+        new StringEncodingStrategy.Utf8(),
+        new StringEncodingStrategy.FrontCoded(4)
+    };
     for (Map.Entry<String, BitmapSerdeFactory> bitmapSerdeFactoryEntry : 
bitmapSerdeFactories.entrySet()) {
       for (Map.Entry<String, SegmentWriteOutMediumFactory> 
segmentWriteOutMediumFactoryEntry :
           segmentWriteOutMediumFactories.entrySet()) {
         for (Map.Entry<String, Function<IndexBuilder, Pair<StorageAdapter, 
Closeable>>> finisherEntry :
             finishers.entrySet()) {
           for (boolean cnf : ImmutableList.of(false, true)) {
             for (boolean optimize : ImmutableList.of(false, true)) {
-              final String testName = StringUtils.format(
-                  "bitmaps[%s], indexMerger[%s], finisher[%s], cnf[%s], 
optimize[%s]",
-                  bitmapSerdeFactoryEntry.getKey(),
-                  segmentWriteOutMediumFactoryEntry.getKey(),
-                  finisherEntry.getKey(),
-                  cnf,
-                  optimize
-              );
-              final IndexBuilder indexBuilder = IndexBuilder
-                  .create()
-                  .schema(DEFAULT_INDEX_SCHEMA)
-                  .indexSpec(new IndexSpec(bitmapSerdeFactoryEntry.getValue(), 
null, null, null))
-                  
.segmentWriteOutMediumFactory(segmentWriteOutMediumFactoryEntry.getValue());
-              constructors.add(new Object[]{testName, indexBuilder, 
finisherEntry.getValue(), cnf, optimize});
+              for (StringEncodingStrategy encodingStrategy : stringEncoding) {
+                final String testName = StringUtils.format(
+                    "bitmaps[%s], indexMerger[%s], finisher[%s], cnf[%s], 
optimize[%s], encoding[%s]",

Review Comment:
   `stringDictionaryEncoding[%s]`?



##########
processing/src/main/java/org/apache/druid/segment/column/StringEncodingStrategies.java:
##########
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.data.DictionaryWriter;
+import org.apache.druid.segment.data.EncodedStringDictionaryWriter;
+import org.apache.druid.segment.data.FrontCodedIndexedWriter;
+import org.apache.druid.segment.data.GenericIndexed;
+import org.apache.druid.segment.data.GenericIndexedWriter;
+import org.apache.druid.segment.data.Indexed;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.Iterator;
+
+public class StringEncodingStrategies
+{
+  public static DictionaryWriter<String> getStringDictionaryWriter(
+      StringEncodingStrategy encodingStrategy,
+      SegmentWriteOutMedium writeoutMedium,
+      String fileName
+  )
+  {
+    // write plain utf8 in the legacy format, where generic indexed was 
written directly
+    if (StringEncodingStrategy.UTF8.equals(encodingStrategy.getType())) {
+      return new GenericIndexedWriter<>(writeoutMedium, fileName, 
GenericIndexed.STRING_STRATEGY);
+    } else {
+      // otherwise, we wrap in an EncodedStringDictionaryWriter so that we 
write a small header that includes
+      // a version byte that should hopefully never conflict with a 
GenericIndexed version, along with a byte
+      // from StringEncodingStrategy.getId to indicate which encoding strategy 
is used for the dictionary before
+      // writing the dictionary itself
+      DictionaryWriter<byte[]> writer;
+      if 
(StringEncodingStrategy.FRONT_CODED.equals(encodingStrategy.getType())) {
+        writer = new FrontCodedIndexedWriter(
+            writeoutMedium,
+            ByteOrder.nativeOrder(),

Review Comment:
   Better to use little endian. Otherwise, segments written on different 
architectures won't be compatible, which is weird.



##########
processing/src/main/java/org/apache/druid/segment/column/StringEncodingStrategy.java:
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+
+import javax.annotation.Nullable;
+import java.util.Objects;
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = 
StringEncodingStrategy.Utf8.class)

Review Comment:
   Is it necessary to have a `defaultImpl`? I always try to avoid it whenever 
possible, because it has this weird behavior where an unregistered `type` gets 
assigned to the default impl. It means that typos give you the default impl 
silently, which trips people up.



##########
processing/src/main/java/org/apache/druid/segment/IndexSpec.java:
##########
@@ -116,37 +98,44 @@ public IndexSpec(
    * @param dimensionCompression compression format for dimension columns, 
null to use the default.
    *                             Defaults to {@link 
CompressionStrategy#DEFAULT_COMPRESSION_STRATEGY}
    *
+   * @param stringDictionaryEncoding encoding strategy for string dictionaries 
of dictionary encoded string columns
+   *
    * @param metricCompression compression format for primitive type metric 
columns, null to use the default.
    *                          Defaults to {@link 
CompressionStrategy#DEFAULT_COMPRESSION_STRATEGY}
    *
    * @param longEncoding encoding strategy for metric and dimension columns 
with type long, null to use the default.
    *                     Defaults to {@link 
CompressionFactory#DEFAULT_LONG_ENCODING_STRATEGY}
+   *
+   * @param segmentLoader specify a {@link SegmentizerFactory} which will be 
written to 'factory.json' and used to load
+   *                      the written segment
    */
   @JsonCreator
   public IndexSpec(

Review Comment:
   What do you think about adding documentation for the new feature? Any reason 
not to? It would go here: 
https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#indexspec



##########
core/src/main/java/org/apache/druid/segment/data/VByte.java:
##########
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import java.nio.ByteBuffer;
+
+public class VByte
+{
+  /**
+   * Read a variable byte (vbyte) encoded integer from a {@link ByteBuffer} at 
the current position.
+   *
+   * vbyte encoding stores values in the last 7 bits of a byte and reserves 
the high bit for the 'continuation'. If 0,
+   * one or more additional bytes must be read to complete the value, and a 1 
indicates the terminal byte. Because of
+   * this, it can only store positive values, and larger integers can take up 
to 5 bytes.
+   *
+   * implementation based on:
+   * 
https://github.com/lemire/JavaFastPFOR/blob/master/src/main/java/me/lemire/integercompression/VariableByte.java
+   *
+   */
+  public static int readInt(ByteBuffer buffer)
+  {
+    byte b;
+    int v = (b = buffer.get()) & 0x7F;
+    if (b < 0) {
+      return v;
+    }
+    v = (((b = buffer.get()) & 0x7F) << 7) | v;
+    if (b < 0) {
+      return v;
+    }
+    v = (((b = buffer.get()) & 0x7F) << 14) | v;
+    if (b < 0) {
+      return v;
+    }
+    v = (((b = buffer.get()) & 0x7F) << 21) | v;
+    if (b < 0) {
+      return v;
+    }
+    v = ((buffer.get() & 0x7F) << 28) | v;
+    return v;
+  }
+
+  /**
+   * Write a variable byte (vbyte) encoded integer to a {@link ByteBuffer} at 
the current position.

Review Comment:
   Javadoc should specify whether the buffer position is advanced, and by how 
much.



##########
processing/src/main/java/org/apache/druid/segment/column/StringEncodingStrategy.java:
##########
@@ -0,0 +1,147 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+
+import javax.annotation.Nullable;
+import java.util.Objects;
+
+@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = 
StringEncodingStrategy.Utf8.class)
+@JsonSubTypes({
+    @JsonSubTypes.Type(value = StringEncodingStrategy.Utf8.class, name = 
StringEncodingStrategy.UTF8),
+    @JsonSubTypes.Type(value = StringEncodingStrategy.FrontCoded.class, name = 
StringEncodingStrategy.FRONT_CODED)
+})
+public interface StringEncodingStrategy
+{
+  Utf8 DEFAULT = new Utf8();
+  String UTF8 = "utf8";
+  String FRONT_CODED = "front-coded";
+
+  byte UTF8_ID = 0x00;
+  byte FRONT_CODED_ID = 0x01;
+
+  String getType();
+
+  byte getId();
+
+  @JsonTypeName(UTF8)
+  class Utf8 implements StringEncodingStrategy
+  {
+    @Override
+    public String getType()
+    {
+      return UTF8;
+    }
+
+    @Override
+    public byte getId()
+    {
+      return UTF8_ID;
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hashCode(UTF8);
+    }
+
+    @Override
+    public String toString()
+    {
+      return "Utf8{}";
+    }
+  }
+
+  @JsonTypeName(FRONT_CODED)

Review Comment:
   nit, I don't think we need this _and_ `@JsonSubTypes.Type`.



##########
processing/src/main/java/org/apache/druid/segment/column/StringEncodingStrategies.java:
##########
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.data.DictionaryWriter;
+import org.apache.druid.segment.data.EncodedStringDictionaryWriter;
+import org.apache.druid.segment.data.FrontCodedIndexedWriter;
+import org.apache.druid.segment.data.GenericIndexed;
+import org.apache.druid.segment.data.GenericIndexedWriter;
+import org.apache.druid.segment.data.Indexed;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.Iterator;
+
+public class StringEncodingStrategies
+{
+  public static DictionaryWriter<String> getStringDictionaryWriter(
+      StringEncodingStrategy encodingStrategy,
+      SegmentWriteOutMedium writeoutMedium,
+      String fileName
+  )
+  {
+    // write plain utf8 in the legacy format, where generic indexed was 
written directly
+    if (StringEncodingStrategy.UTF8.equals(encodingStrategy.getType())) {
+      return new GenericIndexedWriter<>(writeoutMedium, fileName, 
GenericIndexed.STRING_STRATEGY);
+    } else {
+      // otherwise, we wrap in an EncodedStringDictionaryWriter so that we 
write a small header that includes
+      // a version byte that should hopefully never conflict with a 
GenericIndexed version, along with a byte

Review Comment:
   "should hopefully" doesn't exactly inspire confidence! Can we ensure that it 
doesn't, perhaps by reserving a GenericIndexed version byte such that it will 
never actually be used by GenericIndexed, and then using that here?
   
   Or, at least, a comment in GenericIndexed so people editing that class in 
the future don't mess this up.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexed.java:
##########
@@ -0,0 +1,393 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.base.Preconditions;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+
+/**
+ * {@link Indexed} specialized for storing variable-width binary values (such 
as utf8 encoded strings), which must be
+ * sorted and unique, using 'front coding'. Front coding is a type of delta 
encoding for byte arrays, where sorted
+ * values are grouped into buckets. The first value of the bucket is written 
entirely, and remaining values are stored
+ * as a pair of an integer which indicates how much of the first byte array of 
the bucket to use as a prefix, followed
+ * by the remaining bytes after the prefix to complete the value.
+ *
+ * Getting a value first picks the appropriate bucket, finds its offset in the 
underlying buffer, then scans the bucket
+ * values to seek to the correct position of the value within the bucket in 
order to reconstruct it using the prefix
+ * length.
+ *
+ * Finding the index of a value involves binary searching the first values of 
each bucket to find the correct bucket,
+ * then a linear scan within the bucket to find the matching value (or 
negative insertion point -1 for values that
+ * are not present).
+ *
+ * The value iterator reads an entire bucket at a time, reconstructing the 
values into an array to iterate within the
+ * bucket before moving onto the next bucket as the iterator is consumed.
+ */
+public final class FrontCodedIndexed implements Indexed<ByteBuffer>
+{
+  public static FrontCodedIndexed read(ByteBuffer buffer, 
Comparator<ByteBuffer> comparator, ByteOrder ordering)
+  {
+    final ByteBuffer copy = buffer.asReadOnlyBuffer().order(ordering);
+    final byte version = copy.get();
+    Preconditions.checkArgument(version == 0, "only V0 exists, encountered " + 
version);
+    final int bucketSize = copy.get();
+    final boolean hasNull = NullHandling.IS_NULL_BYTE == copy.get();
+    final int numValues = VByte.readInt(copy);
+    // size of offsets + values
+    final int size = VByte.readInt(copy);
+    // move position to end of buffer
+    buffer.position(copy.position() + size);
+
+    final int numBuckets = (int) Math.ceil((double) numValues / (double) 
bucketSize);
+    final int adjustIndex = hasNull ? 1 : 0;
+    final int div = Integer.numberOfTrailingZeros(bucketSize);
+    final int rem = bucketSize - 1;
+    return new FrontCodedIndexed(
+        copy,
+        comparator,
+        bucketSize,
+        numBuckets,
+        (numValues & rem) == 0 ? bucketSize : numValues & rem,
+        hasNull,
+        numValues + adjustIndex,
+        adjustIndex,
+        div,
+        rem,
+        copy.position(),
+        copy.position() + ((numBuckets - 1) * Integer.BYTES)
+    );
+  }
+
+  private final ByteBuffer buffer;
+  private final int adjustedNumValues;
+  private final int adjustIndex;
+  private final int bucketSize;
+  private final int numBuckets;
+  private final int div;
+  private final int rem;
+  private final int offsetsPosition;
+  private final int bucketsPosition;
+  private final boolean hasNull;
+  private final int lastBucketNumValues;
+  private final Comparator<ByteBuffer> comparator;
+
+  private FrontCodedIndexed(
+      ByteBuffer buffer,
+      Comparator<ByteBuffer> comparator,
+      int bucketSize,
+      int numBuckets,
+      int lastBucketNumValues,
+      boolean hasNull,
+      int adjustedNumValues,
+      int adjustIndex,
+      int div,
+      int rem,
+      int offsetsPosition,
+      int bucketsPosition
+  )
+  {
+    if (Integer.bitCount(bucketSize) != 1) {
+      throw new ISE("bucketSize must be a power of two but was[%,d]", 
bucketSize);
+    }
+    this.buffer = buffer;
+    this.comparator = comparator;
+    this.bucketSize = bucketSize;
+    this.hasNull = hasNull;
+    this.adjustedNumValues = adjustedNumValues;
+    this.adjustIndex = adjustIndex;
+    this.div = div;
+    this.rem = rem;
+    this.numBuckets = numBuckets;
+    this.offsetsPosition = offsetsPosition;
+    this.bucketsPosition = bucketsPosition;
+    this.lastBucketNumValues = lastBucketNumValues;
+  }
+
+  @Override
+  public int size()
+  {
+    return adjustedNumValues;
+  }
+
+  @Nullable
+  @Override
+  public ByteBuffer get(int index)
+  {
+    final int adjustedIndex;
+    // due to vbyte encoding, the null value is not actually stored in the 
bucket (no negative values), so we adjust
+    // the index

Review Comment:
   I guess that @imply-cheddar means this:
   
   - `null` is always the value for dictionary id 0
   - the value for dictionary id N (where N > 0) is position N - 1 in the 
dictionary
   
   Seems fine, although in the interest of being able to share more code 
between the front-coded dictionary and legacy dictionary impl, it's ok to keep 
it the way you have it in the patch. It's closer to the legacy impl.



##########
processing/src/main/java/org/apache/druid/segment/column/StringFrontCodedDictionaryEncodedColumn.java:
##########
@@ -0,0 +1,598 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.base.Predicate;
+import com.google.common.base.Predicates;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.extraction.ExtractionFn;
+import org.apache.druid.query.filter.ValueMatcher;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.AbstractDimensionSelector;
+import org.apache.druid.segment.DimensionSelectorUtils;
+import org.apache.druid.segment.IdLookup;
+import org.apache.druid.segment.data.ColumnarInts;
+import org.apache.druid.segment.data.ColumnarMultiInts;
+import org.apache.druid.segment.data.FrontCodedIndexed;
+import org.apache.druid.segment.data.IndexedInts;
+import org.apache.druid.segment.data.ReadableOffset;
+import org.apache.druid.segment.data.SingleIndexedInt;
+import org.apache.druid.segment.filter.BooleanValueMatcher;
+import org.apache.druid.segment.historical.HistoricalDimensionSelector;
+import 
org.apache.druid.segment.historical.SingleValueHistoricalDimensionSelector;
+import org.apache.druid.segment.vector.MultiValueDimensionVectorSelector;
+import org.apache.druid.segment.vector.ReadableVectorInspector;
+import org.apache.druid.segment.vector.ReadableVectorOffset;
+import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+import org.apache.druid.segment.vector.VectorObjectSelector;
+import org.apache.druid.utils.CloseableUtils;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.BitSet;
+
+/**
+ * {@link DictionaryEncodedColumn<String>} for a column which uses a {@link 
FrontCodedIndexed} to store its value
+ * dictionary, which 'delta encodes' strings (instead of {@link 
org.apache.druid.segment.data.GenericIndexed} like
+ * {@link StringDictionaryEncodedColumn}).
+ *
+ * This class is otherwise nearly identical to {@link 
StringDictionaryEncodedColumn} other than the dictionary

Review Comment:
   Yeah, this is a pretty big class to have duplicated. It would be good to do 
the follow-up work to figure out how to consolidate them.
   
   For now, can you add a comment to StringDictionaryEncodedColumn that reminds 
people to make any changes here too?



##########
processing/src/main/java/org/apache/druid/segment/column/StringEncodingStrategies.java:
##########
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.data.DictionaryWriter;
+import org.apache.druid.segment.data.EncodedStringDictionaryWriter;
+import org.apache.druid.segment.data.FrontCodedIndexedWriter;
+import org.apache.druid.segment.data.GenericIndexed;
+import org.apache.druid.segment.data.GenericIndexedWriter;
+import org.apache.druid.segment.data.Indexed;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.util.Iterator;
+
+public class StringEncodingStrategies
+{
+  /**
+   * Creates a {@link DictionaryWriter} suited to the given {@link StringEncodingStrategy}.
+   *
+   * The {@code UTF8} strategy preserves the legacy on-disk format, writing a
+   * {@link GenericIndexed} directly; any other strategy is wrapped in an
+   * {@link EncodedStringDictionaryWriter} so a small header is written first.
+   */
+  public static DictionaryWriter<String> getStringDictionaryWriter(
+      StringEncodingStrategy encodingStrategy,
+      SegmentWriteOutMedium writeoutMedium,
+      String fileName
+  )
+  {
+    if (StringEncodingStrategy.UTF8.equals(encodingStrategy.getType())) {
+      // legacy format: plain utf8 strings written directly as a GenericIndexed
+      return new GenericIndexedWriter<>(writeoutMedium, fileName, GenericIndexed.STRING_STRATEGY);
+    }
+    // Non-legacy strategies are wrapped in an EncodedStringDictionaryWriter so that a small header
+    // is written before the dictionary itself: a version byte that should hopefully never conflict
+    // with a GenericIndexed version, followed by a byte from StringEncodingStrategy.getId that
+    // identifies which encoding strategy the dictionary uses.
+    final DictionaryWriter<byte[]> writer;
+    if (StringEncodingStrategy.FRONT_CODED.equals(encodingStrategy.getType())) {
+      writer = new FrontCodedIndexedWriter(
+          writeoutMedium,
+          ByteOrder.nativeOrder(),
+          ((StringEncodingStrategy.FrontCoded) encodingStrategy).getBucketSize()
+      );
+    } else {
+      throw new ISE("Unknown encoding strategy: %s", encodingStrategy.getType());
+    }
+    return new EncodedStringDictionaryWriter(writer, encodingStrategy);
+  }
+
+  /**
+   * Adapter to convert {@link Indexed<ByteBuffer>} with utf8 encoded bytes 
into {@link Indexed<String>} to be friendly

Review Comment:
   friendly (spelling)



##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends 
Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to switch the merging algorithm from 
doing binary-search per element in the value
+  // set to doing a sorted merge algorithm between value set and dictionary. 
The ratio here represents the ratio b/w
+  // the number of elements in value set and the number of elements in the 
dictionary. The number has been derived
+  // using benchmark in https://github.com/apache/druid/pull/13133. If the 
ratio is higher than the threshold, we use
+  // sorted merge instead of binary-search based algorithm.
+  private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
+  private static final int SIZE_WORTH_CHECKING_MIN = 8;
+  private static final Comparator<ByteBuffer> COMPARATOR = 
ByteBufferUtils.unsignedComparator();
+
+  private final BitmapFactory bitmapFactory;
+  private final TDictionary dictionary;
+  private final Indexed<ImmutableBitmap> bitmaps;
+
+  /**
+   * @param bitmapFactory used to create empty bitmaps when a value has no dictionary entry
+   * @param dictionary    utf8-encoded values; assumed sorted, since lookups rely on
+   *                      binary-search-style {@code indexOf} and on sorted-merge iteration
+   * @param bitmaps       bitmaps positionally aligned with the dictionary entries
+   */
+  public IndexedUtf8ValueSetIndex(
+      BitmapFactory bitmapFactory,
+      TDictionary dictionary,
+      Indexed<ImmutableBitmap> bitmaps
+  )
+  {
+    this.bitmapFactory = bitmapFactory;
+    this.dictionary = dictionary;
+    this.bitmaps = bitmaps;
+  }
+
+  /**
+   * Builds a {@link BitmapColumnIndex} matching a single (possibly null) value by looking the
+   * value up in the dictionary and returning the positionally-aligned bitmap.
+   */
+  @Override
+  public BitmapColumnIndex forValue(@Nullable String value)
+  {
+    return new SimpleBitmapColumnIndex()
+    {
+      @Override
+      public double estimateSelectivity(int totalRows)
+      {
+        final int matched = getBitmapForValue().size();
+        return Math.min(1, (double) matched / totalRows);
+      }
+
+      @Override
+      public <T> T computeBitmapResult(BitmapResultFactory<T> bitmapResultFactory)
+      {
+        return bitmapResultFactory.wrapDimensionValue(getBitmapForValue());
+      }
+
+      private ImmutableBitmap getBitmapForValue()
+      {
+        // a null value is looked up as a null buffer; anything else is converted to its utf8 bytes
+        final ByteBuffer utf8 = value == null ? null : ByteBuffer.wrap(StringUtils.toUtf8(value));
+        return getBitmap(dictionary.indexOf(utf8));
+      }
+    };
+  }
+
+  /**
+   * Builds a {@link BitmapColumnIndex} for a sorted set of String values by converting each value
+   * to its utf8 byte form (null stays null) and delegating to the utf8 path.
+   */
+  @Override
+  public BitmapColumnIndex forSortedValues(SortedSet<String> values)
+  {
+    final Iterable<ByteBuffer> valuesUtf8 = Iterables.transform(
+        values,
+        input -> input == null ? null : ByteBuffer.wrap(StringUtils.toUtf8(input))
+    );
+    return getBitmapColumnIndexForSortedIterableUtf8(valuesUtf8, values.size());
+  }
+
+  /**
+   * Builds a {@link BitmapColumnIndex} for a sorted set of utf8-encoded values.
+   *
+   * When the set is large enough to be worth the extra work, values that sort before the smallest
+   * dictionary entry are discarded up front via {@link SortedSet#tailSet}, since none of them can
+   * match anything in the column.
+   */
+  @Override
+  public BitmapColumnIndex forSortedValuesUtf8(SortedSet<ByteBuffer> valuesUtf8)
+  {
+    final SortedSet<ByteBuffer> tailSet =
+        valuesUtf8.size() >= SIZE_WORTH_CHECKING_MIN
+        ? valuesUtf8.tailSet(dictionary.get(0))
+        : valuesUtf8;
+    return getBitmapColumnIndexForSortedIterableUtf8(tailSet, tailSet.size());
+  }
+
+  /**
+   * Returns the bitmap stored at the given dictionary position, or an empty bitmap when the index
+   * is negative (value not found in the dictionary) or the stored bitmap is null.
+   */
+  private ImmutableBitmap getBitmap(int idx)
+  {
+    if (idx >= 0) {
+      final ImmutableBitmap bitmap = bitmaps.get(idx);
+      if (bitmap != null) {
+        return bitmap;
+      }
+    }
+    return bitmapFactory.makeEmptyImmutableBitmap();
+  }
+
+  /**
+   * Helper used by {@link #forSortedValues} and {@link #forSortedValuesUtf8}.
+   */
+  private BitmapColumnIndex 
getBitmapColumnIndexForSortedIterableUtf8(Iterable<ByteBuffer> valuesUtf8, int 
size)
+  {
+    // for large number of in-filter values in comparison to the dictionary 
size, use the sorted merge algorithm.
+    if (size > SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
+      return new SimpleImmutableBitmapIterableIndex()
+      {
+        @Override
+        public Iterable<ImmutableBitmap> getBitmapIterable()
+        {
+          return () -> new Iterator<ImmutableBitmap>()
+          {
+            final PeekingIterator<ByteBuffer> valuesIterator = 
Iterators.peekingIterator(valuesUtf8.iterator());
+            final PeekingIterator<ByteBuffer> dictionaryIterator = 
Iterators.peekingIterator(dictionary.iterator());
+            int next = -1;
+            int idx = 0;
+
+            @Override
+            public boolean hasNext()
+            {
+              if (next < 0) {
+                findNext();
+              }
+              return next >= 0;
+            }
+
+            @Override
+            public ImmutableBitmap next()
+            {
+              if (next < 0) {
+                findNext();
+                if (next < 0) {
+                  throw new NoSuchElementException();
+                }
+              }
+              final int swap = next;
+              next = -1;
+              return getBitmap(swap);
+            }
+
+            private void findNext()
+            {
+              while (next < 0 && valuesIterator.hasNext() && 
dictionaryIterator.hasNext()) {
+                final ByteBuffer nextValue = valuesIterator.peek();
+                final ByteBuffer nextDictionaryKey = dictionaryIterator.peek();
+                final int comparison = COMPARATOR.compare(nextValue, 
nextDictionaryKey);
+                if (comparison == 0) {
+                  next = idx;
+                  valuesIterator.next();
+                  break;
+                } else if (comparison < 0) {
+                  valuesIterator.next();
+                } else {
+                  dictionaryIterator.next();
+                  idx++;
+                }
+              }
+            }
+          };
+        }
+      };
+    }
+
+    // if the size of in-filter values is less than the threshold percentage 
of dictionary size, then use binary search
+    // based lookup per value. The algorithm works well for smaller number of 
values.
+    return new SimpleImmutableBitmapIterableIndex()
+    {
+      @Override
+      public Iterable<ImmutableBitmap> getBitmapIterable()
+      {
+        return () -> new Iterator<ImmutableBitmap>()
+        {
+          final int dictionarySize = dictionary.size();
+          final Iterator<ByteBuffer> iterator = valuesUtf8.iterator();
+          int next = -1;
+
+          @Override
+          public boolean hasNext()
+          {
+            if (next < 0) {
+              findNext();
+            }
+            return next >= 0;
+          }
+
+          @Override
+          public ImmutableBitmap next()
+          {
+            if (next < 0) {
+              findNext();
+              if (next < 0) {
+                throw new NoSuchElementException();
+              }
+            }
+            final int swap = next;
+            next = -1;
+            return getBitmap(swap);
+          }
+
+          private void findNext()
+          {
+            while (next < 0 && iterator.hasNext()) {
+              ByteBuffer nextValue = iterator.next();
+              next = dictionary.indexOf(nextValue);
+
+              if (next == -dictionarySize - 1) {
+                // nextValue is past the end of the dictionary.
+                // Note: we rely on indexOf returning (-(insertion point) - 1). Indexed does not
+                // guarantee this in general, so any dictionary used here must follow the
+                // binary-search convention (e.g. GenericIndexed singleThreaded() and
+                // FrontCodedIndexed both do).

Review Comment:
   Could you modify the comment to be accurate?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to