[GitHub] [druid] imply-cheddar commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

GitBox Fri, 14 Oct 2022 00:17:00 -0700


imply-cheddar commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r995391240



##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends 
Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to swtich the merging algorithm from 
doing binary-search per element in the value
+  // set to doing a sorted merge algorithm between value set and dictionary. 
The ratio here represents the ratio b/w
+  // the number of elements in value set and the number of elements in the 
dictionary. The number has been derived
+  // using benchmark in https://github.com/apache/druid/pull/13133. If the 
ratio is higher than the threshold, we use
+  // sorted merge instead of binary-search based algorithm.
+  private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
+  private static final int SIZE_WORTH_CHECKING_MIN = 8;
+  private static final Comparator<ByteBuffer> COMPARATOR = 
ByteBufferUtils.unsignedComparator();
+
+  private final BitmapFactory bitmapFactory;
+  private final TDictionary dictionary;
+  private final Indexed<ImmutableBitmap> bitmaps;
+
+  public IndexedUtf8ValueSetIndex(
+      BitmapFactory bitmapFactory,
+      TDictionary dictionary,
+      Indexed<ImmutableBitmap> bitmaps
+  )
+  {
+    this.bitmapFactory = bitmapFactory;
+    this.dictionary = dictionary;
+    this.bitmaps = bitmaps;
+  }
+
+  @Override
+  public BitmapColumnIndex forValue(@Nullable String value)
+  {
+    return new SimpleBitmapColumnIndex()
+    {
+      @Override
+      public double estimateSelectivity(int totalRows)
+      {
+        return Math.min(1, (double) getBitmapForValue().size() / totalRows);
+      }
+
+      @Override
+      public <T> T computeBitmapResult(BitmapResultFactory<T> 
bitmapResultFactory)
+      {
+
+        return bitmapResultFactory.wrapDimensionValue(getBitmapForValue());
+      }
+
+      private ImmutableBitmap getBitmapForValue()
+      {
+        final ByteBuffer valueUtf8 = value == null ? null : 
ByteBuffer.wrap(StringUtils.toUtf8(value));

Review Comment:
   Do we ever call `.forValue(String)` and then throw away the result?  If not, 
it should be safe to move this to line 69 and then close over it instead.  That 
way we don't convert to utf8 bytes multiple times (right now, it will happen on 
every call to `estimateSelectivity` and `computeBitmapResult`)



##########
processing/src/main/java/org/apache/druid/segment/column/IndexedUtf8ValueSetIndex.java:
##########
@@ -0,0 +1,249 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
+import com.google.common.collect.PeekingIterator;
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.ImmutableBitmap;
+import org.apache.druid.java.util.common.ByteBufferUtils;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.query.BitmapResultFactory;
+import org.apache.druid.segment.data.Indexed;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+import java.util.Iterator;
+import java.util.NoSuchElementException;
+import java.util.SortedSet;
+
+public final class IndexedUtf8ValueSetIndex<TDictionary extends 
Indexed<ByteBuffer>>
+    implements StringValueSetIndex, Utf8ValueSetIndex
+{
+  // This determines the cut-off point to swtich the merging algorithm from 
doing binary-search per element in the value
+  // set to doing a sorted merge algorithm between value set and dictionary. 
The ratio here represents the ratio b/w
+  // the number of elements in value set and the number of elements in the 
dictionary. The number has been derived
+  // using benchmark in https://github.com/apache/druid/pull/13133. If the 
ratio is higher than the threshold, we use
+  // sorted merge instead of binary-search based algorithm.
+  private static final double SORTED_MERGE_RATIO_THRESHOLD = 0.12D;
+  private static final int SIZE_WORTH_CHECKING_MIN = 8;
+  private static final Comparator<ByteBuffer> COMPARATOR = 
ByteBufferUtils.unsignedComparator();
+
+  private final BitmapFactory bitmapFactory;
+  private final TDictionary dictionary;
+  private final Indexed<ImmutableBitmap> bitmaps;
+
+  public IndexedUtf8ValueSetIndex(
+      BitmapFactory bitmapFactory,
+      TDictionary dictionary,
+      Indexed<ImmutableBitmap> bitmaps
+  )
+  {
+    this.bitmapFactory = bitmapFactory;
+    this.dictionary = dictionary;
+    this.bitmaps = bitmaps;
+  }
+
+  @Override
+  public BitmapColumnIndex forValue(@Nullable String value)
+  {
+    return new SimpleBitmapColumnIndex()
+    {
+      @Override
+      public double estimateSelectivity(int totalRows)
+      {
+        return Math.min(1, (double) getBitmapForValue().size() / totalRows);
+      }
+
+      @Override
+      public <T> T computeBitmapResult(BitmapResultFactory<T> 
bitmapResultFactory)
+      {
+
+        return bitmapResultFactory.wrapDimensionValue(getBitmapForValue());
+      }
+
+      private ImmutableBitmap getBitmapForValue()
+      {
+        final ByteBuffer valueUtf8 = value == null ? null : 
ByteBuffer.wrap(StringUtils.toUtf8(value));
+        final int idx = dictionary.indexOf(valueUtf8);
+        return getBitmap(idx);
+      }
+    };
+  }
+
+  @Override
+  public BitmapColumnIndex forSortedValues(SortedSet<String> values)
+  {
+    return getBitmapColumnIndexForSortedIterableUtf8(
+        Iterables.transform(
+            values,
+            input -> input != null ? 
ByteBuffer.wrap(StringUtils.toUtf8(input)) : null
+        ),
+        values.size()
+    );
+  }
+
+  @Override
+  public BitmapColumnIndex forSortedValuesUtf8(SortedSet<ByteBuffer> 
valuesUtf8)
+  {
+    final SortedSet<ByteBuffer> tailSet;
+
+    if (valuesUtf8.size() >= SIZE_WORTH_CHECKING_MIN) {
+      final ByteBuffer minValueInColumn = dictionary.get(0);
+      tailSet = valuesUtf8.tailSet(minValueInColumn);
+    } else {
+      tailSet = valuesUtf8;
+    }
+
+    return getBitmapColumnIndexForSortedIterableUtf8(tailSet, tailSet.size());
+  }
+
+  private ImmutableBitmap getBitmap(int idx)
+  {
+    if (idx < 0) {
+      return bitmapFactory.makeEmptyImmutableBitmap();
+    }
+
+    final ImmutableBitmap bitmap = bitmaps.get(idx);
+    return bitmap == null ? bitmapFactory.makeEmptyImmutableBitmap() : bitmap;
+  }
+
+  /**
+   * Helper used by {@link #forSortedValues} and {@link #forSortedValuesUtf8}.
+   */
+  private BitmapColumnIndex 
getBitmapColumnIndexForSortedIterableUtf8(Iterable<ByteBuffer> valuesUtf8, int 
size)
+  {
+    // for large number of in-filter values in comparison to the dictionary 
size, use the sorted merge algorithm.
+    if (size > SORTED_MERGE_RATIO_THRESHOLD * dictionary.size()) {
+      return new SimpleImmutableBitmapIterableIndex()
+      {
+        @Override
+        public Iterable<ImmutableBitmap> getBitmapIterable()
+        {
+          return () -> new Iterator<ImmutableBitmap>()
+          {
+            final PeekingIterator<ByteBuffer> valuesIterator = 
Iterators.peekingIterator(valuesUtf8.iterator());
+            final PeekingIterator<ByteBuffer> dictionaryIterator = 
Iterators.peekingIterator(dictionary.iterator());
+            int next = -1;
+            int idx = 0;
+
+            @Override
+            public boolean hasNext()
+            {
+              if (next < 0) {
+                findNext();
+              }
+              return next >= 0;
+            }
+
+            @Override
+            public ImmutableBitmap next()
+            {
+              if (next < 0) {
+                findNext();
+                if (next < 0) {
+                  throw new NoSuchElementException();
+                }
+              }
+              final int swap = next;
+              next = -1;
+              return getBitmap(swap);
+            }
+
+            private void findNext()
+            {
+              while (next < 0 && valuesIterator.hasNext() && 
dictionaryIterator.hasNext()) {
+                final ByteBuffer nextValue = valuesIterator.peek();
+                final ByteBuffer nextDictionaryKey = dictionaryIterator.peek();
+                final int comparison = COMPARATOR.compare(nextValue, 
nextDictionaryKey);
+                if (comparison == 0) {
+                  next = idx;
+                  valuesIterator.next();
+                  break;
+                } else if (comparison < 0) {
+                  valuesIterator.next();
+                } else {
+                  dictionaryIterator.next();
+                  idx++;
+                }
+              }
+            }
+          };
+        }
+      };
+    }
+
+    // if the size of in-filter values is less than the threshold percentage 
of dictionary size, then use binary search
+    // based lookup per value. The algorithm works well for smaller number of 
values.
+    return new SimpleImmutableBitmapIterableIndex()
+    {
+      @Override
+      public Iterable<ImmutableBitmap> getBitmapIterable()
+      {
+        return () -> new Iterator<ImmutableBitmap>()
+        {
+          final int dictionarySize = dictionary.size();
+          final Iterator<ByteBuffer> iterator = valuesUtf8.iterator();
+          int next = -1;
+
+          @Override
+          public boolean hasNext()
+          {
+            if (next < 0) {
+              findNext();
+            }
+            return next >= 0;
+          }
+
+          @Override
+          public ImmutableBitmap next()
+          {
+            if (next < 0) {
+              findNext();
+              if (next < 0) {
+                throw new NoSuchElementException();
+              }
+            }
+            final int swap = next;
+            next = -1;
+            return getBitmap(swap);
+          }
+
+          private void findNext()
+          {
+            while (next < 0 && iterator.hasNext()) {
+              ByteBuffer nextValue = iterator.next();
+              next = dictionary.indexOf(nextValue);
+
+              if (next == -dictionarySize - 1) {
+                // nextValue is past the end of the dictionary.
+                // Note: we can rely on indexOf returning (-(insertion point) 
- 1), even though Indexed doesn't
+                // guarantee it, because "dictionary" comes from 
GenericIndexed singleThreaded().

Review Comment:
   I question if this comment is still accurate.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexedWriter.java:
##########
@@ -0,0 +1,296 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.primitives.Ints;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.io.Channels;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+import org.apache.druid.segment.writeout.WriteOutBytes;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.WritableByteChannel;
+
+
+/**
+ * {@link DictionaryWriter} for a {@link FrontCodedIndexed}, written to a 
{@link SegmentWriteOutMedium}. Values MUST
+ * be added to this dictionary writer in sorted order, which is enforced.
+ *
+ * Front coding is a type of delta encoding for byte arrays, where values are 
grouped into buckets. The first value of
+ * the bucket is written entirely, and remaining values are stored as pairs of 
an integer which indicates how much
+ * of the first byte array of the bucket to use as a prefix, followed by the 
remaining value bytes after the prefix.
+ *
+ * This is valid to use for any values which can be compared byte by byte with 
unsigned comparison. Otherwise, this
+ * is not the collection for you.
+ */
+public class FrontCodedIndexedWriter implements DictionaryWriter<byte[]>
+{
+  private static final int MAX_LOG_BUFFER_SIZE = 26;
+  private final SegmentWriteOutMedium segmentWriteOutMedium;
+  private final int bucketSize;
+  private final ByteOrder byteOrder;
+  private final byte[][] bucketBuffer;
+  @Nullable
+  private byte[] prevObject = null;
+  @Nullable
+  private WriteOutBytes headerOut = null;
+  @Nullable
+  private WriteOutBytes valuesOut = null;
+  private int numWritten = 0;
+  private ByteBuffer scratch;
+  private int logScratchSize = 10;
+  private boolean isClosed = false;
+  private boolean hasNulls = false;
+
+
+  public FrontCodedIndexedWriter(
+      SegmentWriteOutMedium segmentWriteOutMedium,
+      ByteOrder byteOrder,
+      int bucketSize
+  )
+  {
+    this.segmentWriteOutMedium = segmentWriteOutMedium;
+    this.scratch = ByteBuffer.allocate(1 << logScratchSize).order(byteOrder);
+    this.bucketSize = bucketSize;
+    this.byteOrder = byteOrder;
+    this.bucketBuffer = new byte[bucketSize][];
+  }
+
+  @Override
+  public void open() throws IOException
+  {
+    headerOut = segmentWriteOutMedium.makeWriteOutBytes();
+    valuesOut = segmentWriteOutMedium.makeWriteOutBytes();
+  }
+
+  @Override
+  public void write(@Nullable byte[] value) throws IOException
+  {
+    if (prevObject != null && unsignedCompare(prevObject, value) >= 0) {
+      throw new ISE(
+          "Values must be sorted and unique. Element [%s] with value [%s] is 
before or equivalent to [%s]",
+          numWritten,
+          value == null ? null : StringUtils.fromUtf8(value),
+          StringUtils.fromUtf8(prevObject)
+      );
+    }
+
+    if (value == null) {
+      hasNulls = true;
+      return;
+    }
+
+    // if the bucket buffer is full, write the bucket
+    if (numWritten > 0 && (numWritten % bucketSize) == 0) {

Review Comment:
   I don't know that it really matters, but it looks like you are buffering a 
bucket and then writing the whole bucket out.  I don't think that's strictly 
necessary, is it?  That is, I think we can do a completely streaming write?  Or 
is there some state that you need to have which you can only get once the 
bucket is full?  I think even the offset written into the header could be 
figured out in a streaming fashion?
   
   I don't think it really matters, so I'm not actually asking you to change 
this.  Just wanting to validate my understanding.



##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedColumnPartSerde.java:
##########
@@ -305,6 +311,33 @@ public void read(ByteBuffer buffer, ColumnBuilder builder, 
ColumnConfig columnCo
 
         builder.setType(ValueType.STRING);
 
+        final int dictionaryStartPosition = buffer.position();
+        final byte dictionaryVersion = buffer.get();
+
+        if (dictionaryVersion == EncodedStringDictionaryWriter.VERSION) {
+          final byte encodingId = buffer.get();
+          if (encodingId == StringEncodingStrategy.FRONT_CODED_ID) {
+            readFrontCodedColumn(buffer, builder, rVersion, rFlags, 
hasMultipleValues);
+          } else {
+            throw new ISE("impossible, unknown encoding strategy id: %s", 
encodingId);

Review Comment:
   We should likely also support an ID for the GenericIndexed based approach so 
that in a future release we can protect from GenericIndexed getting a new 
version number by persisting those columns as just another encoding here.



##########
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIndexedWriter.java:
##########
@@ -0,0 +1,296 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import com.google.common.primitives.Ints;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.io.Channels;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+import org.apache.druid.segment.writeout.WriteOutBytes;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.WritableByteChannel;
+
+
+/**
+ * {@link DictionaryWriter} for a {@link FrontCodedIndexed}, written to a 
{@link SegmentWriteOutMedium}. Values MUST
+ * be added to this dictionary writer in sorted order, which is enforced.
+ *
+ * Front coding is a type of delta encoding for byte arrays, where values are 
grouped into buckets. The first value of
+ * the bucket is written entirely, and remaining values are stored as pairs of 
an integer which indicates how much
+ * of the first byte array of the bucket to use as a prefix, followed by the 
remaining value bytes after the prefix.
+ *
+ * This is valid to use for any values which can be compared byte by byte with 
unsigned comparison. Otherwise, this
+ * is not the collection for you.
+ */
+public class FrontCodedIndexedWriter implements DictionaryWriter<byte[]>
+{
+  private static final int MAX_LOG_BUFFER_SIZE = 26;
+  private final SegmentWriteOutMedium segmentWriteOutMedium;
+  private final int bucketSize;
+  private final ByteOrder byteOrder;
+  private final byte[][] bucketBuffer;
+  @Nullable
+  private byte[] prevObject = null;
+  @Nullable
+  private WriteOutBytes headerOut = null;
+  @Nullable
+  private WriteOutBytes valuesOut = null;
+  private int numWritten = 0;
+  private ByteBuffer scratch;
+  private int logScratchSize = 10;
+  private boolean isClosed = false;
+  private boolean hasNulls = false;
+
+
+  public FrontCodedIndexedWriter(
+      SegmentWriteOutMedium segmentWriteOutMedium,
+      ByteOrder byteOrder,
+      int bucketSize
+  )
+  {
+    this.segmentWriteOutMedium = segmentWriteOutMedium;
+    this.scratch = ByteBuffer.allocate(1 << logScratchSize).order(byteOrder);
+    this.bucketSize = bucketSize;
+    this.byteOrder = byteOrder;
+    this.bucketBuffer = new byte[bucketSize][];
+  }
+
+  @Override
+  public void open() throws IOException
+  {
+    headerOut = segmentWriteOutMedium.makeWriteOutBytes();
+    valuesOut = segmentWriteOutMedium.makeWriteOutBytes();
+  }
+
+  @Override
+  public void write(@Nullable byte[] value) throws IOException
+  {
+    if (prevObject != null && unsignedCompare(prevObject, value) >= 0) {
+      throw new ISE(
+          "Values must be sorted and unique. Element [%s] with value [%s] is 
before or equivalent to [%s]",
+          numWritten,
+          value == null ? null : StringUtils.fromUtf8(value),
+          StringUtils.fromUtf8(prevObject)
+      );
+    }
+
+    if (value == null) {
+      hasNulls = true;
+      return;
+    }
+
+    // if the bucket buffer is full, write the bucket
+    if (numWritten > 0 && (numWritten % bucketSize) == 0) {
+      resetScratch();
+      int written;
+      // write the bucket, growing scratch buffer as necessary
+      do {
+        written = writeBucket(scratch, bucketBuffer, bucketSize);
+        if (written < 0) {
+          growScratch();
+        }
+      } while (written < 0);
+      scratch.flip();
+      Channels.writeFully(valuesOut, scratch);
+
+      resetScratch();
+      // write end offset for current value
+      scratch.putInt((int) valuesOut.size());
+      scratch.flip();
+      Channels.writeFully(headerOut, scratch);
+    }
+
+    bucketBuffer[numWritten % bucketSize] = value;
+
+    ++numWritten;
+    prevObject = value;
+  }
+
+
+  @Override
+  public long getSerializedSize() throws IOException
+  {
+    if (!isClosed) {
+      flush();
+    }
+    int headerAndValues = Ints.checkedCast(headerOut.size() + 
valuesOut.size());
+    return Byte.BYTES +
+           Byte.BYTES +
+           Byte.BYTES +
+           VByte.estimateIntSize(numWritten) +
+           VByte.estimateIntSize(headerAndValues) +
+           headerAndValues;

Review Comment:
   I get scared of estimates of size, they are wrong sometimes.  You could 
alternatively build the header bytes (the 3 bytes and 2 VBytes) and then look 
at how many bytes it used to get at the size of the header, yes?



##########
processing/src/main/java/org/apache/druid/segment/data/EncodedStringDictionaryWriter.java:
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.column.StringEncodingStrategy;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.WritableByteChannel;
+
+public class EncodedStringDictionaryWriter implements DictionaryWriter<String>
+{
+  public static final byte VERSION = Byte.MAX_VALUE; // hopefully 
GenericIndexed never makes a version this high...

Review Comment:
   Why does it matter?  Even if GenericIndexed does, the thing that matters is 
that the version of the *column* itself never collides.  In order to ensure 
that, we should really start persisting all of the columns with the newly 
incremented version (even if they are still set to persist with 
GenericIndexed), but the downside to that is that it doesn't allow for 
rollback.  So, instead the new version can hopefully still persist a 
GenericIndexed version, we continue to make things with the old version in case 
we are in a state where we might want rollback and then in some future version, 
we should just never persist a String column without adding some 
String-column-specific version id.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] imply-cheddar commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

Reply via email to