[GitHub] [druid] clintropolis commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

GitBox Fri, 14 Oct 2022 03:27:32 -0700


clintropolis commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r995610335



##########
processing/src/main/java/org/apache/druid/segment/data/EncodedStringDictionaryWriter.java:
##########
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.data;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.io.smoosh.FileSmoosher;
+import org.apache.druid.segment.column.StringEncodingStrategy;
+
+import javax.annotation.Nullable;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.WritableByteChannel;
+
+public class EncodedStringDictionaryWriter implements DictionaryWriter<String>
+{
+  public static final byte VERSION = Byte.MAX_VALUE; // hopefully 
GenericIndexed never makes a version this high...

Review Comment:
   so, I did not actually introduce a new version of the whole string column 
format in this PR, instead opting to just carve out the ability to customize 
how we write and read the value dictionary, and re-using everything else in the 
format.
   
   To update the actual string column version, I think since version is also 
currently loaded with whether or not the int column part is compressed, I would 
either need to:
   * add two versions for compressed/uncompressed
   * make more dramatic modifications to the column part serde so that 
compression is not stored in the version byte
   * or just write a totally new string column part serde for v4 to cleanup 
some of this stuff (this might be best actually, but I'm unsure I want to do as 
part of this PR)
   
   When not using front-coding it was important to make sure we keep writing 
columns exactly the same so that clusters can roll back to older versions and 
still read their segments, so I was trying to minimize the amount of code 
needed to support this.
   
   The way I've done it here (assuming I actually add the code to handle 
`StringEncodingStrategy.UTF8_ID` as a `GenericIndexed` per your other 
[comment](https://github.com/apache/druid/pull/12277#discussion_r995420389)), 
then at some point in the future we could just start always writing 
`EncodedStringDictionaryWriter.VERSION` before the encoding id, and get by 
without having to change the actual whole column version, though we have to 
spend that byte which would effectively become static until we do.
   
   Though writing all of this out, I'm thinking a bit longer term its probably 
best to make a truly new version of the string column to try to clean some 
stuff up like decoupling column version from compression, etc, but I'd like to 
think a bit more about what else might be missing (index stuff in particular 
seems like could use some thought since we've opened up a lot of future 
possibilities with our recent refactoring in the area).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

Reply via email to