clintropolis commented on code in PR #12277: URL: https://github.com/apache/druid/pull/12277#discussion_r995610335
########## processing/src/main/java/org/apache/druid/segment/data/EncodedStringDictionaryWriter.java: ########## @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.druid.segment.data; + +import org.apache.druid.common.config.NullHandling; +import org.apache.druid.java.util.common.StringUtils; +import org.apache.druid.java.util.common.io.smoosh.FileSmoosher; +import org.apache.druid.segment.column.StringEncodingStrategy; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.WritableByteChannel; + +public class EncodedStringDictionaryWriter implements DictionaryWriter<String> +{ + public static final byte VERSION = Byte.MAX_VALUE; // hopefully GenericIndexed never makes a version this high... Review Comment: so, I did not actually introduce a new version of the whole string column format in this PR, instead opting to just carve out the ability to customize how we write and read the value dictionary, and re-using everything else in the format. To update the actual string column version, I think since version is also currently loaded with whether or not the int column part is compressed, I would either need to: * add two versions for compressed/uncompressed * make more dramatic modifications to the column part serde so that compression is not stored in the version byte * or just write a totally new string column part serde for v4 to cleanup some of this stuff (this might be best actually, but I'm unsure I want to do as part of this PR) When not using front-coding it was important to make sure we keep writing columns exactly the same so that clusters can roll back to older versions and still read their segments, so I was trying to minimize the amount of code needed to support this. The way I've done it here (assuming I actually add the code to handle `StringEncodingStrategy.UTF8_ID` as a `GenericIndexed` per your other [comment](https://github.com/apache/druid/pull/12277#discussion_r995420389)), then at some point in the future we could just start always writing `EncodedStringDictionaryWriter.VERSION` before the encoding id, and get by without having to change the actual whole column version, though we have to spend that byte which would effectively become static until we do. Though writing all of this out, I'm thinking a bit longer term its probably best to make a truly new version of the string column to try to clean some stuff up like decoupling column version from compression, etc, but I'd like to think a bit more about what else might be missing (index stuff in particular seems like could use some thought since we've opened up a lot of future possibilities with our recent refactoring in the area). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
