[GitHub] [druid] cheddar commented on a change in pull request #11888: add 'TypeStrategy' to types

GitBox Mon, 08 Nov 2021 19:10:10 -0800


cheddar commented on a change in pull request #11888:
URL: https://github.com/apache/druid/pull/11888#discussion_r745255446




##########
File path: core/src/main/java/org/apache/druid/segment/column/TypeStrategy.java
##########
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.common.config.NullHandling;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+import java.util.Comparator;
+
+/**
+ * TypeStrategy provides value comparison and binary serialization for Druid types. This can be obtained for ANY Druid
+ * type via {@link TypeSignature#getStrategy()}.
+ *
+ * Implementations of this mechanism support writing both null and non-null values. When using the 'nullable' family
+ * of the read and write methods, values are stored such that the leading byte contains either
+ * {@link NullHandling#IS_NULL_BYTE} or {@link NullHandling#IS_NOT_NULL_BYTE} as appropriate. The default
+ * implementations of these methods use masking to check the null bit, so flags may be used in the upper bits of the
+ * null byte.
+ *
+ * This mechanism provides both methods that use the natural {@link ByteBuffer#position()} and modify the underlying
+ * position as they operate, and random access reads at specific offsets, which do not modify the underlying
+ * position. If a method accepts an offset parameter, it does not modify the position; if not, it does.
+ *
+ * The only methods implementors are required to provide are {@link #read(ByteBuffer)},
+ * {@link #write(ByteBuffer, Object)} and {@link #estimateSizeBytes(Object)}, the rest provide default implementations
+ * which set the null/not null byte, and reset buffer positions as appropriate, but may be overridden if a more
+ * optimized implementation is needed.
+ */
+public interface TypeStrategy<T> extends Comparator<T>
+{
+  /**
+   * The size in bytes that writing this value to memory would require, useful for constraining the values maximum size
+   *
+   * This does not include the null byte, use {@link #estimateSizeBytesNullable(Object)} instead.
+   */
+  int estimateSizeBytes(@Nullable T value);
+
+  /**
+   * The size in bytes that writing this value to memory would require, including the null byte, useful for constraining
+   * the values maximum size. If the value is null, the size will be {@link Byte#BYTES}, otherwise it will be
+   * {@link Byte#BYTES} + {@link #estimateSizeBytes(Object)}
+   */
+  default int estimateSizeBytesNullable(@Nullable T value)

Review comment:
   I'm not sure about handing the null handling over to the implementation
here.  The strategy as I understand it is using a whole byte for whether there
is a null value.  A byte for whether it is null consumes 8x the space it
really needs; this can actually add up in unfortunate ways when there are lots
of columns in the result set.
   
   For example, a "better" implementation would be to start every row with a 
bitmap of all of the columns that are null.  This would consume only a single 
bit per column rather than a byte per column and, if we were so inclined, could 
also be properly padded to try to enforce word-alignment.
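
   As a rough sketch of the bitmap idea (not code from this PR): the class
`RowNullBitmap` and its method names below are hypothetical, and the layout
assumes the bitmap sits at the start of each row and is padded to a multiple of
8 bytes. With 20 columns, a byte per column costs 20 bytes per row, while the
padded bitmap costs 8.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical helper, not part of Druid: keeps one null bit per column in a
 * bitmap stored at the start of each row.
 */
final class RowNullBitmap
{
  private RowNullBitmap() {}

  /** Bytes needed for numColumns bits, rounded up to whole 8-byte words. */
  static int sizeBytes(int numColumns)
  {
    int words = (numColumns + 63) / 64;
    return words * Long.BYTES;
  }

  /** Marks the given column as null; absolute access, does not move the buffer position. */
  static void setNull(ByteBuffer buffer, int rowOffset, int column)
  {
    int index = rowOffset + (column >>> 3);
    buffer.put(index, (byte) (buffer.get(index) | (1 << (column & 7))));
  }

  /** Returns true if the given column's null bit is set. */
  static boolean isNull(ByteBuffer buffer, int rowOffset, int column)
  {
    return (buffer.get(rowOffset + (column >>> 3)) & (1 << (column & 7))) != 0;
  }
}
```

   The values themselves would then be written after the bitmap with no
per-value null byte.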
   
   Having an implementation that does it with a whole byte as a stepping stone
is maybe okay.  But, with the way this interface is built, the interface is
forcing rather sub-optimal null-handling on the query engines that use this
interface.  I think it would be better to perhaps declare that a size of `-1`
means that the value is equivalent to `null` and have whatever sits outside
this interface do something meaningful with that knowledge.
   
   We would likely also need to adjust the signature of `write` to return an
`int` to indicate the number of bytes written (once again, returning a `-1` for
"it was null").
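
   A minimal sketch of that contract, under stated assumptions: the names
`SizedTypeStrategy`, `NULL_SIZE`, `LongSizedStrategy` and `RowWriter` are
hypothetical and only illustrate the `-1` convention; they are not the
interface in this PR. The caller here records nulls in a `java.util.BitSet`,
but any external structure would do.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

/** Hypothetical variant of the interface: the strategy never writes a null byte itself. */
interface SizedTypeStrategy<T>
{
  int NULL_SIZE = -1;

  /** Bytes required to write the value, or NULL_SIZE if the value is null. */
  int estimateSizeBytes(T value);

  /** Writes at the buffer position; returns bytes written, or NULL_SIZE (writing nothing) for null. */
  int write(ByteBuffer buffer, T value);
}

/** Example strategy for long values. */
final class LongSizedStrategy implements SizedTypeStrategy<Long>
{
  @Override
  public int estimateSizeBytes(Long value)
  {
    return value == null ? NULL_SIZE : Long.BYTES;
  }

  @Override
  public int write(ByteBuffer buffer, Long value)
  {
    if (value == null) {
      return NULL_SIZE;
    }
    buffer.putLong(value);
    return Long.BYTES;
  }
}

/** Example caller: decides for itself how to record which values were null. */
final class RowWriter
{
  private RowWriter() {}

  static BitSet writeRow(ByteBuffer buffer, SizedTypeStrategy<Long> strategy, Long[] values)
  {
    BitSet nulls = new BitSet(values.length);
    for (int i = 0; i < values.length; i++) {
      if (strategy.write(buffer, values[i]) == SizedTypeStrategy.NULL_SIZE) {
        nulls.set(i);
      }
    }
    return nulls;
  }
}
```

   With this shape the null representation is the query engine's choice, and
the byte-per-value layout becomes just one possible caller.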



