gianm commented on a change in pull request #10219:
URL: https://github.com/apache/druid/pull/10219#discussion_r466140881
##########
File path:
processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java
##########
@@ -100,6 +100,11 @@ public long numRows(Segment segment)
Map<String, ColumnAnalysis> columns = new TreeMap<>();
+ Function<String, ColumnCapabilities> adapterCapabilitesFn =
Review comment:
I realize you just copied this code from somewhere else, but is there a
way we can do this without `instanceof`? Maybe a new method on StorageAdapter
with some nice javadocs? This code is pretty brittle otherwise.
##########
File path:
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedColumnPartSerde.java
##########
@@ -329,6 +331,7 @@ public void read(ByteBuffer buffer, ColumnBuilder builder,
ColumnConfig columnCo
);
builder
.setHasMultipleValues(hasMultipleValues)
+ .setNullable(firstDictionaryEntry == null)
Review comment:
What if it's a multi-value column with no explicit nulls, but some empty
rows? Two questions in that scenario:
1. Should a column like that report "yes" or "no" for having nulls? I'm
guessing "yes" makes sense, because if you filter on `col is null` then that
should match empty rows. So it stands to reason that it has nulls in some
sense, even though you won't see them if you walk through the column.
2. What will firstDictionaryEntry be?
##########
File path:
processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java
##########
@@ -72,22 +74,35 @@ public static ColumnCapabilitiesImpl snapshot(@Nullable
final ColumnCapabilities
copy.hasMultipleValues =
copy.hasMultipleValues.coerceUnknownToBoolean(unknownIsTrue);
copy.dictionaryValuesSorted =
copy.dictionaryValuesSorted.coerceUnknownToBoolean(unknownIsTrue);
copy.dictionaryValuesUnique =
copy.dictionaryValuesUnique.coerceUnknownToBoolean(unknownIsTrue);
+ copy.nullable = copy.nullable.coerceUnknownToBoolean(unknownIsTrue);
Review comment:
With this addition, I'm wondering if the `snapshot` method still makes
sense. What reason is there that callers generally want to set all these
unknown things to the same boolean? They don't seem incredibly related. Maybe
we should have callers explicitly resolve all the unknowns in particular
directions.
##########
File path:
processing/src/main/java/org/apache/druid/segment/serde/DoubleNumericColumnPartSerdeV2.java
##########
@@ -152,13 +152,17 @@ public Deserializer getDeserializer()
buffer.position(initialPos + offset);
final ImmutableBitmap bitmap;
+ final boolean isNullable;
if (buffer.hasRemaining()) {
bitmap =
bitmapSerdeFactory.getObjectStrategy().fromByteBufferWithSize(buffer);
+ isNullable = bitmap.size() > 0;
Review comment:
It would be better to use `!bitmap.isEmpty()`, because `bitmap.size()`
isn't always cached and could potentially require walking the bitmap. Similar
comment for the other serde types.
##########
File path:
processing/src/main/java/org/apache/druid/query/metadata/SegmentAnalyzer.java
##########
@@ -192,6 +193,7 @@ private ColumnAnalysis analyzeNumericColumn(
return new ColumnAnalysis(
capabilities.getType().name(),
capabilities.hasMultipleValues().isTrue(),
+ capabilities.isNullable().isTrue(),
Review comment:
When would it be unknown if a column has nulls?
Will bad stuff happen if we return `false` when it actually does have nulls,
but for some reason it's marked as unknown?
##########
File path:
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndex.java
##########
@@ -685,6 +685,9 @@ IncrementalIndexRowResult toIncrementalIndexRow(InputRow
row)
}
dimsKeySize += indexer.estimateEncodedKeyComponentSize(dimsKey);
// Set column capabilities as data is coming in
+ if (dimsKey == null) {
+ capabilities.setIsNullable(NullHandling.sqlCompatible());
Review comment:
If it's string won't it still count as a null, even in
replace-with-default mode?
Btw, consider adding a `NullHandling.isNullable(ValueType)` method if seems
useful.
##########
File path:
processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilities.java
##########
@@ -38,6 +38,7 @@
boolean hasSpatialIndexes();
Capable hasMultipleValues();
boolean isFilterable();
+ Capable isNullable();
Review comment:
Broad comment: "hasNulls" would be a better name for this property,
because it's meant to mean whether or not the column really has nulls in it
right now.
"isNullable" will make people think of the SQL sense of the term, which is a
different concept: it means a column could have nulls potentially, even if it
doesn't right now.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]