imply-cheddar commented on code in PR #14014:
URL: https://github.com/apache/druid/pull/14014#discussion_r1155417652
##########
processing/src/main/java/org/apache/druid/data/input/impl/DimensionSchema.java:
##########
@@ -47,7 +51,9 @@
   @JsonSubTypes.Type(name = DimensionSchema.FLOAT_TYPE_NAME, value = FloatDimensionSchema.class),
   @JsonSubTypes.Type(name = DimensionSchema.DOUBLE_TYPE_NAME, value = DoubleDimensionSchema.class),
   @JsonSubTypes.Type(name = DimensionSchema.SPATIAL_TYPE_NAME, value = NewSpatialDimensionSchema.class),
-  @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class)
+  @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class),
+  @JsonSubTypes.Type(name = StandardTypeColumnSchema.TYPE, value = StandardTypeColumnSchema.class),
+  @JsonSubTypes.Type(name = "auto", value = StandardTypeColumnSchema.class)
Review Comment:
Why do we need this added as well? Shouldn't anything that is using this be
able to use `StandardTypeColumnSchema.TYPE` when it persists such that we don't
need to add 2 names for the same thing?
##########
processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java:
##########
@@ -175,6 +182,18 @@ EncodedKeyComponent<EncodedKeyComponentType> processRowValsToUnsortedEncodedKeyC
   CloseableIndexed<ActualType> getSortedIndexedValues();
+  default SortedValueDictionary getSortedValueLookups()
+  {
+    throw new UnsupportedOperationException("Column does not support value dictionaries.");
+  }
+
+  default void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
+  {
+    mergedFields.put(
+        NestedPathFinder.JSON_PATH_ROOT,
+        new FieldTypeInfo.MutableTypeSet().add(getColumnCapabilities().toColumnType())
+    );
+  }
Review Comment:
This feels very nested-column-specific, but it's a change on the
`DimensionIndexer` interface. Does it really need to be here? Can't
`DimensionIndexer` instances make sure that they are all the same type and then
use concrete methods instead of leaking something like this on the interface?
##########
processing/src/main/java/org/apache/druid/segment/QueryableIndexIndexableAdapter.java:
##########
@@ -155,6 +160,37 @@ public void close() throws IOException
     };
   }
+  @Override
+  public SortedValueDictionary getSortedValueLookup(
+      String dimension,
+      SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields
+  )
+  {
+    final ColumnHolder columnHolder = input.getColumnHolder(dimension);
+
+    if (columnHolder == null) {
+      return null;
+    }
+    if (!(columnHolder.getColumnFormat() instanceof StandardTypeColumn.Format)) {
+      return null;
+    }
+
+    final BaseColumn col = columnHolder.getColumn();
+    if (col instanceof StandardTypeColumn) {
+      StandardTypeColumn column = (StandardTypeColumn) col;
+      column.mergeNestedFields(mergedFields);
+      return new SortedValueDictionary(
+          column.getStringDictionary(),
+          column.getLongDictionary(),
+          column.getDoubleDictionary(),
+          column.getArrayDictionary(),
+          column
+      );
+    }
+    // leaky leaky but this shouldn't be able to happen because of the format check...
+    return null;
+  }
Review Comment:
Seeing the implementation of this method here feels really weird to me.
There's a comment about potentially leaking resources and an instanceof check
that is suspect...
It seems like it would be a lot cleaner if we could ask the
`IndexableAdapter` for something that is effectively a column, like a
`DimensionHandler` instance, then "merge" those instances and have that thing
give us a value dictionary if we need it. I think that if we do this, though,
we will find that the value dictionary no longer even needs to be exposed
outside of the concrete classes.
##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -263,19 +191,19 @@ private File makeIndexFiles(
     log.debug("Completed factory.json in %,d millis", System.currentTimeMillis() - startTime);
     progress.progress();
-    final Map<String, TypeSignature<ValueType>> metricTypes = new TreeMap<>(Comparators.naturalNullsFirst());
-    final List<ColumnCapabilities> dimCapabilities = Lists.newArrayListWithCapacity(mergedDimensions.size());
-    mergeCapabilities(adapters, mergedDimensions, metricTypes, dimCapabilities);
+    final Map<String, ColumnFormat> metricTypes = new TreeMap<>(Comparators.naturalNullsFirst());
+    final List<ColumnFormat> dimFormats = Lists.newArrayListWithCapacity(mergedDimensions.size());
+    mergeFormat(adapters, mergedDimensions, metricTypes, dimFormats);
-    final Map<String, DimensionHandler> handlers = makeDimensionHandlers(mergedDimensions, dimCapabilities);
+    final Map<String, DimensionHandler> handlers = makeDimensionHandlers(mergedDimensions, dimFormats);
     final List<DimensionMergerV9> mergers = new ArrayList<>();
     for (int i = 0; i < mergedDimensions.size(); i++) {
       DimensionHandler handler = handlers.get(mergedDimensions.get(i));
       mergers.add(
           handler.makeMerger(
               indexSpec,
               segmentWriteOutMedium,
-              dimCapabilities.get(i),
+              dimFormats.get(i).toColumnCapabilities(),
               progress,
               closer
           )
Review Comment:
The `handler` came from the `dimFormat`, right? Why would it need the
capabilities from `dimFormat`? Can we not eliminate the argument entirely?
##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnIndexer.java:
##########
@@ -0,0 +1,612 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.MutableBitmap;
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.UOE;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExpressionType;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnCapabilitiesImpl;
+import org.apache.druid.segment.column.ColumnFormat;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.data.CloseableIndexed;
+import org.apache.druid.segment.incremental.IncrementalIndex;
+import org.apache.druid.segment.incremental.IncrementalIndexRowHolder;
+import org.apache.druid.segment.nested.FieldTypeInfo;
+import org.apache.druid.segment.nested.NestedPathFinder;
+import org.apache.druid.segment.nested.NestedPathPart;
+import org.apache.druid.segment.nested.SortedValueDictionary;
+import org.apache.druid.segment.nested.StructuredData;
+import org.apache.druid.segment.nested.StructuredDataProcessor;
+import org.apache.druid.segment.nested.ValueDictionary;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.SortedMap;
+import java.util.TreeMap;
+
+public class StandardTypeColumnIndexer implements DimensionIndexer<StructuredData, StructuredData, StructuredData>
Review Comment:
Is this basically just the nested column that understands arrays? If so,
can we call it just NestedV2 or something like that instead of "standard"?
"Standard" is not a great name: the moment we come up with some other idea,
we will have this "standard" thing lying around that we don't commonly use.
Much better to name it based on what it actually does.
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnBuilder.java:
##########
@@ -116,6 +116,13 @@ public ColumnBuilder setNumericColumnSupplier(Supplier<? extends NumericColumn>
     return this;
   }
+  public ColumnBuilder setStandardTypeColumnSupplier(Supplier<? extends StandardTypeColumn> columnSupplier)
+  {
+    checkColumnSupplierNotSet();
+    this.columnSupplier = columnSupplier;
+    return this;
+  }
Review Comment:
Why is this special? Isn't it just another format?
##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -168,6 +170,43 @@ public Iterator<T> iterator()
return IndexedIterable.create(this).iterator();
}
+
+
+ public IntIntPair getRange(
+ @Nullable T startValue,
+ boolean startStrict,
Review Comment:
Instead of strict/non-strict, I think `inclusive` or `exclusive` would be a
more common naming for the intent that I believe this and `endStrict` represent.
##########
processing/src/main/java/org/apache/druid/segment/data/Indexed.java:
##########
@@ -36,6 +39,43 @@
@PublicApi
public interface Indexed<T> extends Iterable<T>, HotLoopCallee
{
+ static <T> Indexed<T> empty()
+ {
+ return new Indexed<T>()
+ {
+ @Override
+ public int size()
+ {
+ return 0;
+ }
+
+ @Nullable
+ @Override
+ public T get(int index)
+ {
+ Indexed.checkIndex(index, 0);
+ return null;
+ }
+
+ @Override
+ public int indexOf(@Nullable T value)
+ {
+ return -1;
+ }
+
+ @Override
+ public Iterator<T> iterator()
+ {
+ return Collections.emptyIterator();
+ }
+
+ @Override
+ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
+ {
+ // nothing to inspect
+ }
+ };
+ }
Review Comment:
Might as well make it a static instance. No real point in having more than
one of these in the JVM.
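A minimal sketch of the static-instance suggestion. The `Indexed` interface here is a simplified stand-in for Druid's real one (only `size`/`get`/`indexOf` plus iteration); the key idea is that the empty view is stateless and never produces a `T`, so one shared instance can be cast safely for any element type.

```java
import java.util.Collections;
import java.util.Iterator;

// Simplified stand-in for Druid's Indexed, showing the shared-instance idea.
interface Indexed<T> extends Iterable<T>
{
  // one stateless instance shared by every call site in the JVM
  Indexed<Object> EMPTY = new Indexed<Object>()
  {
    @Override
    public int size()
    {
      return 0;
    }

    @Override
    public Object get(int index)
    {
      throw new IndexOutOfBoundsException("empty Indexed has no element " + index);
    }

    @Override
    public int indexOf(Object value)
    {
      return -1;
    }

    @Override
    public Iterator<Object> iterator()
    {
      return Collections.emptyIterator();
    }
  };

  @SuppressWarnings("unchecked")
  static <T> Indexed<T> empty()
  {
    // safe: the instance never returns a T, so the unchecked cast cannot fail
    return (Indexed<T>) EMPTY;
  }

  int size();

  T get(int index);

  int indexOf(T value);
}
```

With this shape, `Indexed.empty()` stays a zero-allocation call, the same pattern `Collections.emptyList()` uses.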
##########
processing/src/main/java/org/apache/druid/segment/nested/DictionaryIdLookup.java:
##########
@@ -69,6 +69,16 @@ public void addString(@Nullable String value)
     stringLookup.put(value, id);
   }
+  // used when there are no string values to ensure that 0 is used for the null value
+  public void addNumericNull()
+  {
+    Preconditions.checkState(
+        stringLookup.size() == 0 && longLookup.size() == 0 && doubleLookup.size() == 0,
+        "Lookup must be empty to add implicit null"
+    );
+    dictionarySize++;
+  }
+
Review Comment:
Why not add this in the constructor and just make it impossible to create a
DictionaryIdLookup without `null`?
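A sketch of that constructor-based alternative, using a simplified stand-in for `DictionaryIdLookup` (the fields and helper methods here are hypothetical, not the real class): reserving id 0 for null at construction makes the separate `addNumericNull()` call, and its must-be-empty precondition, unnecessary.

```java
import java.util.TreeMap;

// Simplified stand-in: null always occupies dictionary id 0 from construction.
class DictionaryIdLookup
{
  private final TreeMap<String, Integer> stringLookup = new TreeMap<>();
  private int dictionarySize;

  DictionaryIdLookup()
  {
    // the implicit null entry exists from the start, so there is no
    // separate addNumericNull() call and no "lookup must be empty" check
    dictionarySize = 1;
  }

  void addString(String value)
  {
    if (value == null) {
      return; // already represented by id 0
    }
    stringLookup.computeIfAbsent(value, v -> dictionarySize++);
  }

  int lookupString(String value)
  {
    return value == null ? 0 : stringLookup.getOrDefault(value, -1);
  }

  int size()
  {
    return dictionarySize;
  }
}
```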
##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -335,7 +263,7 @@ private File makeIndexFiles(
       // shouldStore AND hasOnlyNulls
       ColumnDescriptor columnDesc = ColumnDescriptor
           .builder()
-          .setValueType(dimCapabilities.get(i).getType())
+          .setValueType(dimFormats.get(i).getLogicalType().getType())
           .addSerde(new NullColumnPartSerde(indexMergeResult.rowCount, indexSpec.getBitmapSerdeFactory()))
           .build();
Review Comment:
Why do we need to store the type of the `null` column? Maybe we need a
`null` type that indicates that it's always null and we can store it like that?
##########
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndex.java:
##########
@@ -494,16 +477,21 @@ public ColumnCapabilities getColumnHandlerCapabilities(String columnName)
     // for this method though, we want the 'normal' type name for the capabilities, since this is the true 'output'
     // type of the column, so use the type from the MetricDesc instead, which is computed by round-tripping through
     // something like ComplexMetrics.getSerdeForType(valueType.getComplexTypeName()).getTypeName()
-    return ColumnCapabilitiesImpl.copyOf(capabilities)
-                                 .setType(ColumnType.ofComplex(metricDescs.get(columnName).getType()));
+    return new CapabilitiesBasedFormat(
+        ColumnCapabilitiesImpl.snapshot(
+            ColumnCapabilitiesImpl.copyOf(capabilities)
+                .setType(ColumnType.ofComplex(metricDescs.get(columnName).getType())),
+            ColumnCapabilitiesImpl.ALL_FALSE
+        )
+    );
Review Comment:
Do we really need to be making new ones of these over and over again? Can
we not do it once and then just use that?
##########
processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java:
##########
@@ -237,9 +256,14 @@ ColumnValueSelector<?> makeColumnValueSelector(
   ColumnCapabilities getColumnCapabilities();
-  default ColumnCapabilities getHandlerCapabilities()
+  default ColumnFormat getFormat()
   {
-    return getColumnCapabilities();
+    return new CapabilitiesBasedFormat(
+        ColumnCapabilitiesImpl.snapshot(
+            getColumnCapabilities(),
+            CapabilitiesBasedFormat.DIMENSION_CAPABILITY_MERGE_LOGIC
Review Comment:
A bit of a nit, but it seems really weird to have the
`CapabilitiesBasedFormat` need to be given logic from itself? Maybe create a
static `CapabilitiesBasedFormat.snapshot(ColumnCapabilities)` that does this on
its own and keeps the implementation details local and private?
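A sketch of the suggested static factory. The names and the stub `Capabilities` type below are hypothetical (the real class would wrap Druid's `ColumnCapabilities`); the point is only the shape: the coercion defaults never leave the format class, so callers don't hand the class its own constants back.

```java
// Hypothetical stub showing the shape of a static snapshot(...) factory.
class CapabilitiesBasedFormat
{
  static final class Capabilities
  {
    final boolean hasNulls;

    Capabilities(boolean hasNulls)
    {
      this.hasNulls = hasNulls;
    }
  }

  // the coercion default is an implementation detail that never leaves this class
  private static final boolean COERCED_HAS_NULLS = false;

  private final Capabilities capabilities;

  private CapabilitiesBasedFormat(Capabilities capabilities)
  {
    this.capabilities = capabilities;
  }

  // callers pass only capabilities; the merge logic stays local and private
  static CapabilitiesBasedFormat snapshot(Capabilities caps)
  {
    return new CapabilitiesBasedFormat(new Capabilities(caps.hasNulls || COERCED_HAS_NULLS));
  }

  boolean hasNulls()
  {
    return capabilities.hasNulls;
  }
}
```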
##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -508,7 +436,7 @@ private void makeMetricsColumns(
       GenericColumnSerializer writer = metWriters.get(i);
       final ColumnDescriptor.Builder builder = ColumnDescriptor.builder();
-      TypeSignature<ValueType> type = metricsTypes.get(metric);
+      TypeSignature<ValueType> type = metricsTypes.get(metric).getLogicalType();
Review Comment:
Naming nit: should `metricTypes` be renamed to `metricFormats`?
##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnHandler.java:
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.io.Closer;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.nested.StructuredData;
+import org.apache.druid.segment.selector.settable.SettableColumnValueSelector;
+import org.apache.druid.segment.selector.settable.SettableObjectColumnValueSelector;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+
+import java.util.Comparator;
+
+public class StandardTypeColumnHandler implements DimensionHandler<StructuredData, StructuredData, StructuredData>
Review Comment:
From the name, I initially thought that this was the shim layer to bring the
legacy columns forward into the world with `ColumnFormat`. But then I saw that
it hard-codes returning `NESTED_DATA` as the type and has `StructuredData` in
all of the generic argument locations, which makes me wonder why this isn't
called `NestedTypeColumnHandler` or something else named after what it's
hard-coding everything into.
##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnIndexer.java:
##########
@@ -251,31 +232,47 @@ public Class<StructuredData> classOfObject()
       }
     };
   }
-
-  @Override
-  public ColumnCapabilities getColumnCapabilities()
+  private ColumnType getLogicalType()
   {
     if (fieldIndexers.size() == 1 && fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT)) {
       FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
-      if (rootField.isSingleType()) {
-        return ColumnCapabilitiesImpl.createDefault()
-                                     .setType(rootField.getTypes().getSingleType())
-                                     .setHasNulls(hasNulls);
-      }
+      ColumnType singleType = rootField.getTypes().getSingleType();
+      return singleType == null ? ColumnType.NESTED_DATA : singleType;
     }
-    return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
-                                 .setHasNulls(hasNulls);
+    return ColumnType.NESTED_DATA;
   }
   @Override
-  public ColumnCapabilities getHandlerCapabilities()
+  public ColumnCapabilities getColumnCapabilities()
   {
     return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
+                                 .setType(getLogicalType())
                                  .setHasNulls(hasNulls);
   }
+ @Override
+ public ColumnFormat getFormat()
+ {
+ return new NestedDataComplexTypeSerde.LegacyNestedColumnFormat();
Review Comment:
Why does this need to make a new instance on each call? Is the format
itself stateful or something that makes it not thread-safe?
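If the answer is that the format carries no per-column state, the allocation-per-call pattern being questioned can collapse to a shared instance. A hypothetical stand-in class (not the real `LegacyNestedColumnFormat`) sketching that:

```java
// Hypothetical stand-in for the format class: stateless and immutable,
// so one shared instance can serve every getFormat() call.
final class LegacyNestedColumnFormat
{
  static final LegacyNestedColumnFormat INSTANCE = new LegacyNestedColumnFormat();

  private LegacyNestedColumnFormat()
  {
    // no per-column state lives here, which is what makes sharing safe
  }
}
```

`getFormat()` would then be a one-liner returning `INSTANCE`, assuming the real format is indeed stateless.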
##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -890,100 +818,30 @@ private void writeDimValuesAndSetupDimConversion(
     progress.stopSection(section);
   }
-  private void mergeCapabilities(
+  private void mergeFormat(
       final List<IndexableAdapter> adapters,
       final List<String> mergedDimensions,
-      final Map<String, TypeSignature<ValueType>> metricTypes,
-      final List<ColumnCapabilities> dimCapabilities
+      final Map<String, ColumnFormat> metricTypes,
+      final List<ColumnFormat> dimFormats
   )
   {
-    final Map<String, ColumnCapabilities> capabilitiesMap = new HashMap<>();
+    final Map<String, ColumnFormat> columnFormats = new HashMap<>();
     for (IndexableAdapter adapter : adapters) {
       for (String dimension : adapter.getDimensionNames()) {
-        ColumnCapabilities capabilities = adapter.getCapabilities(dimension);
-        capabilitiesMap.compute(dimension, (d, existingCapabilities) ->
-            mergeCapabilities(capabilities, existingCapabilities, DIMENSION_CAPABILITY_MERGE_LOGIC)
-        );
+        ColumnFormat format = adapter.getFormat(dimension);
+        columnFormats.compute(dimension, (d, existingFormat) -> existingFormat == null ? format : format.merge(existingFormat));
       }
       for (String metric : adapter.getMetricNames()) {
-        final ColumnCapabilities capabilities = adapter.getCapabilities(metric);
-        final ColumnCapabilities merged = capabilitiesMap.compute(metric, (m, existingCapabilities) ->
-            mergeCapabilities(capabilities, existingCapabilities, METRIC_CAPABILITY_MERGE_LOGIC)
+        final ColumnFormat format = adapter.getFormat(metric);
+        final ColumnFormat merged = columnFormats.compute(metric, (m, existingFormat) ->
+            existingFormat == null ? format : format.merge(existingFormat)
         );
+
         metricTypes.put(metric, merged);
       }
     }
     for (String dim : mergedDimensions) {
-      dimCapabilities.add(capabilitiesMap.get(dim));
-    }
-  }
Review Comment:
Do we still need the separation between Metrics and Dimensions here? It
seems like we could put all of the columns into a single `Map<String,
ColumnFormat>`? This code is effectively building a single unified Map and
then separating it out as a side-effect to put into other data structures,
which makes for a really weird signature. We could just have this build and
return `columnFormats`, which would make things a bit cleaner, I think.
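A sketch of that single-map shape (the `ColumnFormat` stub and `mergeFormats` helper below are hypothetical, reduced to just the merge operation): each adapter contributes its formats and the method returns the unified map instead of filling two out-parameters.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stub of ColumnFormat, reduced to the merge operation.
interface ColumnFormat
{
  ColumnFormat merge(ColumnFormat other);
}

// Sketch of the suggested signature: build and return one unified map of
// column name -> merged format, rather than mutating caller-supplied structures.
class FormatMerger
{
  static Map<String, ColumnFormat> mergeFormats(List<Map<String, ColumnFormat>> adapterFormats)
  {
    final Map<String, ColumnFormat> merged = new HashMap<>();
    for (Map<String, ColumnFormat> adapter : adapterFormats) {
      for (Map.Entry<String, ColumnFormat> entry : adapter.entrySet()) {
        // mirrors the PR's format.merge(existingFormat) ordering
        merged.merge(entry.getKey(), entry.getValue(), (existing, incoming) -> incoming.merge(existing));
      }
    }
    return merged;
  }
}
```

Callers that still need a metrics-only view could filter the returned map by the metric name set, keeping the dimension/metric split out of the merge itself.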
##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnIndexer.java:
##########
@@ -251,31 +232,47 @@ public Class<StructuredData> classOfObject()
       }
     };
   }
-
-  @Override
-  public ColumnCapabilities getColumnCapabilities()
+  private ColumnType getLogicalType()
   {
     if (fieldIndexers.size() == 1 && fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT)) {
       FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
-      if (rootField.isSingleType()) {
-        return ColumnCapabilitiesImpl.createDefault()
-                                     .setType(rootField.getTypes().getSingleType())
-                                     .setHasNulls(hasNulls);
-      }
+      ColumnType singleType = rootField.getTypes().getSingleType();
+      return singleType == null ? ColumnType.NESTED_DATA : singleType;
     }
-    return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
-                                 .setHasNulls(hasNulls);
+    return ColumnType.NESTED_DATA;
   }
   @Override
-  public ColumnCapabilities getHandlerCapabilities()
+  public ColumnCapabilities getColumnCapabilities()
   {
     return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
+                                 .setType(getLogicalType())
                                  .setHasNulls(hasNulls);
   }
+  @Override
+  public ColumnFormat getFormat()
+  {
+    return new NestedDataComplexTypeSerde.LegacyNestedColumnFormat();
+  }
+
+  @Override
+  public SortedValueDictionary getSortedValueLookups()
+  {
+    return globalDictionary.getSortedCollector();
+  }
+
+  @Override
+  public void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
+  {
+    for (Map.Entry<String, FieldIndexer> entry : fieldIndexers.entrySet()) {
+      // skip adding the field if no types are in the set, meaning only null values have been processed
+      if (!entry.getValue().getTypes().isEmpty()) {
+        mergedFields.put(entry.getKey(), entry.getValue().getTypes());
+      }
+    }
+  }
Review Comment:
So, are the semantics of this method that it's adding the fields in
preparation for data to come in? I.e. it will have added all of these fields,
but at the end of the method the Indexer is in a transitional state, where it's
added the knowledge of the fields, but still has never seen any data with the
new fields, right? Mostly just trying to make sure I'm understanding.
##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnSchema.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.nested.StructuredData;
+
+/**
+ * Common {@link DimensionSchema} for ingestion of 'standard' Druid built-in {@link ColumnType} datatypes.
+ *
+ * Automatically determines the most appropriate type for the given input data, able to produce columns of type:
+ * {@link ColumnType#STRING}
+ * {@link ColumnType#STRING_ARRAY}
+ * {@link ColumnType#LONG}
+ * {@link ColumnType#LONG_ARRAY}
+ * {@link ColumnType#DOUBLE}
+ * {@link ColumnType#DOUBLE_ARRAY}
+ * {@link ColumnType#NESTED_DATA}
+ *
+ * and includes bitmap value set indexes. Input of mixed type will be stored as {@link ColumnType#NESTED_DATA}.
+ *
+ * @see StandardTypeColumnIndexer
+ * @see StandardTypeColumnMerger
+ * @see org.apache.druid.segment.serde.StandardTypeColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardArrayColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardDoubleColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardLongColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardNestedColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardStringColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardTypeColumnPartSerde
+ * @see org.apache.druid.segment.column.StandardTypeColumn
+ */
+public class StandardTypeColumnSchema extends DimensionSchema
+{
+ public static final String TYPE = "standard";
Review Comment:
Okay, I now understand this as the "well, we are using the nested column to
identify singularly-typed columns now, so how about we call that standard"
object. Let's not name it that way, let's just name it `nestedv5`. In some
future, if/when we come up with some other different, better way to auto-detect
and deal with all of the columns (or just another evolution that requires us to
come up with a new name for compatibility reasons), it will be much better to
have it named based on what it's doing rather than how the code believes it is
using the thing at the time.
##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -235,6 +237,18 @@ public int getUnsortedEncodedKeyComponentHashCode(int[] key)
     return Arrays.hashCode(key);
   }
+ @Override
+ public SortedValueDictionary getSortedValueLookups()
+ {
+ return new SortedValueDictionary(
+ getSortedIndexedValues(),
+ Indexed.empty(),
+ Indexed.empty(),
+ Indexed.empty(),
+ null
+ );
+ }
Review Comment:
It seems really weird that this is returning a thing with all of these
Indexed things. It's leaking implementation details of the nested column onto
columns that have absolutely nothing to do with the nested column. It's
important that a column is a world unto itself and that we don't leak
implementation specifics across the columns.
I think we need to kill this method and find another way of doing things. I
offered a solution in another comment. Another option would be for the nested
column itself to know what a `StringDimensionIndexer` is and when it merges
itself with a `StringDimensionIndexer`, it just knows how to get the relevant
information that it needs.
##########
processing/src/main/java/org/apache/druid/segment/column/CachingStringDictionaryEncodedColumn.java:
##########
@@ -54,7 +54,7 @@
 /**
  *
  */
-public class StringDictionaryEncodedColumn implements DictionaryEncodedColumn<String>
+public class CachingStringDictionaryEncodedColumn implements DictionaryEncodedColumn<String>
Review Comment:
Why the rename?
##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -73,7 +75,7 @@ public static <T> Supplier<FixedIndexed<T>> read(ByteBuffer bb, TypeStrategy<T>
         valuesOffset
     );
-    bb.position(buffer.position() + (width * size));
+    bb.position(buffer.position() + (width * (hasNull ? size - 1 : size)));
Review Comment:
It's unclear to me how this code can be correct in this PR unless the
currently running code, which didn't `- 1` when nulls were present, was
already buggy, which makes me wonder whether this change is instead creating
an off-by-one against the rest of the code in this PR.
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnFormat.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.segment.DimensionHandler;
+
+import javax.annotation.Nullable;
+
+public interface ColumnFormat
+{
+ ColumnType getLogicalType();
+
+ ColumnCapabilities toColumnCapabilities();
+
+ DimensionHandler getColumnHandler(String columnName);
+
+ DimensionSchema getColumnSchema(String columnName);
Review Comment:
Why do these take a columnName argument? Is it because `DimensionHandler`
and `DimensionSchema` have getName methods and this is basically just passing
it through to those?
Or, is a `ColumnFormat` object intended to be a factory or have a reference
to the underlying source of columns and be reaching into that to create
different instances based on different column names provided? If this latter
one, what is expected when you ask for a column that isn't of the specific
concrete format class that you are dealing with?
Either way, please add javadoc on this interface explaining it.
##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -168,6 +170,43 @@ public Iterator<T> iterator()
return IndexedIterable.create(this).iterator();
}
+
+
+ public IntIntPair getRange(
+ @Nullable T startValue,
+ boolean startStrict,
+ @Nullable T endValue,
+ boolean endStrict
+ )
+ {
+ final int firstValue = hasNull ? 1 : 0;
+ int startIndex, endIndex;
+ if (startValue == null) {
+ startIndex = firstValue;
+ } else {
+ final int found = indexOf(startValue);
+ if (found >= firstValue) {
+ startIndex = startStrict ? found + 1 : found;
+ } else {
+ startIndex = -(found + 1);
Review Comment:
It's unclear to me what the intention of this method is. Is it trying to
find offsets that could be used by, e.g., a range filter to decide what to
include? If so, having it return negative values seems weird to me. I'm not
sure what other purpose it could have, though, so I'm also not succeeding in
coming up with a world where I want the negative values...
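For what it's worth, the `-(found + 1)` arithmetic follows the `java.util.Arrays.binarySearch` convention, where a negative return encodes the insertion point as `-(insertionPoint) - 1`; decoding it maps a missing start value to the index of the first element greater than it. A small illustration of the decoding (the `RangeSketch` helper is hypothetical, for demonstration only):

```java
import java.util.Arrays;

// Demonstrates the binarySearch convention the getRange code appears to rely on:
// a negative result encodes the insertion point, and -(found + 1) decodes it.
class RangeSketch
{
  static int lowerBound(long[] sorted, long startValue)
  {
    final int found = Arrays.binarySearch(sorted, startValue);
    // exact match: start there; no match: start at the insertion point,
    // i.e. the first element greater than startValue
    return found >= 0 ? found : -(found + 1);
  }
}
```

If `getRange` always decodes this way internally, clamping before returning (so callers only ever see valid offsets) would address the concern about negative values escaping.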
##########
processing/src/main/java/org/apache/druid/segment/column/CapabilitiesBasedFormat.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.segment.DimensionHandler;
+import org.apache.druid.segment.DimensionHandlerUtils;
+
+import javax.annotation.Nullable;
+import java.util.Objects;
+
+public class CapabilitiesBasedFormat implements ColumnFormat
+{
+  // merge logic for the state capabilities will be in after incremental index is persisted
+  public static final ColumnCapabilities.CoercionLogic DIMENSION_CAPABILITY_MERGE_LOGIC =
+      new ColumnCapabilities.CoercionLogic()
+      {
+        @Override
+        public boolean dictionaryEncoded()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean dictionaryValuesSorted()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean dictionaryValuesUnique()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean multipleValues()
+        {
+          return false;
+        }
+
+        @Override
+        public boolean hasNulls()
+        {
+          return false;
+        }
+      };
+  private final ColumnCapabilities capabilities;
+
+  public CapabilitiesBasedFormat(ColumnCapabilities capabilities)
+  {
+    this.capabilities = capabilities;
+  }
+
+  @Override
+  public DimensionHandler getColumnHandler(String columnName)
+  {
+    return DimensionHandlerUtils.getHandlerFromCapabilities(columnName, capabilities, null);
+  }
+
+  @Override
+  public DimensionSchema getColumnSchema(String columnName)
+  {
+    return getColumnHandler(columnName).getDimensionSchema(capabilities);
+  }
+
+  @Override
+  public ColumnFormat merge(@Nullable ColumnFormat otherFormat)
+  {
+    if (otherFormat == null) {
+      return this;
+    }
+
+    ColumnCapabilitiesImpl merged = ColumnCapabilitiesImpl.copyOf(this.toColumnCapabilities());
+    ColumnCapabilitiesImpl otherSnapshot = ColumnCapabilitiesImpl.copyOf(otherFormat.toColumnCapabilities());
+
+    if (!Objects.equals(merged.getType(), otherSnapshot.getType())
+        || !Objects.equals(merged.getElementType(), otherSnapshot.getElementType())) {
+      final String mergedType = merged.getType() == null ? null : merged.asTypeString();
+      final String otherType = otherSnapshot.getType() == null ? null : otherSnapshot.asTypeString();
+      throw new ISE(
+          "Cannot merge columns of type[%s] and [%s]",
+          mergedType,
+          otherType
+      );
+    } else if (!Objects.equals(merged.getComplexTypeName(), otherSnapshot.getComplexTypeName())) {
+      throw new ISE(
+          "Cannot merge columns of type[%s] and [%s]",
+          merged.getComplexTypeName(),
+          otherSnapshot.getComplexTypeName()
+      );
+    }
+
+    merged.setDictionaryEncoded(merged.isDictionaryEncoded().or(otherSnapshot.isDictionaryEncoded()).isTrue());
+    merged.setHasMultipleValues(merged.hasMultipleValues().or(otherSnapshot.hasMultipleValues()).isTrue());
+    merged.setDictionaryValuesSorted(
+        merged.areDictionaryValuesSorted().and(otherSnapshot.areDictionaryValuesSorted()).isTrue()
+    );
+    merged.setDictionaryValuesUnique(
+        merged.areDictionaryValuesUnique().and(otherSnapshot.areDictionaryValuesUnique()).isTrue()
+    );
+    merged.setHasNulls(merged.hasNulls().or(otherSnapshot.hasNulls()).isTrue());
+    // When merging persisted queryableIndexes in the same ingestion job,
+    // all queryableIndexes should have the exact same hasBitmapIndexes flag set which is set in the ingestionSpec.
+    // One exception is null-only columns as they always do NOT have bitmap indexes no matter whether the flag is set
+    // in the ingestionSpec. As a result, the mismatch checked in the if clause below can happen
+    // when one of the columnCapability is from a real column and another is from a null-only column.
+    // See NullColumnPartSerde for how columnCapability is created for null-only columns.
+    // When the mismatch is found, we prefer the flag set in the ingestionSpec over
+    // the columnCapability of null-only columns.
+    if (merged.hasBitmapIndexes() != otherSnapshot.hasBitmapIndexes()) {
+      merged.setHasBitmapIndexes(false);
+    }
Review Comment:
This code and comment seem weird to me. If I'm understanding it correctly: if
the column is ever null, then as it gets recompacted, the fact that the null
column doesn't have indexes ends up infecting the compaction, because the
mismatch forcibly turns indexing off and we stop indexing the field?
Why wouldn't we force it to `true` if any of them is `true`, and only push it
to `false` if all of them are `false`?
##########
processing/src/main/java/org/apache/druid/segment/data/GenericIndexedWriter.java:
##########
@@ -301,6 +301,12 @@ public T get(int index) throws IOException
return strategy.fromByteBuffer(bb, valueSize);
}
+ @Override
+ public int getCardinality()
+ {
+ return numWritten;
+ }
+
Review Comment:
`numWritten` is not necessarily equivalent to the cardinality. A
`GenericIndexedWriter` could absolutely write the same exact value out multiple
times, in which case the cardinality is strictly less than the number of things
written. I'm not sure how important this distinction is, though. And if this
is really trying to capture the number of things written, then perhaps `size()`
or `numRows()` or something like that would be a better name?
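The distinction can be illustrated with a toy writer (hypothetical names, not `GenericIndexedWriter` itself):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a writer that appends values can count writes and
// distinct values separately; the two diverge as soon as a value repeats,
// which is the naming concern above.
public class WriterCounts
{
  private int numWritten = 0;
  private final Set<String> distinct = new HashSet<>();

  void write(String value)
  {
    numWritten++;
    distinct.add(value);
  }

  int size()        { return numWritten; }      // number of things written
  int cardinality() { return distinct.size(); } // number of distinct values

  public static void main(String[] args)
  {
    WriterCounts w = new WriterCounts();
    w.write("a");
    w.write("b");
    w.write("a"); // duplicate
    System.out.println(w.size());        // 3
    System.out.println(w.cardinality()); // 2
  }
}
```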
##########
processing/src/main/java/org/apache/druid/segment/serde/ColumnPartSerde.java:
##########
@@ -53,6 +54,6 @@
interface Deserializer
{
-    void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
+    void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
Review Comment:
Let's not add the extra parameter here.
##########
processing/src/main/java/org/apache/druid/segment/serde/ColumnPartSerde.java:
##########
@@ -39,7 +39,8 @@
    @JsonSubTypes.Type(name = "floatV2", value = FloatNumericColumnPartSerdeV2.class),
    @JsonSubTypes.Type(name = "longV2", value = LongNumericColumnPartSerdeV2.class),
    @JsonSubTypes.Type(name = "doubleV2", value = DoubleNumericColumnPartSerdeV2.class),
-   @JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class)
+   @JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class),
+   @JsonSubTypes.Type(name = "standard", value = StandardTypeColumnPartSerde.class)
Review Comment:
What are the semantics that the standard column part serde is trying to
implement? If it's standard, does it subsume all of the others?
##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedColumnPartSerde.java:
##########
@@ -294,7 +294,7 @@ public Deserializer getDeserializer()
return new Deserializer()
{
@Override
-      public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
+      public void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
Review Comment:
Once we remove the argument from `ComplexMetricSerde` we don't need it here
anymore either.
##########
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndex.java:
##########
@@ -592,11 +580,7 @@ IncrementalIndexRowResult toIncrementalIndexRow(InputRow row)
         wasNewDim = true;
         final DimensionHandler<?, ?, ?> handler;
         if (useSchemaDiscovery) {
-          handler = DimensionHandlerUtils.getHandlerFromCapabilities(
-              dimension,
-              makeDefaultCapabilitiesFromValueType(NestedDataComplexTypeSerde.TYPE),
-              null
-          );
+          handler = new StandardTypeColumnHandler(dimension);
Review Comment:
Perhaps overkill for now, but I think we can make IncrementalIndex take a
factory (either a specific Factory-style class or a `Function<String,
DimensionHandler>`) which is injected from Guice and used whenever a new column
is discovered. This will generally make our code cleaner as well, as it leaves
a lot less temptation to assume that there's some "standard" implementation.
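A minimal sketch of the factory idea, with hypothetical names (this is not Druid's actual API, and the Guice wiring is omitted):

```java
import java.util.function.Function;

// Hypothetical sketch: the index accepts an injected factory and calls it
// whenever a newly discovered column needs a handler, instead of hard-coding
// one "standard" handler implementation.
public class HandlerFactorySketch
{
  interface DimensionHandler
  {
    String name();
  }

  static class IncrementalIndexLike
  {
    private final Function<String, DimensionHandler> handlerFactory;

    IncrementalIndexLike(Function<String, DimensionHandler> handlerFactory)
    {
      this.handlerFactory = handlerFactory;
    }

    DimensionHandler onNewDimension(String dimension)
    {
      // No assumption here about which handler type is "standard";
      // the injected factory decides.
      return handlerFactory.apply(dimension);
    }
  }

  public static void main(String[] args)
  {
    IncrementalIndexLike index =
        new IncrementalIndexLike(dim -> () -> "handler:" + dim);
    System.out.println(index.onNewDimension("page").name()); // handler:page
  }
}
```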
##########
processing/src/test/java/org/apache/druid/segment/NestedDataColumnIndexerTest.java:
##########
@@ -94,34 +94,34 @@ public void testKeySizeEstimation()
// new raw value, new fields
    key = indexer.processRowValsToUnsortedEncodedKeyComponent(ImmutableList.of(1L, 2L, 10L), false);
- Assert.assertEquals(168, key.getEffectiveSizeBytes());
- Assert.assertEquals(6, indexer.getCardinality());
+ Assert.assertEquals(276, key.getEffectiveSizeBytes());
+ Assert.assertEquals(5, indexer.getCardinality());
Review Comment:
Is the difference in numbers because the test was validating v5, and now it's
validating v4?
##########
processing/src/main/java/org/apache/druid/segment/serde/ComplexMetricSerde.java:
##########
@@ -51,6 +51,21 @@
* @param builder ColumnBuilder to add the column to
* @param columnConfig ColumnConfiguration used during deserialization
*/
+ public void deserializeColumn(
+ @SuppressWarnings("unused") String columnName,
+ ByteBuffer buffer,
+ ColumnBuilder builder,
+ ColumnConfig columnConfig
+ )
+ {
+ deserializeColumn(buffer, builder, columnConfig);
+ }
Review Comment:
I don't believe that we need to add this columnName parameter. So far, the
serde code has generally kept the name disassociated from the actual storage of
the column, because the name of the column is unimportant for the ability to
serialize and deserialize the bytes.
It looks like this is being done because the column name is used as a prefix
on the other names in the file smoosher. That makes sense, but at this point
it's not a "column name" as much as a "unique prefix". Given that it is a
prefix that we expect this column to use on all of its things, I think it makes
sense to serialize out the unique prefix as part of the bytes of the column
itself and then read it back in from there. Let it be coincidence that the
unique prefix just so happens to be the same thing as the column name.
This is pretty similar to how it was already working in the "older" versions,
where it got the information by deserializing a metadata object...
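A minimal sketch of that approach, assuming a simple length-prefixed header (the names and byte layout here are hypothetical, not Druid's actual serde format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: write the smoosh-file prefix into the column's own
// bytes at serialization time, then read it back at deserialization time,
// so the deserializer never needs the column name passed in.
public class PrefixedColumnBytes
{
  static ByteBuffer writeWithPrefix(String prefix, byte[] payload)
  {
    byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + prefixBytes.length + payload.length);
    buf.putInt(prefixBytes.length); // length-prefixed UTF-8 prefix header
    buf.put(prefixBytes);
    buf.put(payload);
    buf.flip();
    return buf;
  }

  static String readPrefix(ByteBuffer buf)
  {
    // Consumes the header, leaving the buffer positioned at the payload.
    byte[] prefixBytes = new byte[buf.getInt()];
    buf.get(prefixBytes);
    return new String(prefixBytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    ByteBuffer column = writeWithPrefix("myColumn", new byte[]{1, 2, 3});
    System.out.println(readPrefix(column)); // myColumn
  }
}
```

The prefix usually coincides with the column name, but the reader no longer depends on that coincidence.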
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]