imply-cheddar commented on code in PR #14014:
URL: https://github.com/apache/druid/pull/14014#discussion_r1155417652


##########
processing/src/main/java/org/apache/druid/data/input/impl/DimensionSchema.java:
##########
@@ -47,7 +51,9 @@
     @JsonSubTypes.Type(name = DimensionSchema.FLOAT_TYPE_NAME, value = FloatDimensionSchema.class),
     @JsonSubTypes.Type(name = DimensionSchema.DOUBLE_TYPE_NAME, value = DoubleDimensionSchema.class),
     @JsonSubTypes.Type(name = DimensionSchema.SPATIAL_TYPE_NAME, value = NewSpatialDimensionSchema.class),
-    @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class)
+    @JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class),
+    @JsonSubTypes.Type(name = StandardTypeColumnSchema.TYPE, value = StandardTypeColumnSchema.class),
+    @JsonSubTypes.Type(name = "auto", value = StandardTypeColumnSchema.class)

Review Comment:
   Why do we need this added as well?  Shouldn't anything that is using this be 
able to use `StandardTypeColumnSchema.TYPE` when it persists such that we don't 
need to add 2 names for the same thing?



##########
processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java:
##########
@@ -175,6 +182,18 @@ EncodedKeyComponent<EncodedKeyComponentType> processRowValsToUnsortedEncodedKeyC
   CloseableIndexed<ActualType> getSortedIndexedValues();
 
 
+  default SortedValueDictionary getSortedValueLookups()
+  {
+    throw new UnsupportedOperationException("Column does not support value dictionaries.");
+  }
+
+  default void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
+  {
+    mergedFields.put(
+        NestedPathFinder.JSON_PATH_ROOT,
+        new FieldTypeInfo.MutableTypeSet().add(getColumnCapabilities().toColumnType())
+    );
+  }

Review Comment:
   This feels very nested-column-specific, but it's a change on the `DimensionIndexer` interface.  Does it really need to be here?  Can't `DimensionIndexer` instances make sure that they are all the same type and then use concrete methods instead of leaking something like this on the interface?



##########
processing/src/main/java/org/apache/druid/segment/QueryableIndexIndexableAdapter.java:
##########
@@ -155,6 +160,37 @@ public void close() throws IOException
     };
   }
 
+  @Override
+  public SortedValueDictionary getSortedValueLookup(
+      String dimension,
+      SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields
+  )
+  {
+    final ColumnHolder columnHolder = input.getColumnHolder(dimension);
+
+    if (columnHolder == null) {
+      return null;
+    }
+    if (!(columnHolder.getColumnFormat() instanceof StandardTypeColumn.Format)) {
+      return null;
+    }
+
+    final BaseColumn col = columnHolder.getColumn();
+    if (col instanceof StandardTypeColumn) {
+      StandardTypeColumn column = (StandardTypeColumn) col;
+      column.mergeNestedFields(mergedFields);
+      return new SortedValueDictionary(
+          column.getStringDictionary(),
+          column.getLongDictionary(),
+          column.getDoubleDictionary(),
+          column.getArrayDictionary(),
+          column
+      );
+    }
+    // leaky leaky but this shouldn't be able to happen because of the format check...
+    return null;
+  }

Review Comment:
   Seeing the implementation of this method here feels really weird to me.  
There's a comment about potentially leaking resources and an instanceof check 
that is suspect...
   
   It seems like it would be a lot cleaner if we could ask the 
`IndexableAdapter` for something that is effectively a column, like a 
`DimensionHandler` instance, then "merge" those instances and have that thing 
give us a value dictionary if we need it.  I think that if we do this, though, 
we will find that the value dictionary no longer even needs to be exposed 
outside of the concrete classes.
   
   
   
   



##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -263,19 +191,19 @@ private File makeIndexFiles(
       log.debug("Completed factory.json in %,d millis", System.currentTimeMillis() - startTime);
 
       progress.progress();
-      final Map<String, TypeSignature<ValueType>> metricTypes = new TreeMap<>(Comparators.naturalNullsFirst());
-      final List<ColumnCapabilities> dimCapabilities = Lists.newArrayListWithCapacity(mergedDimensions.size());
-      mergeCapabilities(adapters, mergedDimensions, metricTypes, dimCapabilities);
+      final Map<String, ColumnFormat> metricTypes = new TreeMap<>(Comparators.naturalNullsFirst());
+      final List<ColumnFormat> dimFormats = Lists.newArrayListWithCapacity(mergedDimensions.size());
+      mergeFormat(adapters, mergedDimensions, metricTypes, dimFormats);
 
-      final Map<String, DimensionHandler> handlers = makeDimensionHandlers(mergedDimensions, dimCapabilities);
+      final Map<String, DimensionHandler> handlers = makeDimensionHandlers(mergedDimensions, dimFormats);
       final List<DimensionMergerV9> mergers = new ArrayList<>();
       for (int i = 0; i < mergedDimensions.size(); i++) {
         DimensionHandler handler = handlers.get(mergedDimensions.get(i));
         mergers.add(
             handler.makeMerger(
                 indexSpec,
                 segmentWriteOutMedium,
-                dimCapabilities.get(i),
+                dimFormats.get(i).toColumnCapabilities(),
                 progress,
                 closer
             )

Review Comment:
   The `handler` came from the `dimFormat`, right?  Why would it need the 
capabilities from `dimFormat`?  Can we not eliminate the argument entirely?
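   A hypothetical sketch of what dropping the argument could look like, with stand-in classes (not the real Druid types): since the handler was constructed from the format, `makeMerger` can pull whatever it needs from the handler's own format.

```java
// Stand-in for ColumnFormat; illustrative only.
class MiniFormat
{
  final String logicalType;

  MiniFormat(String logicalType)
  {
    this.logicalType = logicalType;
  }
}

// Stand-in for DimensionHandler; it keeps the format it was built from.
class MiniHandler
{
  private final MiniFormat format;

  MiniHandler(MiniFormat format)
  {
    this.format = format;
  }

  // no capabilities parameter: the handler already knows its own format
  String makeMerger()
  {
    return "merger<" + format.logicalType + ">";
  }
}
```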



##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnIndexer.java:
##########
@@ -0,0 +1,612 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import org.apache.druid.collections.bitmap.BitmapFactory;
+import org.apache.druid.collections.bitmap.MutableBitmap;
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.UOE;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExpressionType;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.query.monomorphicprocessing.RuntimeShapeInspector;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnCapabilitiesImpl;
+import org.apache.druid.segment.column.ColumnFormat;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.data.CloseableIndexed;
+import org.apache.druid.segment.incremental.IncrementalIndex;
+import org.apache.druid.segment.incremental.IncrementalIndexRowHolder;
+import org.apache.druid.segment.nested.FieldTypeInfo;
+import org.apache.druid.segment.nested.NestedPathFinder;
+import org.apache.druid.segment.nested.NestedPathPart;
+import org.apache.druid.segment.nested.SortedValueDictionary;
+import org.apache.druid.segment.nested.StructuredData;
+import org.apache.druid.segment.nested.StructuredDataProcessor;
+import org.apache.druid.segment.nested.ValueDictionary;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import java.util.SortedMap;
+import java.util.TreeMap;
+
+public class StandardTypeColumnIndexer implements DimensionIndexer<StructuredData, StructuredData, StructuredData>

Review Comment:
   Is this basically just the nested column that understands arrays?  If so, can we call it just NestedV2 or something like that instead of "standard"?  "Standard" is not a great name: the moment we come up with some other idea, we will have this "standard" thing lying around that we don't commonly use.  Much better to name it based on what it actually does.



##########
processing/src/main/java/org/apache/druid/segment/column/ColumnBuilder.java:
##########
@@ -116,6 +116,13 @@ public ColumnBuilder setNumericColumnSupplier(Supplier<? extends NumericColumn>
     return this;
   }
 
+  public ColumnBuilder setStandardTypeColumnSupplier(Supplier<? extends StandardTypeColumn> columnSupplier)
+  {
+    checkColumnSupplierNotSet();
+    this.columnSupplier = columnSupplier;
+    return this;
+  }

Review Comment:
   Why is this special?  Isn't it just another format?



##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -168,6 +170,43 @@ public Iterator<T> iterator()
     return IndexedIterable.create(this).iterator();
   }
 
+
+
+  public IntIntPair getRange(
+      @Nullable T startValue,
+      boolean startStrict,

Review Comment:
   instead of string/non-strict, I think `inclusive` or `exclusive` would be a 
more common naming for the intent that I believe this and `endStrict` represent.
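   A sketch of the suggested naming, using a plain sorted `int[]` as a stand-in for `FixedIndexed` (the class and method here are illustrative, not Druid's):

```java
import java.util.Arrays;

final class SortedRanges
{
  // Returns {startIndex, endIndex} delimiting values in the requested range;
  // endIndex is exclusive (as usual for index ranges), and the *Inclusive
  // flags say whether the bound values themselves belong to the range.
  static int[] getRange(int[] sorted, int start, boolean startInclusive, int end, boolean endInclusive)
  {
    int lo = Arrays.binarySearch(sorted, start);
    lo = lo >= 0 ? (startInclusive ? lo : lo + 1) : -(lo + 1);
    int hi = Arrays.binarySearch(sorted, end);
    hi = hi >= 0 ? (endInclusive ? hi + 1 : hi) : -(hi + 1);
    return new int[]{lo, hi};
  }
}
```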



##########
processing/src/main/java/org/apache/druid/segment/data/Indexed.java:
##########
@@ -36,6 +39,43 @@
 @PublicApi
 public interface Indexed<T> extends Iterable<T>, HotLoopCallee
 {
+  static <T> Indexed<T> empty()
+  {
+    return new Indexed<T>()
+    {
+      @Override
+      public int size()
+      {
+        return 0;
+      }
+
+      @Nullable
+      @Override
+      public T get(int index)
+      {
+        Indexed.checkIndex(index, 0);
+        return null;
+      }
+
+      @Override
+      public int indexOf(@Nullable T value)
+      {
+        return -1;
+      }
+
+      @Override
+      public Iterator<T> iterator()
+      {
+        return Collections.emptyIterator();
+      }
+
+      @Override
+      public void inspectRuntimeShape(RuntimeShapeInspector inspector)
+      {
+        // nothing to inspect
+      }
+    };
+  }

Review Comment:
   Might as well make it a static instance.  No real point in having more than 
one of these in the JVM.
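   A sketch of the singleton approach, against a minimal stand-in for the real `Indexed` interface (which also extends `HotLoopCallee`); names here are illustrative only. The unchecked cast is safe because the empty instance never produces or consumes a `T`.

```java
import java.util.Collections;
import java.util.Iterator;

interface MiniIndexed<T> extends Iterable<T>
{
  int size();

  T get(int index);

  int indexOf(T value);

  // Single shared instance for the whole JVM.
  MiniIndexed<Object> EMPTY = new MiniIndexed<Object>()
  {
    @Override public int size() { return 0; }
    @Override public Object get(int index) { throw new IndexOutOfBoundsException("index: " + index); }
    @Override public int indexOf(Object value) { return -1; }
    @Override public Iterator<Object> iterator() { return Collections.emptyIterator(); }
  };

  @SuppressWarnings("unchecked")
  static <T> MiniIndexed<T> empty()
  {
    // safe: the empty Indexed never returns a non-null T
    return (MiniIndexed<T>) EMPTY;
  }
}
```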



##########
processing/src/main/java/org/apache/druid/segment/nested/DictionaryIdLookup.java:
##########
@@ -69,6 +69,16 @@ public void addString(@Nullable String value)
     stringLookup.put(value, id);
   }
 
+  // used when there are no string values to ensure that 0 is used for the null value
+  public void addNumericNull()
+  {
+    Preconditions.checkState(
+        stringLookup.size() == 0 && longLookup.size() == 0 && doubleLookup.size() == 0,
+        "Lookup must be empty to add implicit null"
+    );
+    dictionarySize++;
+  }
+

Review Comment:
   Why not add this in the constructor and just make it impossible to create a 
DictionaryIdLookup without `null`?
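   A simplified sketch of the constructor-based approach (not the real `DictionaryIdLookup`, which also tracks long and double lookups): id 0 is reserved for null at construction time, so callers can never forget to add it.

```java
import java.util.TreeMap;

class NullFirstIdLookup
{
  private final TreeMap<String, Integer> stringLookup = new TreeMap<>();
  private int dictionarySize;

  NullFirstIdLookup()
  {
    dictionarySize = 1; // id 0 always belongs to the null value
  }

  // assigns the next id to a previously unseen non-null value
  int addString(String value)
  {
    return stringLookup.computeIfAbsent(value, v -> dictionarySize++);
  }

  int nullId()
  {
    return 0;
  }

  int size()
  {
    return dictionarySize;
  }
}
```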



##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -335,7 +263,7 @@ private File makeIndexFiles(
           // shouldStore AND hasOnlyNulls
           ColumnDescriptor columnDesc = ColumnDescriptor
               .builder()
-              .setValueType(dimCapabilities.get(i).getType())
+              .setValueType(dimFormats.get(i).getLogicalType().getType())
+              .addSerde(new NullColumnPartSerde(indexMergeResult.rowCount, indexSpec.getBitmapSerdeFactory()))
               .build();

Review Comment:
   Why do we need to store the type of the `null` column?  Maybe we need a 
`null` type that indicates that it's always null and we can store it like that?



##########
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndex.java:
##########
@@ -494,16 +477,21 @@ public ColumnCapabilities getColumnHandlerCapabilities(String columnName)
         // for this method though, we want the 'normal' type name for the capabilities, since this is the true 'output'
         // type of the column, so use the type from the MetricDesc instead, which is computed by round-tripping through
         // something like ComplexMetrics.getSerdeForType(valueType.getComplexTypeName()).getTypeName()
-        return ColumnCapabilitiesImpl.copyOf(capabilities)
-                                     .setType(ColumnType.ofComplex(metricDescs.get(columnName).getType()));
+        return new CapabilitiesBasedFormat(
+            ColumnCapabilitiesImpl.snapshot(
+                ColumnCapabilitiesImpl.copyOf(capabilities)
+                                      .setType(ColumnType.ofComplex(metricDescs.get(columnName).getType())),
+                ColumnCapabilitiesImpl.ALL_FALSE
+            )
+        );

Review Comment:
   Do we really need to be making new ones of these over and over again?  Can 
we not do it once and then just use that?



##########
processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java:
##########
@@ -237,9 +256,14 @@ ColumnValueSelector<?> makeColumnValueSelector(
 
   ColumnCapabilities getColumnCapabilities();
 
-  default ColumnCapabilities getHandlerCapabilities()
+  default ColumnFormat getFormat()
   {
-    return getColumnCapabilities();
+    return new CapabilitiesBasedFormat(
+        ColumnCapabilitiesImpl.snapshot(
+            getColumnCapabilities(),
+            CapabilitiesBasedFormat.DIMENSION_CAPABILITY_MERGE_LOGIC

Review Comment:
   A bit of a nit, but it seems really weird to have the 
`CapabilitiesBasedFormat` need to be given logic from itself?  Maybe create a 
static `CapabilitiesBasedFormat.snapshot(ColumnCapabilities)` that does this on 
its own and keeps the implementation details local and private?
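   A sketch of the suggested static factory, with stand-in classes (the real `CapabilitiesBasedFormat` and `ColumnCapabilities` carry far more state); the point is only that the coercion logic stays a private detail of the format class.

```java
// Stand-in for ColumnCapabilities; illustrative only.
class MiniCapabilities
{
  final boolean hasNulls;

  MiniCapabilities(boolean hasNulls)
  {
    this.hasNulls = hasNulls;
  }
}

class MiniCapabilitiesBasedFormat
{
  private final MiniCapabilities capabilities;

  private MiniCapabilitiesBasedFormat(MiniCapabilities capabilities)
  {
    this.capabilities = capabilities;
  }

  // Callers just write MiniCapabilitiesBasedFormat.snapshot(caps); the
  // coercion/merge rules never need to be passed in from outside.
  static MiniCapabilitiesBasedFormat snapshot(MiniCapabilities caps)
  {
    return new MiniCapabilitiesBasedFormat(new MiniCapabilities(caps.hasNulls));
  }

  boolean hasNulls()
  {
    return capabilities.hasNulls;
  }
}
```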



##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -508,7 +436,7 @@ private void makeMetricsColumns(
       GenericColumnSerializer writer = metWriters.get(i);
 
       final ColumnDescriptor.Builder builder = ColumnDescriptor.builder();
-      TypeSignature<ValueType> type = metricsTypes.get(metric);
+      TypeSignature<ValueType> type = metricsTypes.get(metric).getLogicalType();

Review Comment:
   Naming nit: should `metricTypes` be renamed to `metricFormats`?



##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnHandler.java:
##########
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.io.Closer;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.nested.StructuredData;
+import org.apache.druid.segment.selector.settable.SettableColumnValueSelector;
+import org.apache.druid.segment.selector.settable.SettableObjectColumnValueSelector;
+import org.apache.druid.segment.writeout.SegmentWriteOutMedium;
+
+import java.util.Comparator;
+
+public class StandardTypeColumnHandler implements DimensionHandler<StructuredData, StructuredData, StructuredData>

Review Comment:
   From the name, I initially thought that this was the shim layer to bring the legacy columns forward into the world with `ColumnFormat`.  But then I saw that it hard codes returning `NESTED_DATA` as the type and has `StructuredData` in all of the generic argument locations, which makes me wonder why this isn't called `NestedTypeColumnHandler`, or something else named after what it's hard coding everything into.



##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnIndexer.java:
##########
@@ -251,31 +232,47 @@ public Class<StructuredData> classOfObject()
       }
     };
   }
-
-  @Override
-  public ColumnCapabilities getColumnCapabilities()
+  private ColumnType getLogicalType()
   {
     if (fieldIndexers.size() == 1 && fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT)) {
       FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
-      if (rootField.isSingleType()) {
-        return ColumnCapabilitiesImpl.createDefault()
-                                     .setType(rootField.getTypes().getSingleType())
-                                     .setHasNulls(hasNulls);
-      }
+      ColumnType singleType = rootField.getTypes().getSingleType();
+      return singleType == null ? ColumnType.NESTED_DATA : singleType;
     }
-    return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
-                                 .setHasNulls(hasNulls);
+    return ColumnType.NESTED_DATA;
   }
 
   @Override
-  public ColumnCapabilities getHandlerCapabilities()
+  public ColumnCapabilities getColumnCapabilities()
   {
     return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
+                                 .setType(getLogicalType())
                                  .setHasNulls(hasNulls);
   }
 
+  @Override
+  public ColumnFormat getFormat()
+  {
+    return new NestedDataComplexTypeSerde.LegacyNestedColumnFormat();

Review Comment:
   Why does this need to make a new instance on each call?  Is the format 
itself stateful or something that makes it not thread-safe?
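   If the format is immutable, a shared instance would do; a sketch with stand-in class names (not Druid's), assuming immutability:

```java
// Stand-in for the format object; immutable, so one instance can be shared.
final class MiniNestedColumnFormat
{
  static final MiniNestedColumnFormat INSTANCE = new MiniNestedColumnFormat();

  private MiniNestedColumnFormat()
  {
  }
}

// Stand-in for the indexer: getFormat() hands out the shared instance
// instead of allocating a new one per call.
class MiniIndexer
{
  MiniNestedColumnFormat getFormat()
  {
    return MiniNestedColumnFormat.INSTANCE;
  }
}
```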



##########
processing/src/main/java/org/apache/druid/segment/IndexMergerV9.java:
##########
@@ -890,100 +818,30 @@ private void writeDimValuesAndSetupDimConversion(
     progress.stopSection(section);
   }
 
-  private void mergeCapabilities(
+  private void mergeFormat(
       final List<IndexableAdapter> adapters,
       final List<String> mergedDimensions,
-      final Map<String, TypeSignature<ValueType>> metricTypes,
-      final List<ColumnCapabilities> dimCapabilities
+      final Map<String, ColumnFormat> metricTypes,
+      final List<ColumnFormat> dimFormats
   )
   {
-    final Map<String, ColumnCapabilities> capabilitiesMap = new HashMap<>();
+    final Map<String, ColumnFormat> columnFormats = new HashMap<>();
     for (IndexableAdapter adapter : adapters) {
       for (String dimension : adapter.getDimensionNames()) {
-        ColumnCapabilities capabilities = adapter.getCapabilities(dimension);
-        capabilitiesMap.compute(dimension, (d, existingCapabilities) ->
-            mergeCapabilities(capabilities, existingCapabilities, 
DIMENSION_CAPABILITY_MERGE_LOGIC)
-        );
+        ColumnFormat format = adapter.getFormat(dimension);
+        columnFormats.compute(dimension, (d, existingFormat) -> existingFormat == null ? format : format.merge(existingFormat));
       }
       for (String metric : adapter.getMetricNames()) {
-        final ColumnCapabilities capabilities = adapter.getCapabilities(metric);
-        final ColumnCapabilities merged = capabilitiesMap.compute(metric, (m, existingCapabilities) ->
-            mergeCapabilities(capabilities, existingCapabilities, METRIC_CAPABILITY_MERGE_LOGIC)
+        final ColumnFormat format = adapter.getFormat(metric);
+        final ColumnFormat merged = columnFormats.compute(metric, (m, existingFormat) ->
+            existingFormat == null ? format : format.merge(existingFormat)
         );
+
         metricTypes.put(metric, merged);
       }
     }
     for (String dim : mergedDimensions) {
-      dimCapabilities.add(capabilitiesMap.get(dim));
-    }
-  }

Review Comment:
   Do we still need the separation between Metrics and Dimensions here?  It 
seems like we could put all of the columns into a single `Map<String, 
ColumnFormat>`?  This code is effectively building a single unified Map and 
then separating it out as a side-effect to put into other data structures, 
which makes for a really weird signature.  We could just have this build and 
return `columnFormats`, which would make things a bit cleaner, I think.
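   A sketch of the suggested shape, building and returning one unified map instead of filling two out-parameters. `String` stands in for `ColumnFormat` here, and the `BinaryOperator` plays the role of `ColumnFormat.merge()`; none of these names are Druid's.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

final class FormatMerger
{
  // folds the per-adapter column formats into a single map, merging on collision
  static Map<String, String> mergeFormats(List<Map<String, String>> adapters, BinaryOperator<String> merge)
  {
    final Map<String, String> columnFormats = new HashMap<>();
    for (Map<String, String> adapter : adapters) {
      adapter.forEach((column, format) -> columnFormats.merge(column, format, merge));
    }
    return columnFormats;
  }
}
```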



##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnIndexer.java:
##########
@@ -251,31 +232,47 @@ public Class<StructuredData> classOfObject()
       }
     };
   }
-
-  @Override
-  public ColumnCapabilities getColumnCapabilities()
+  private ColumnType getLogicalType()
   {
     if (fieldIndexers.size() == 1 && fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT)) {
       FieldIndexer rootField = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
-      if (rootField.isSingleType()) {
-        return ColumnCapabilitiesImpl.createDefault()
-                                     .setType(rootField.getTypes().getSingleType())
-                                     .setHasNulls(hasNulls);
-      }
+      ColumnType singleType = rootField.getTypes().getSingleType();
+      return singleType == null ? ColumnType.NESTED_DATA : singleType;
     }
-    return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
-                                 .setHasNulls(hasNulls);
+    return ColumnType.NESTED_DATA;
   }
 
   @Override
-  public ColumnCapabilities getHandlerCapabilities()
+  public ColumnCapabilities getColumnCapabilities()
   {
     return ColumnCapabilitiesImpl.createDefault()
-                                 .setType(NestedDataComplexTypeSerde.TYPE)
+                                 .setType(getLogicalType())
                                  .setHasNulls(hasNulls);
   }
 
+  @Override
+  public ColumnFormat getFormat()
+  {
+    return new NestedDataComplexTypeSerde.LegacyNestedColumnFormat();
+  }
+
+  @Override
+  public SortedValueDictionary getSortedValueLookups()
+  {
+    return globalDictionary.getSortedCollector();
+  }
+
+  @Override
+  public void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
+  {
+    for (Map.Entry<String, FieldIndexer> entry : fieldIndexers.entrySet()) {
+      // skip adding the field if no types are in the set, meaning only null values have been processed
+      if (!entry.getValue().getTypes().isEmpty()) {
+        mergedFields.put(entry.getKey(), entry.getValue().getTypes());
+      }
+    }
+  }

Review Comment:
   So, are the semantics of this method that it's adding the fields in 
preparation for data to come in?  I.e. it will have added all of these fields, 
but at the end of the method the Indexer is in a transitional state, where it's 
added the knowledge of the fields, but still has never seen any data with the 
new fields, right?  Mostly just trying to make sure I'm understanding.



##########
processing/src/main/java/org/apache/druid/segment/StandardTypeColumnSchema.java:
##########
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.nested.StructuredData;
+
+/**
+ * Common {@link DimensionSchema} for ingestion of 'standard' Druid built-in {@link ColumnType} datatypes.
+ *
+ * Automatically determines the most appropriate type for the given input data, able to produce columns of type:
+ *    {@link ColumnType#STRING}
+ *    {@link ColumnType#STRING_ARRAY}
+ *    {@link ColumnType#LONG}
+ *    {@link ColumnType#LONG_ARRAY}
+ *    {@link ColumnType#DOUBLE}
+ *    {@link ColumnType#DOUBLE_ARRAY}
+ *    {@link ColumnType#NESTED_DATA}
+ *
+ * and includes bitmap value set indexes. Input of mixed type will be stored as {@link ColumnType#NESTED_DATA}.
+ *
+ * @see StandardTypeColumnIndexer
+ * @see StandardTypeColumnMerger
+ * @see org.apache.druid.segment.serde.StandardTypeColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardArrayColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardDoubleColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardLongColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardNestedColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardStringColumnSerializer
+ * @see org.apache.druid.segment.serde.StandardTypeColumnPartSerde
+ * @see org.apache.druid.segment.column.StandardTypeColumn
+ */
+public class StandardTypeColumnSchema extends DimensionSchema
+{
+  public static final String TYPE = "standard";

Review Comment:
   Okay, I now understand this as the "well, we are using the nested column to 
identify singularly-typed columns now, so how about we call that standard" 
object.  Let's not name it that way, let's just name it `nestedv5`.  In some 
future, if/when we come up with some other different, better way to auto-detect 
and deal with all of the columns (or just another evolution that requires us to 
come up with a new name for compatibility reasons), it will be much better to 
have it named based on what it's doing rather than how the code believes it is 
using the thing at the time.



##########
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java:
##########
@@ -235,6 +237,18 @@ public int getUnsortedEncodedKeyComponentHashCode(int[] key)
     return Arrays.hashCode(key);
   }
 
+  @Override
+  public SortedValueDictionary getSortedValueLookups()
+  {
+    return new SortedValueDictionary(
+        getSortedIndexedValues(),
+        Indexed.empty(),
+        Indexed.empty(),
+        Indexed.empty(),
+        null
+    );
+  }

Review Comment:
   It seems really weird that this is returning a thing with all of these 
Indexed things.  It's leaking implementation details of the nested column onto 
columns that have absolutely nothing to do with the nested column.  It's 
important that a column is a world unto itself and that we don't leak 
implementation specifics across the columns.
   
   I think we need to kill this method and find another way of doing things.  I 
offered a solution in another comment.  Another option would be for the nested 
column itself to know what a `StringDimensionIndexer` is and when it merges 
itself with a `StringDimensionIndexer`, it just knows how to get the relevant 
information that it needs.



##########
processing/src/main/java/org/apache/druid/segment/column/CachingStringDictionaryEncodedColumn.java:
##########
@@ -54,7 +54,7 @@
 /**
  *
  */
-public class StringDictionaryEncodedColumn implements DictionaryEncodedColumn<String>
+public class CachingStringDictionaryEncodedColumn implements DictionaryEncodedColumn<String>

Review Comment:
   Why the rename?



##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -73,7 +75,7 @@ public static <T> Supplier<FixedIndexed<T>> read(ByteBuffer bb, TypeStrategy<T>
         valuesOffset
     );
 
-    bb.position(buffer.position() + (width * size));
+    bb.position(buffer.position() + (width * (hasNull ? size - 1 : size)));

Review Comment:
   It's unclear to me how this code can be correct in this PR without the currently running code (which didn't do the `- 1` when nulls were present) having been buggy, which makes me wonder whether this change is introducing an off-by-one somewhere in this PR.



##########
processing/src/main/java/org/apache/druid/segment/column/ColumnFormat.java:
##########
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.segment.DimensionHandler;
+
+import javax.annotation.Nullable;
+
+public interface ColumnFormat
+{
+  ColumnType getLogicalType();
+
+  ColumnCapabilities toColumnCapabilities();
+
+  DimensionHandler getColumnHandler(String columnName);
+
+  DimensionSchema getColumnSchema(String columnName);

Review Comment:
   Why do these take a columnName argument?  Is it because `DimensionHandler` 
and `DimensionSchema` have getName methods and this is basically just passing 
it through to those?  
   
   Or, is a `ColumnFormat` object intended to be a factory or have a reference 
to the underlying source of columns and be reaching into that to create 
different instances based on different column names provided?  If this latter 
one, what is expected when you ask for a column that isn't of the specific 
concrete format class that you are dealing with?
   
   Either way, please add javadoc on this interface explaining it.



##########
processing/src/main/java/org/apache/druid/segment/data/FixedIndexed.java:
##########
@@ -168,6 +170,43 @@ public Iterator<T> iterator()
     return IndexedIterable.create(this).iterator();
   }
 
+
+
+  public IntIntPair getRange(
+      @Nullable T startValue,
+      boolean startStrict,
+      @Nullable T endValue,
+      boolean endStrict
+  )
+  {
+    final int firstValue = hasNull ? 1 : 0;
+    int startIndex, endIndex;
+    if (startValue == null) {
+      startIndex = firstValue;
+    } else {
+      final int found = indexOf(startValue);
+      if (found >= firstValue) {
+        startIndex = startStrict ? found + 1 : found;
+      } else {
+        startIndex = -(found + 1);

Review Comment:
   It's unclear to me what the intention of this method is.  Is it trying to find offsets that could be used by, e.g., a ranged filter to see what to include or not?  If so, having this return negative values seems weird to me.  I'm not sure what other purpose it could have, though, so I'm also not succeeding in coming up with a case where I'd want the negative values...



##########
processing/src/main/java/org/apache/druid/segment/column/CapabilitiesBasedFormat.java:
##########
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.segment.column;
+
+import org.apache.druid.data.input.impl.DimensionSchema;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.segment.DimensionHandler;
+import org.apache.druid.segment.DimensionHandlerUtils;
+
+import javax.annotation.Nullable;
+import java.util.Objects;
+
+public class CapabilitiesBasedFormat implements ColumnFormat
+{
+  // merge logic for the state capabilities will be in after incremental index is persisted
+  public static final ColumnCapabilities.CoercionLogic DIMENSION_CAPABILITY_MERGE_LOGIC =
+      new ColumnCapabilities.CoercionLogic()
+      {
+        @Override
+        public boolean dictionaryEncoded()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean dictionaryValuesSorted()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean dictionaryValuesUnique()
+        {
+          return true;
+        }
+
+        @Override
+        public boolean multipleValues()
+        {
+          return false;
+        }
+
+        @Override
+        public boolean hasNulls()
+        {
+          return false;
+        }
+      };
+  private final ColumnCapabilities capabilities;
+
+  public CapabilitiesBasedFormat(ColumnCapabilities capabilities)
+  {
+    this.capabilities = capabilities;
+  }
+
+  @Override
+  public DimensionHandler getColumnHandler(String columnName)
+  {
+    return DimensionHandlerUtils.getHandlerFromCapabilities(columnName, capabilities, null);
+  }
+
+  @Override
+  public DimensionSchema getColumnSchema(String columnName)
+  {
+    return getColumnHandler(columnName).getDimensionSchema(capabilities);
+  }
+
+  @Override
+  public ColumnFormat merge(@Nullable ColumnFormat otherFormat)
+  {
+    if (otherFormat == null) {
+      return this;
+    }
+
+    ColumnCapabilitiesImpl merged = ColumnCapabilitiesImpl.copyOf(this.toColumnCapabilities());
+    ColumnCapabilitiesImpl otherSnapshot = ColumnCapabilitiesImpl.copyOf(otherFormat.toColumnCapabilities());
+
+    if (!Objects.equals(merged.getType(), otherSnapshot.getType())
+        || !Objects.equals(merged.getElementType(), otherSnapshot.getElementType())) {
+      final String mergedType = merged.getType() == null ? null : merged.asTypeString();
+      final String otherType = otherSnapshot.getType() == null ? null : otherSnapshot.asTypeString();
+      throw new ISE(
+          "Cannot merge columns of type[%s] and [%s]",
+          mergedType,
+          otherType
+      );
+    } else if (!Objects.equals(merged.getComplexTypeName(), otherSnapshot.getComplexTypeName())) {
+      throw new ISE(
+          "Cannot merge columns of type[%s] and [%s]",
+          merged.getComplexTypeName(),
+          otherSnapshot.getComplexTypeName()
+      );
+    }
+
+    merged.setDictionaryEncoded(merged.isDictionaryEncoded().or(otherSnapshot.isDictionaryEncoded()).isTrue());
+    merged.setHasMultipleValues(merged.hasMultipleValues().or(otherSnapshot.hasMultipleValues()).isTrue());
+    merged.setDictionaryValuesSorted(
+        merged.areDictionaryValuesSorted().and(otherSnapshot.areDictionaryValuesSorted()).isTrue()
+    );
+    merged.setDictionaryValuesUnique(
+        merged.areDictionaryValuesUnique().and(otherSnapshot.areDictionaryValuesUnique()).isTrue()
+    );
+    merged.setHasNulls(merged.hasNulls().or(otherSnapshot.hasNulls()).isTrue());
+    // When merging persisted queryableIndexes in the same ingestion job,
+    // all queryableIndexes should have the exact same hasBitmapIndexes flag set which is set in the ingestionSpec.
+    // One exception is null-only columns as they always do NOT have bitmap indexes no matter whether the flag is set
+    // in the ingestionSpec. As a result, the mismatch checked in the if clause below can happen
+    // when one of the columnCapability is from a real column and another is from a null-only column.
+    // See NullColumnPartSerde for how columnCapability is created for null-only columns.
+    // When the mismatch is found, we prefer the flag set in the ingestionSpec over
+    // the columnCapability of null-only columns.
+    if (merged.hasBitmapIndexes() != otherSnapshot.hasBitmapIndexes()) {
+      merged.setHasBitmapIndexes(false);
+    }

Review Comment:
   This code and comment seem weird to me.  If I'm understanding it correctly: if the column is ever null, then as it gets recompacted, the fact that the null column doesn't have indexes will end up infecting the compaction and causing us not to index the fields, because on any mismatch we forcibly turn indexing off?
   
   Why wouldn't we force it to be `true` if any of them is `true`, and only push it to `false` if all of them are `false`?
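The two merge rules being contrasted can be stated as simple boolean functions; a sketch under that reading (hypothetical helper names, not Druid code):

```java
// Contrast of the two merge rules for hasBitmapIndexes (hypothetical helpers,
// not Druid code): the code under review disables indexes on any mismatch,
// while the suggested rule keeps them if any input column has them.
public class BitmapMergeDemo
{
  // behavior under review: a mismatch forcibly turns indexing off
  public static boolean mergeCurrent(boolean a, boolean b)
  {
    return a == b && a;
  }

  // suggested behavior: true if any input is true, false only if all are false
  public static boolean mergeSuggested(boolean a, boolean b)
  {
    return a || b;
  }

  public static void main(String[] args)
  {
    // a real indexed column merged with a null-only (unindexed) column
    if (mergeCurrent(true, false)) throw new AssertionError();
    if (!mergeSuggested(true, false)) throw new AssertionError();
    if (mergeSuggested(false, false)) throw new AssertionError();
    System.out.println("ok");
  }
}
```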



##########
processing/src/main/java/org/apache/druid/segment/data/GenericIndexedWriter.java:
##########
@@ -301,6 +301,12 @@ public T get(int index) throws IOException
     return strategy.fromByteBuffer(bb, valueSize);
   }
 
+  @Override
+  public int getCardinality()
+  {
+    return numWritten;
+  }
+

Review Comment:
   `numWritten` is not necessarily equivalent to the cardinality.  A `GenericIndexedWriter` could absolutely write the same exact value out multiple times, meaning that the cardinality is strictly less than the number of things written.  I'm not sure how important this distinction is, though.  And, if this is trying to report the number of things written, then perhaps `size()` or `numRows()` or something like that would be a better name?
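The distinction can be shown with plain collections standing in for the writer (illustrative only):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Shows why the number of values written is only an upper bound on the
// cardinality: duplicate writes count toward numWritten but not toward the
// number of distinct values. Plain collections stand in for the writer here.
public class CardinalityDemo
{
  public static int cardinality(List<String> written)
  {
    return new HashSet<>(written).size();
  }

  public static void main(String[] args)
  {
    List<String> written = Arrays.asList("a", "b", "a");  // "a" written twice
    int numWritten = written.size();        // 3 values written
    int distinct = cardinality(written);    // only 2 distinct values
    if (numWritten <= distinct) throw new AssertionError();
    System.out.println(numWritten + " written, cardinality " + distinct);
  }
}
```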



##########
processing/src/main/java/org/apache/druid/segment/serde/ColumnPartSerde.java:
##########
@@ -53,6 +54,6 @@
 
   interface Deserializer
   {
-    void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
+    void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);

Review Comment:
   Let's not add the extra parameter here.



##########
processing/src/main/java/org/apache/druid/segment/serde/ColumnPartSerde.java:
##########
@@ -39,7 +39,8 @@
    @JsonSubTypes.Type(name = "floatV2", value = FloatNumericColumnPartSerdeV2.class),
    @JsonSubTypes.Type(name = "longV2", value = LongNumericColumnPartSerdeV2.class),
    @JsonSubTypes.Type(name = "doubleV2", value = DoubleNumericColumnPartSerdeV2.class),
-    @JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class)
+    @JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class),
+    @JsonSubTypes.Type(name = "standard", value = StandardTypeColumnPartSerde.class)

Review Comment:
   What are the semantics the standard column part serde is trying to provide?  If it's standard, does it subsume all of the others?



##########
processing/src/main/java/org/apache/druid/segment/serde/DictionaryEncodedColumnPartSerde.java:
##########
@@ -294,7 +294,7 @@ public Deserializer getDeserializer()
     return new Deserializer()
     {
       @Override
-      public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
+      public void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)

Review Comment:
   Once we remove the argument from `ComplexMetricSerde`, we don't need it here anymore either.



##########
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndex.java:
##########
@@ -592,11 +580,7 @@ IncrementalIndexRowResult toIncrementalIndexRow(InputRow row)
           wasNewDim = true;
           final DimensionHandler<?, ?, ?> handler;
           if (useSchemaDiscovery) {
-            handler = DimensionHandlerUtils.getHandlerFromCapabilities(
-                dimension,
-                makeDefaultCapabilitiesFromValueType(NestedDataComplexTypeSerde.TYPE),
-                null
-            );
+            handler = new StandardTypeColumnHandler(dimension);

Review Comment:
   Perhaps overkill for now, but I think that we can make IncrementalIndex take a factory (either a specific Factory-style class or a `Function<String, DimensionHandler>`) which is injected from Guice and used whenever a new column is discovered.  This will generally make our code cleaner as well, as it'll leave a lot less temptation to assume that there's some "standard" implementation.
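A minimal sketch of the injected-factory idea using `Function<String, DimensionHandler>` (all types and names below are illustrative stand-ins, not the actual Druid classes):

```java
import java.util.function.Function;

// Sketch of the injected-factory idea: the handler for a newly discovered
// column comes from a Function bound elsewhere (e.g. via Guice) instead of
// being hard-coded. All types here are illustrative stand-ins.
interface Handler
{
  String getName();
}

class StandardHandler implements Handler
{
  private final String dimension;

  StandardHandler(String dimension)
  {
    this.dimension = dimension;
  }

  @Override
  public String getName()
  {
    return dimension;
  }
}

class IncrementalIndexSketch
{
  private final Function<String, Handler> handlerFactory;

  IncrementalIndexSketch(Function<String, Handler> handlerFactory)
  {
    this.handlerFactory = handlerFactory;
  }

  // called when schema discovery encounters a new dimension
  Handler onNewDimension(String dimension)
  {
    return handlerFactory.apply(dimension);
  }

  public static void main(String[] args)
  {
    IncrementalIndexSketch index = new IncrementalIndexSketch(StandardHandler::new);
    if (!"price".equals(index.onNewDimension("price").getName())) {
      throw new AssertionError();
    }
    System.out.println("ok");
  }
}
```

Swapping `StandardHandler::new` for any other factory changes the discovery behavior without touching the index itself, which is the point of the suggestion.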



##########
processing/src/test/java/org/apache/druid/segment/NestedDataColumnIndexerTest.java:
##########
@@ -94,34 +94,34 @@ public void testKeySizeEstimation()
 
     // new raw value, new fields
    key = indexer.processRowValsToUnsortedEncodedKeyComponent(ImmutableList.of(1L, 2L, 10L), false);
-    Assert.assertEquals(168, key.getEffectiveSizeBytes());
-    Assert.assertEquals(6, indexer.getCardinality());
+    Assert.assertEquals(276, key.getEffectiveSizeBytes());
+    Assert.assertEquals(5, indexer.getCardinality());

Review Comment:
   Is the difference in numbers because the test was validating v5, and now it's validating v4?



##########
processing/src/main/java/org/apache/druid/segment/serde/ComplexMetricSerde.java:
##########
@@ -51,6 +51,21 @@
    * @param builder ColumnBuilder to add the column to
    * @param columnConfig ColumnConfiguration used during deserialization
    */
+  public void deserializeColumn(
+      @SuppressWarnings("unused") String columnName,
+      ByteBuffer buffer,
+      ColumnBuilder builder,
+      ColumnConfig columnConfig
+  )
+  {
+    deserializeColumn(buffer, builder, columnConfig);
+  }

Review Comment:
   I don't believe that we need to add this columnName parameter.  So far, the serde code has generally kept the name disassociated from the actual storage of the column, as the name of the column is actually unimportant for the ability to serialize and deserialize the bytes.
   
   It looks like this is being done because the column name is being used as a prefix on the other names in the file smoosher.  That makes sense, but at this point it's not a "column name" as much as a "unique prefix".  Given that it is a prefix that we expect this column to use on all of its things, I think it makes sense to serialize out the unique prefix as part of the bytes of the column itself and then read it back in from there.  Let it be a coincidence that the unique prefix just so happens to be the same thing as the column name.
   
   This is pretty similar to how it was already working in the "older" versions, where it got the information by deserializing a metadata object...
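The suggestion could look roughly like this: length-prefix the unique prefix into the column's own bytes and recover it at read time, so the deserializer needs no `columnName` parameter. A hypothetical layout, not Druid's actual format:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout for the suggestion: the unique prefix is written into
// the column bytes (length-prefixed UTF-8) and recovered on read, so the
// deserializer needs no columnName parameter. Not Druid's actual format.
public class PrefixSerdeDemo
{
  public static ByteBuffer writeColumn(String uniquePrefix, byte[] payload)
  {
    byte[] prefixBytes = uniquePrefix.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + prefixBytes.length + payload.length);
    buffer.putInt(prefixBytes.length);  // length of the prefix string
    buffer.put(prefixBytes);            // the prefix itself
    buffer.put(payload);                // the rest of the column bytes
    buffer.flip();
    return buffer;
  }

  public static String readPrefix(ByteBuffer buffer)
  {
    byte[] prefixBytes = new byte[buffer.getInt()];
    buffer.get(prefixBytes);
    return new String(prefixBytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    ByteBuffer column = writeColumn("myColumn", new byte[]{1, 2, 3});
    if (!"myColumn".equals(readPrefix(column))) throw new AssertionError();
    System.out.println("ok");
  }
}
```

After `readPrefix` returns, the buffer's position sits at the start of the payload, so the rest of deserialization proceeds as before; the prefix just happens to equal the column name at write time.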


