clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203099360
##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,118 @@ void doInLock(Runnable runnable)
runnable.run();
}
}
+
+
+ /**
+ * ColumnTypeMergePolicy defines the rules for which type to use when faced with the possibility of different types
+ * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+ * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+ * a single type to represent it globally.
+ */
+ @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = LeastRestrictiveTypeMergePolicy.class)
+ @JsonSubTypes(value = {
+     @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+     @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+ })
+ @FunctionalInterface
+ public interface ColumnTypeMergePolicy
+ {
+ ColumnType merge(ColumnType existingType, ColumnType newType);
+ }
+
+ /**
+ * Classic logic: use the first type encountered. This policy is effectively 'newest first' because we iterate
+ * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+ * chosen, at least for systems that are continuously updated with 'current' data.
+ *
+ * Since a {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+ * partially or fully computed by this cache, this merge policy can result in query-time errors when the chosen type
+ * is more restrictive than the types of some segments. If data is likely to vary in type across segments, consider
+ * using {@link LeastRestrictiveTypeMergePolicy} instead.
+ */
+ public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+ {
+ public static final String NAME = "latestInterval";
+ private static final FirstTypeMergePolicy INSTANCE = new FirstTypeMergePolicy();
+
+ @Override
+ public ColumnType merge(ColumnType existingType, ColumnType newType)
+ {
+ if (existingType == null) {
+ return newType;
+ }
+ if (newType == null) {
+ return existingType;
+ }
+ // if either type is nested data, the merged type is nested data
+ if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+ return ColumnType.NESTED_DATA;
+ }
+ // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+ return existingType;
+ }
+
+ @Override
+ public int hashCode()
+ {
+ return Objects.hash(NAME);
+ }
+
+ @Override
+ public boolean equals(Object o)
+ {
+ if (this == o) {
+ return true;
+ }
+ return o != null && getClass() == o.getClass();
+ }
+
+ @Override
+ public String toString()
+ {
+ return NAME;
+ }
+ }
+
+ /**
+ * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+ * can best represent all data contained across all segments.
+ */
+ public static class LeastRestrictiveTypeMergePolicy implements ColumnTypeMergePolicy
+ {
+ public static final String NAME = "leastRestrictive";
+
+ @Override
+ public ColumnType merge(ColumnType existingType, ColumnType newType)
+ {
+ try {
+ return ColumnType.leastRestrictiveType(existingType, newType);
+ }
+ catch (Types.IncompatibleTypeException incompatibleTypeException) {
+ // fall back to the first encountered type if the types are not compatible for some reason
+ return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);
Review Comment:
I wanted to catch instead of exploding here because exploding would interfere with computing the SQL schema, so it seemed better to fall back to just using the type from the most recent time chunk than to blow up.
The specific scenario I was imagining is something like migrating between different sketch types. That is totally incompatible to do in the same column, but instead of blowing up the schema it seemed better to fall back to the existing behavior so that at least the problem can be worked around (instead of breaking more stuff).
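A standalone sketch of that fallback behavior, using toy stand-ins (a `ToyType` enum and a hand-rolled `leastRestrictive`) rather than the real Druid `ColumnType` and `Types.IncompatibleTypeException` classes:

```java
public class MergePolicySketch
{
  // Toy stand-in for Druid's ColumnType; sketch types model the migration scenario
  enum ToyType { STRING, LONG, DOUBLE, NESTED_DATA, THETA_SKETCH, HLL_SKETCH }

  // Toy stand-in for Types.IncompatibleTypeException
  static class IncompatibleTypeException extends RuntimeException {}

  // "first type wins": keep the type from the newest segment seen so far
  static ToyType firstTypeMerge(ToyType existing, ToyType incoming)
  {
    if (existing == null) {
      return incoming;
    }
    if (incoming == null) {
      return existing;
    }
    if (existing == ToyType.NESTED_DATA || incoming == ToyType.NESTED_DATA) {
      return ToyType.NESTED_DATA;
    }
    // existing is from the more recent time chunk
    return existing;
  }

  // Toy least-restrictive resolution: numeric types widen, sketch types are incompatible
  static ToyType leastRestrictive(ToyType a, ToyType b)
  {
    if (a == null) {
      return b;
    }
    if (b == null) {
      return a;
    }
    if (a == b) {
      return a;
    }
    if ((a == ToyType.LONG && b == ToyType.DOUBLE) || (a == ToyType.DOUBLE && b == ToyType.LONG)) {
      return ToyType.DOUBLE;
    }
    if (a == ToyType.THETA_SKETCH || b == ToyType.THETA_SKETCH
        || a == ToyType.HLL_SKETCH || b == ToyType.HLL_SKETCH) {
      // e.g. two different sketch types mixed in one column
      throw new IncompatibleTypeException();
    }
    return ToyType.STRING; // broadest toy type
  }

  // Mirrors the catch in LeastRestrictiveTypeMergePolicy#merge: fall back instead of failing schema computation
  static ToyType mergeWithFallback(ToyType existing, ToyType incoming)
  {
    try {
      return leastRestrictive(existing, incoming);
    }
    catch (IncompatibleTypeException e) {
      return firstTypeMerge(existing, incoming);
    }
  }

  public static void main(String[] args)
  {
    // compatible types widen
    System.out.println(mergeWithFallback(ToyType.LONG, ToyType.DOUBLE));             // DOUBLE
    // incompatible sketch migration: schema stays usable with the type from the newest time chunk
    System.out.println(mergeWithFallback(ToyType.THETA_SKETCH, ToyType.HLL_SKETCH)); // THETA_SKETCH
  }
}
```

The point of the sketch is the `try`/`catch` shape: an incompatible pair degrades to the old "first type wins" answer rather than propagating an exception into schema computation.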
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]