[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

GitBox Tue, 26 May 2020 21:13:25 -0700


clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r430688872




##########
File path: 
processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilities.java
##########
@@ -57,6 +50,21 @@ public boolean isTrue()
       return this == TRUE;
     }
 
+    public boolean isMaybeTrue()
+    {
+      return isTrue() || isUnknown();
+    }
+
+    public boolean isUnknown()
+    {
+      return this == UNKNOWN;
+    }
+
+    public Capable complete(boolean convertUnknownToTrue)

Review comment:
       I don't think it needs another enum, if anything it could just accept a 
`Capable` to use for the default value, though I should enforce that it cannot 
be unknown. If i rename to `coerceIfUnknown` or something maybe it would also 
be more obvious what the purpose is.

##########
File path: 
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, 
IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null 
values for a row can potentially be
-    // represented by an empty array; see 
StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in 
index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column 
or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time 
construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = 
index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = 
index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == 
ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = 
ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), 
false);

Review comment:
       Ok, given the race condition I guess I have to revert this behavior so 
queries use the fake capabilities and to fix the bug for segment metadata based 
schema discovery to plumb special the actual capabilities to segment metadata 
queries. Maybe longer term it would be better if we could snapshot at the time 
we query and make a cursor to only read the rows that were available at the 
time the capabilities were fetched so that we can have more performant queries 
on incremental indexes at the cost of missing a few rows.

##########
File path: 
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, 
IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null 
values for a row can potentially be
-    // represented by an empty array; see 
StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in 
index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column 
or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time 
construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = 
index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = 
index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == 
ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = 
ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), 
false);

Review comment:
       Ok, given the race condition I guess I have to revert this behavior so 
queries use the fake capabilities and to fix the bug for segment metadata based 
schema discovery to plumb special the actual capabilities to segment metadata 
queries. Maybe longer term it would be better if we could snapshot the 
capabilities at the time we query and make a cursor to only read the rows that 
were available at the time the capabilities were fetched and give those to the 
string indexer so that we can have more performant queries on incremental 
indexes at the cost of missing a few rows.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Reply via email to