gianm commented on a change in pull request #10248:
URL: https://github.com/apache/druid/pull/10248#discussion_r468270282



##########
File path: 
processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -73,16 +73,16 @@ private DimensionHandlerUtils()
   )
   {
     if (capabilities == null) {
-      return new StringDimensionHandler(dimensionName, multiValueHandling, 
true);
+      return new StringDimensionHandler(dimensionName, multiValueHandling, 
true, false);
     }
 
     multiValueHandling = multiValueHandling == null ? 
MultiValueHandling.ofDefault() : multiValueHandling;
 
     if (capabilities.getType() == ValueType.STRING) {
-      if (!capabilities.isDictionaryEncoded()) {
+      if (!capabilities.isDictionaryEncoded().isMaybeTrue()) {

Review comment:
       > This is because unknown would only happen from string dimension 
indexers of the incremental index which ultimately always produce string 
columns with a dictionary, and these handlers this method are making are used 
for things like index merging, etc, and so could safely treat unknown from the 
dim indexer as true because it wants the desired state rather than the current 
state.
   
   Thanks for explaining it. This sounds brittle for the following reason:
   
   1. At the call site here, the check really does mean to require a dictionary.
   2. So logically it should treat unknowns as false (to be safe — if it's not 
sure if a dictionary exists, and it requires one, then it should fail the 
check).
   3. But it sounds like there are cases where an unknown will appear, and we 
"know" that it is really meant to be a true. So we need to treat unknowns as OK 
instead of not OK.
   
   But what if a new case appears in a future patch that isn't (3)? Then this 
check is too lax. And because the new case is likely to come from a new call 
site of `getHandlerFromCapabilities`, whoever adds it will probably not realize 
they need to update the check here.
   
   So instead, would it work to keep the check strict, and resolving unknowns 
to trues at the current call sites? Then, if a new caller is added, it'd have 
to deal with the ambiguity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to