kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754900489



##########
File path: 
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
 {
   public static final String TYPE = "partial_dimension_distribution";
 
-  // Future work: StringDistribution does not handle inserting NULLs. This is 
the same behavior as hadoop indexing.
-  private static final boolean SKIP_NULL = true;
+  // Do not skip nulls as StringDistribution can handle null values.
+  // This behavior is different from hadoop indexing.
+  private static final boolean SKIP_NULL = false;

Review comment:
       I think we should be fine without the flag.
   
   The following effects would be observed on single dim:
   1. Partitioning would now also work on a dimension column that is always 
null, although it will actually create just one partition.
   2. Estimation of partition boundaries will also take into account null 
values. So the algorithm would do a better job of estimating the size of the 
first (it would be closer to the target rows). The sizes of later partitions 
will not be affected.
   
   With the addition of (multi dimension) range partitioning, single dim is 
inevitably being affected as it now goes through the multi dim flow itself. So 
this would only be another part of that overall effect.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to