kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754900489
##########
File path:
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
{
public static final String TYPE = "partial_dimension_distribution";
- // Future work: StringDistribution does not handle inserting NULLs. This is
the same behavior as hadoop indexing.
- private static final boolean SKIP_NULL = true;
+ // Do not skip nulls as StringDistribution can handle null values.
+ // This behavior is different from hadoop indexing.
+ private static final boolean SKIP_NULL = false;
Review comment:
I think we should be fine without the flag.
The following effects would be observed on single dim:
1. Partitioning would now also work on a dimension column that is always
null, although it will actually create just one partition.
2. Estimation of partition boundaries will also take into account null
values. So the algorithm would do a better job of estimating the size of the
first (it would be closer to the target rows). The sizes of later partitions
will not be affected.
With the addition of (multi dimension) range partitioning, single dim is
inevitably being affected as it now goes through the multi dim flow itself. So
this would only be another part of that overall effect.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]