jihoonson commented on a change in pull request #10288:
URL: https://github.com/apache/druid/pull/10288#discussion_r492438732



##########
File path: 
core/src/main/java/org/apache/druid/timeline/partition/HashPartitioner.java
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.timeline.partition;
+
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import org.apache.druid.data.input.InputRow;
+import org.apache.druid.data.input.Rows;
+
+import javax.annotation.Nullable;
+import java.util.List;
+
+/**
+ * This class is used for hash partitioning during ingestion. The {@link ShardSpecLookup} returned from
+ * {@link #createHashLookup} is used to determine what hash bucket the given input row will belong to.
+ *
+ * Note: this class must be used only for ingestion. For segment pruning at query time,
+ * {@link HashBasedNumberedShardSpec#partitionFunction} should be used instead.

Review comment:
       Hmm, the reason I added this note was that the partition function could previously be
null, and `HashPartitioner` falls back to a default partition function when it is null, unlike
the query path. So, if it were used in a query, segment pruning could be done based on a wrong
default. However, now that the partition function can never be null in ingestion tasks,
`HashPartitioner` doesn't have to be used only in ingestion, even though it is for now. I just
removed this note.
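
For reference, here is a minimal standalone sketch (not the actual `HashPartitioner` implementation; the class and method names below are hypothetical) of the bucket assignment described above: the values of the partition dimensions form a group key, the key is serialized, and its hash is mapped into the bucket range.

```java
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.hash.Hashing;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HashBucketSketch
{
  private static final ObjectMapper MAPPER = new ObjectMapper();

  /** Returns the bucket id in [0, numBuckets) for a row, keyed on the partition dimensions. */
  static int bucketFor(Map<String, Object> row, List<String> partitionDimensions, int numBuckets)
      throws JsonProcessingException
  {
    // Project the row onto the partition dimensions; rows with equal values for these
    // dimensions must always land in the same bucket.
    final List<Object> groupKey = partitionDimensions.stream()
                                                     .map(row::get)
                                                     .collect(Collectors.toList());
    // Serialize the group key, hash it, and map the hash into the bucket range.
    final byte[] serialized = MAPPER.writeValueAsBytes(groupKey);
    return Math.abs(Hashing.murmur3_32().hashBytes(serialized).asInt() % numBuckets);
  }

  public static void main(String[] args) throws JsonProcessingException
  {
    final Map<String, Object> row = Map.of("country", "KR", "city", "Seoul", "page", "index.html");
    // Prints the same bucket id on every run, which is what makes the assignment usable both
    // at ingestion time and for pruning later.
    System.out.println(bucketFor(row, List.of("country", "city"), 4));
  }
}
```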

##########
File path: docs/querying/query-context.md
##########
@@ -58,6 +58,7 @@ These parameters apply to all query types.
 
 |parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.html#broker) for more details.|
 |parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.|
 |useFilterCNF|`false`| If true, Druid will attempt to convert the query filter to Conjunctive Normal Form (CNF). During query processing, columns can be pre-filtered by intersecting the bitmap indexes of all values that match the eligible filters, often greatly reducing the raw number of rows which need to be scanned. But this effect only happens for the top level filter, or individual clauses of a top level 'and' filter. As such, filters in CNF potentially have a higher chance to utilize a large amount of bitmap indexes on string columns during pre-filtering. However, this setting should be used with great caution, as it can sometimes have a negative effect on performance, and in some cases, the act of computing CNF of a filter can be expensive. We recommend hand tuning your filters to produce an optimal form if possible, or at least verifying through experimentation that using this parameter actually improves your query performance with no ill-effects.|
+|segmentPruning|`true`|Enable segment pruning on the Broker. Segment pruning can be applied only to segments partitioned by hash or range.|

Review comment:
       Good point. I renamed it to `secondaryPartitionPruning` since we might want to add
tertiary partitioning in the future. Also updated the description.
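
For illustration, a hedged sketch of passing the renamed flag per query; the plain `Map` below stands in for the `context` object of a native query, and the flag name follows the comment above.

```java
import java.util.Map;

public class QueryContextExample
{
  public static void main(String[] args)
  {
    // The flag defaults to true, so pruning of hash- or range-partitioned segments happens
    // automatically; setting it to false makes the Broker consider all segments.
    final Map<String, Object> context = Map.of("secondaryPartitionPruning", false);
    System.out.println(context);
  }
}
```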

##########
File path: 
core/src/main/java/org/apache/druid/timeline/partition/HashPartitionFunction.java
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.timeline.partition;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonValue;
+import com.google.common.hash.Hashing;
+import org.apache.druid.java.util.common.StringUtils;
+
+/**
+ * An enum of supported hash partition functions. This enum should be updated when we want to use a new function
+ * for hash partitioning. This function is a part of {@link HashBasedNumberedShardSpec} which is stored
+ * in the metadata store.

Review comment:
       Sounds good. Added.
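
Here is a sketch of what the enum might look like, based on the imports in the hunk above (`JsonCreator`/`JsonValue` for serde, Guava `Hashing` for the function itself); the constant name `MURMUR3_32_ABS` and the exact `hash()` signature are illustrative rather than confirmed by this diff:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonValue;
import com.google.common.hash.Hashing;
import org.apache.druid.java.util.common.StringUtils;

public enum HashPartitionFunction
{
  MURMUR3_32_ABS {
    @Override
    public int hash(byte[] serializedRow, int numBuckets)
    {
      // Hash the serialized group key and map it into [0, numBuckets).
      return Math.abs(Hashing.murmur3_32().hashBytes(serializedRow).asInt() % numBuckets);
    }
  };

  /** Maps a serialized row (group key) to a bucket id in [0, numBuckets). */
  public abstract int hash(byte[] serializedRow, int numBuckets);

  @JsonCreator
  public static HashPartitionFunction fromString(String name)
  {
    // The value is stored lower-case in the metadata store; constants are upper-case.
    return HashPartitionFunction.valueOf(StringUtils.toUpperCase(name));
  }

  @Override
  @JsonValue
  public String toString()
  {
    return StringUtils.toLowerCase(name());
  }
}
```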

##########
File path: 
core/src/main/java/org/apache/druid/timeline/partition/BuildingHashBasedNumberedShardSpec.java
##########
@@ -58,6 +61,7 @@ public BuildingHashBasedNumberedShardSpec(
     this.partitionDimensions = partitionDimensions == null
                                ? HashBasedNumberedShardSpec.DEFAULT_PARTITION_DIMENSIONS
                                : partitionDimensions;
+    this.partitionFunction = partitionFunction;

Review comment:
       Good point. I thought it would break some things at first. But, thinking about it more,
I think having a default in partitionsSpec will be fine even though there are at least two
issues.

   - In auto compaction, the last compaction state of each segment includes the partitionsSpec
of the compaction task that created the segment. Auto compaction searches for all segments
whose last compaction state doesn't match the current configuration. So, for segments compacted
before, the `HashedPartitionsSpec` can have an empty partition function, which could resolve to
a different default if we ever change the default hash function. In that case, auto compaction
could silently ignore the changed hash function. However, this doesn't seem that bad.
   - In rolling upgrades that replace machines of an old version with ones of a new version,
there can be a mix of middleManager or indexer versions. In this case, parallel tasks would
break because each subtask could use a different partition function if we changed the default
partition function between those two versions. I think this is acceptable as long as we
announce it as a known issue in the release notes. (And we are probably not going to change the
default partition function in the near future.)
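
To illustrate the default-resolution concern in a self-contained way (the names
`DEFAULT_HASH_FUNCTION` and `resolvePartitionFunction` are made up for this sketch, not Druid
APIs): a spec stored without an explicit partition function resolves to whatever default the
running version uses, which is why changing the default would silently affect previously stored
specs.

```java
public class PartitionFunctionDefaults
{
  // If this default ever changes, specs that were stored without an explicit partition
  // function would silently start resolving to the new default.
  static final String DEFAULT_HASH_FUNCTION = "murmur3_32_abs";

  /** storedFunction may be null for specs written before the function was recorded. */
  static String resolvePartitionFunction(String storedFunction)
  {
    return storedFunction == null ? DEFAULT_HASH_FUNCTION : storedFunction;
  }

  public static void main(String[] args)
  {
    // Old spec: nothing stored, so the result depends on the running version's default.
    System.out.println(resolvePartitionFunction(null));
    // New spec: the function is stored explicitly, so the result is stable across versions.
    System.out.println(resolvePartitionFunction("murmur3_32_abs"));
  }
}
```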




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
