jihoonson commented on a change in pull request #10288:
URL: https://github.com/apache/druid/pull/10288#discussion_r493080647
##########
File path: docs/ingestion/index.md
##########
@@ -90,7 +90,7 @@ This table compares the three available options:
| **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
| **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
Review comment:
No, the anchor here was wrong before, since `native-batch.md#partitionsspec` is the partitionsSpec for the parallel task. It should be https://druid.apache.org/docs/latest/ingestion/native-batch.html#partitionsspec-1, which is the one for the simple task.
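For context, the two anchors point at the parallel-task and simple-task `partitionsSpec` sections of the native batch docs. A hashed spec looks roughly like the following sketch; the field names follow the Druid docs, but the shard count and dimension names are invented for illustration:

```json
{
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 4,
    "partitionDimensions": ["country", "city"],
    "partitionFunction": "murmur3_32_abs"
  }
}
```

Per the comparison table above, `forceGuaranteedRollup` must be set to true in the `tuningConfig` for this spec to take effect.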
##########
File path: docs/ingestion/native-batch.md
##########
@@ -260,7 +260,7 @@ The three `partitionsSpec` types have different characteristics.
| PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Segment pruning at query time |
|----------------|-----------------|---------------------|-----------------------|-------------------------------|
| `dynamic` | Fastest | Partitioning based on number of rows in segment. | Best-effort rollup | N/A |
-| `hashed` | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing. |
+| `hashed` | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` and `partitionFunction` must be set to enable segment pruning. |
Review comment:
You're right. `partitionFunction` cannot be null anymore. Rephrased the
description.
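The pruning rule described in the row above can be sketched in stand-alone Java. This is illustrative only, not Druid's actual classes: `bucketOf` is a stand-in for `HashPartitionFunction.hash` applied to the serialized group key (the hash itself is a made-up placeholder), and `possiblyInSegment` mirrors the broker-side decision of whether a segment can be skipped.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class PruningSketch {
  // Stand-in for hashing the serialized group key. Not Druid's hash; the one
  // shared property is the final reduction to a bucket via Math.abs(hash % numBuckets).
  public static int bucketOf(List<String> partitionDimensions, Map<String, String> filterValues, int numBuckets) {
    int hash = 0;
    for (String dim : partitionDimensions) {
      hash = 31 * hash + Objects.hashCode(filterValues.get(dim)); // illustrative hash only
    }
    return Math.abs(hash % numBuckets);
  }

  // A segment can be pruned only when the query filters on ALL partition dimensions;
  // then exactly the segments whose bucketId matches the filter's bucket are kept.
  public static boolean possiblyInSegment(int bucketId, int numBuckets, List<String> partitionDimensions, Map<String, String> filterValues) {
    if (!filterValues.keySet().containsAll(partitionDimensions)) {
      return true; // cannot prune: the filter does not cover every partition dimension
    }
    return bucketOf(partitionDimensions, filterValues, numBuckets) == bucketId;
  }

  public static void main(String[] args) {
    List<String> dims = List.of("country", "city");
    Map<String, String> filter = Map.of("country", "US", "city", "SF");
    int numBuckets = 4;
    // Exactly one of the numBuckets buckets can match, so the rest are pruned.
    for (int b = 0; b < numBuckets; b++) {
      System.out.println("bucket " + b + " possibly matches: " + possiblyInSegment(b, numBuckets, dims, filter));
    }
  }
}
```

This also shows why the `+` row in the diff requires `partitionDimensions` to be explicitly specified: without knowing which dimensions fed the hash, the broker has to assume every segment may match.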
##########
File path: core/src/main/java/org/apache/druid/timeline/partition/HashBasedNumberedShardSpec.java
##########
@@ -298,8 +211,73 @@ private boolean chunkPossibleInDomain(
     return false;
   }
-  private static int getBucketIndex(int hash, int numBuckets)
+  /**
+   * Check if the current segment possibly holds records if the values of dimensions in {@link #partitionDimensions}
+   * are of {@code partitionDimensionsValues}
+   *
+   * @param hashPartitionFunction     hash function used to create segments at ingestion time
+   * @param partitionDimensionsValues An instance of values of dimensions in {@link #partitionDimensions}
+   *
+   * @return Whether the current segment possibly holds records for the given values of partition dimensions
+   */
+  private boolean isInChunk(HashPartitionFunction hashPartitionFunction, Map<String, String> partitionDimensionsValues)
   {
-    return Math.abs(hash % numBuckets);
+    assert !partitionDimensions.isEmpty();
+    List<Object> groupKey = Lists.transform(
+        partitionDimensions,
+        o -> Collections.singletonList(partitionDimensionsValues.get(o))
+    );
+    return hashPartitionFunction.hash(serializeGroupKey(jsonMapper, groupKey), numBuckets) == bucketId;
+  }
+
+  /**
+   * Serializes a group key into a byte array. The serialization algorithm can affect hash values of partition keys
+   * since {@link HashPartitionFunction#hash} takes the result of this method as its input. This means, the returned
+   * byte array should be backwards-compatible in cases where we need to modify this method.
Review comment:
Hmm, good point. Do you see some good use cases for using complex
dimensions for partitioning? If you want to benefit from secondary partition
pruning, you will need to have a filter on partition dimensions. I'm not sure
what the filter on complex dimensions would look like.
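The backwards-compatibility concern in the javadoc above can be made concrete with a toy example. Everything here is invented for illustration: `serializeV1`/`serializeV2` are hypothetical stand-ins for `serializeGroupKey` before and after a format change (Druid actually serializes with Jackson), and only `bucketIndex` reflects the real reduction from the removed `getBucketIndex`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class GroupKeySerializationSketch {
  // Hypothetical "old" serialization of a group key.
  public static byte[] serializeV1(List<String> groupKey) {
    return String.join(",", groupKey).getBytes(StandardCharsets.UTF_8);
  }

  // A hypothetical "improved" serialization that changes the bytes.
  public static byte[] serializeV2(List<String> groupKey) {
    return ("[" + String.join(",", groupKey) + "]").getBytes(StandardCharsets.UTF_8);
  }

  // The reduction from the removed getBucketIndex: abs(hash % numBuckets).
  public static int bucketIndex(int hash, int numBuckets) {
    return Math.abs(hash % numBuckets);
  }

  public static void main(String[] args) {
    List<String> key = List.of("US", "SF");
    int numBuckets = 8;
    int b1 = bucketIndex(Arrays.hashCode(serializeV1(key)), numBuckets);
    int b2 = bucketIndex(Arrays.hashCode(serializeV2(key)), numBuckets);
    // If the serialized bytes change, the bucket computed at query time can disagree
    // with the bucket the segment was written with, silently breaking pruning.
    System.out.println("v1 bucket=" + b1 + ", v2 bucket=" + b2);
  }
}
```

Since segments written under the old format keep their old `bucketId`, a broker hashing the new bytes could prune the very segments that hold the matching rows, which is why the serialized form must stay stable across versions.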
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]