jihoonson commented on a change in pull request #10288:
URL: https://github.com/apache/druid/pull/10288#discussion_r493080647
##########
File path: docs/ingestion/index.md
##########
@@ -90,7 +90,7 @@ This table compares the three available options:
| **Input locations** | Any [`inputSource`](./native-batch.md#input-sources). | Any Hadoop FileSystem or Druid datasource. | Any [`inputSource`](./native-batch.md#input-sources). |
| **File formats** | Any [`inputFormat`](./data-formats.md#input-format). | Any Hadoop InputFormat. | Any [`inputFormat`](./data-formats.md#input-format). |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). | Always perfect. | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig). |
-| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. |
+| **Partitioning options** | Dynamic, hash-based, and range-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec) for details. | Hash-based or range-based partitioning via [`partitionsSpec`](hadoop.md#partitionsspec). | Dynamic and hash-based partitioning methods are available. See [Partitions Spec](./native-batch.md#partitionsspec-1) for details. |
Review comment:
No, the anchor here was wrong before, since `native-batch.md#partitionsspec` is the partitionsSpec for the parallel task. It should be https://druid.apache.org/docs/latest/ingestion/native-batch.html#partitionsspec-1, which is the one for the simple task.
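For context, the two anchors point at the parallel-task and simple-task `partitionsSpec` sections of the native batch docs. A hashed spec looks roughly like the following sketch; the field names follow the Druid docs, but the shard count and dimension names are invented for illustration:

```json
{
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 4,
    "partitionDimensions": ["country", "city"],
    "partitionFunction": "murmur3_32_abs"
  }
}
```

Per the comparison table above, `forceGuaranteedRollup` must be set to true in the `tuningConfig` for this spec to take effect.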
##########
File path: docs/ingestion/native-batch.md
##########
@@ -260,7 +260,7 @@ The three `partitionsSpec` types have different characteristics.
| PartitionsSpec | Ingestion speed | Partitioning method | Supported rollup mode | Segment pruning at query time |
|----------------|-----------------|---------------------|-----------------------|-------------------------------|
| `dynamic` | Fastest | Partitioning based on number of rows in segment. | Best-effort rollup | N/A |
-| `hashed` | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing. |
+| `hashed` | Moderate | Partitioning based on the hash value of partition dimensions. This partitioning may reduce your datasource size and query latency by improving data locality. See [Partitioning](./index.md#partitioning) for more details. | Perfect rollup | The broker can use the partition information to prune segments early to speed up queries if `partitionDimensions` is explicitly specified during ingestion. Since the broker knows how to hash `partitionDimensions` values to locate a segment, given a query including a filter on all the `partitionDimensions`, the broker can pick up only the segments holding the rows satisfying the filter on `partitionDimensions` for query processing.<br/><br/>Note that `partitionDimensions` and `partitionFunction` must be set to enable segment pruning. |
Review comment:
You're right. `partitionFunction` cannot be null anymore. Rephrased the
description.
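The pruning rule described in the row above can be sketched in stand-alone Java. This is illustrative only, not Druid's actual classes: `bucketOf` is a stand-in for `HashPartitionFunction.hash` applied to the serialized group key (the hash itself is a made-up placeholder), and `possiblyInSegment` mirrors the broker-side decision of whether a segment can be skipped.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class PruningSketch {
  // Stand-in for hashing the serialized group key. Not Druid's hash; the one
  // shared property is the final reduction to a bucket via Math.abs(hash % numBuckets).
  public static int bucketOf(List<String> partitionDimensions, Map<String, String> filterValues, int numBuckets) {
    int hash = 0;
    for (String dim : partitionDimensions) {
      hash = 31 * hash + Objects.hashCode(filterValues.get(dim)); // illustrative hash only
    }
    return Math.abs(hash % numBuckets);
  }

  // A segment can be pruned only when the query filters on ALL partition dimensions;
  // then exactly the segments whose bucketId matches the filter's bucket are kept.
  public static boolean possiblyInSegment(int bucketId, int numBuckets, List<String> partitionDimensions, Map<String, String> filterValues) {
    if (!filterValues.keySet().containsAll(partitionDimensions)) {
      return true; // cannot prune: the filter does not cover every partition dimension
    }
    return bucketOf(partitionDimensions, filterValues, numBuckets) == bucketId;
  }

  public static void main(String[] args) {
    List<String> dims = List.of("country", "city");
    Map<String, String> filter = Map.of("country", "US", "city", "SF");
    int numBuckets = 4;
    // Exactly one of the numBuckets buckets can match, so the rest are pruned.
    for (int b = 0; b < numBuckets; b++) {
      System.out.println("bucket " + b + " possibly matches: " + possiblyInSegment(b, numBuckets, dims, filter));
    }
  }
}
```

This also shows why the `+` row in the diff requires `partitionDimensions` to be explicitly specified: without knowing which dimensions fed the hash, the broker has to assume every segment may match.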
##########
File path: core/src/main/java/org/apache/druid/timeline/partition/HashBasedNumberedShardSpec.java
##########
@@ -298,8 +211,73 @@ private boolean chunkPossibleInDomain(
     return false;
   }
-  private static int getBucketIndex(int hash, int numBuckets)
+  /**
+   * Check if the current segment possibly holds records if the values of dimensions in {@link #partitionDimensions}
+   * are of {@code partitionDimensionsValues}
+   *
+   * @param hashPartitionFunction     hash function used to create segments at ingestion time
+   * @param partitionDimensionsValues An instance of values of dimensions in {@link #partitionDimensions}
+   *
+   * @return Whether the current segment possibly holds records for the given values of partition dimensions
+   */
+  private boolean isInChunk(HashPartitionFunction hashPartitionFunction, Map<String, String> partitionDimensionsValues)
   {
-    return Math.abs(hash % numBuckets);
+    assert !partitionDimensions.isEmpty();
+    List<Object> groupKey = Lists.transform(
+        partitionDimensions,
+        o -> Collections.singletonList(partitionDimensionsValues.get(o))
+    );
+    return hashPartitionFunction.hash(serializeGroupKey(jsonMapper, groupKey), numBuckets) == bucketId;
+  }
+
+  /**
+   * Serializes a group key into a byte array. The serialization algorithm can affect hash values of partition keys
+   * since {@link HashPartitionFunction#hash} takes the result of this method as its input. This means, the returned
+   * byte array should be backwards-compatible in cases where we need to modify this method.
Review comment:
Hmm, good point. Do you see some good use cases for using complex
dimensions for partitioning? If you want to benefit from secondary partition
pruning, you will need to have a filter on partition dimensions. I'm not sure
what the filter on complex dimensions would look like.
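The backwards-compatibility concern in the javadoc above can be made concrete with a toy example. Everything here is invented for illustration: `serializeV1`/`serializeV2` are hypothetical stand-ins for `serializeGroupKey` before and after a format change (Druid actually serializes with Jackson), and only `bucketIndex` reflects the real reduction from the removed `getBucketIndex`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class GroupKeySerializationSketch {
  // Hypothetical "old" serialization of a group key.
  public static byte[] serializeV1(List<String> groupKey) {
    return String.join(",", groupKey).getBytes(StandardCharsets.UTF_8);
  }

  // A hypothetical "improved" serialization that changes the bytes.
  public static byte[] serializeV2(List<String> groupKey) {
    return ("[" + String.join(",", groupKey) + "]").getBytes(StandardCharsets.UTF_8);
  }

  // The reduction from the removed getBucketIndex: abs(hash % numBuckets).
  public static int bucketIndex(int hash, int numBuckets) {
    return Math.abs(hash % numBuckets);
  }

  public static void main(String[] args) {
    List<String> key = List.of("US", "SF");
    int numBuckets = 8;
    int b1 = bucketIndex(Arrays.hashCode(serializeV1(key)), numBuckets);
    int b2 = bucketIndex(Arrays.hashCode(serializeV2(key)), numBuckets);
    // If the serialized bytes change, the bucket computed at query time can disagree
    // with the bucket the segment was written with, silently breaking pruning.
    System.out.println("v1 bucket=" + b1 + ", v2 bucket=" + b2);
  }
}
```

Since segments written under the old format keep their old `bucketId`, a broker hashing the new bytes could prune the very segments that hold the matching rows, which is why the serialized form must stay stable across versions.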
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]