Ali Alsuliman created ASTERIXDB-3580:
----------------------------------------
Summary: Dataset partitioning property should be hash-partitioned
with partitions map
Key: ASTERIXDB-3580
URL: https://issues.apache.org/jira/browse/ASTERIXDB-3580
Project: Apache AsterixDB
Issue Type: Bug
Reporter: Ali Alsuliman
Assignee: Ali Alsuliman
The dataset partitioning property was changed to “randomly partitioned” as
part of the compute-storage separation work. As a result, unnecessary “hash
exchanges” are sometimes introduced in the plan, because the random
partitioning delivered by the dataset cannot satisfy a hash-partitioning
requirement. The partitioning property delivered by the dataset can instead be
“hash partitioned” with the partitions map. This way, if an operator such as a
join requires hash partitioning with the partitions map, the property
delivered by the dataset satisfies the requirement and no hash exchange is
introduced.
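The reasoning above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the optimizer's actual code: the names PartitioningSketch, Partitioning, Kind, satisfies and exchangeFor are hypothetical, the partitions map is modeled as a plain int[][], and the matching rule is simplified to exact equality of hash columns and maps.

```java
import java.util.Arrays;
import java.util.List;

public class PartitioningSketch {

    /** Hypothetical partitioning kinds; illustrative names only. */
    enum Kind { RANDOM, HASH }

    /** Hypothetical stand-in for a delivered or required partitioning property. */
    static final class Partitioning {
        final Kind kind;
        final List<String> columns;  // hash columns; empty for RANDOM
        final int[][] partitionsMap; // assumed shape of the partitions map; null for RANDOM

        private Partitioning(Kind kind, List<String> columns, int[][] partitionsMap) {
            this.kind = kind;
            this.columns = columns;
            this.partitionsMap = partitionsMap;
        }

        static Partitioning random() {
            return new Partitioning(Kind.RANDOM, List.of(), null);
        }

        static Partitioning hash(List<String> columns, int[][] partitionsMap) {
            return new Partitioning(Kind.HASH, columns, partitionsMap);
        }
    }

    /**
     * Simplified rule (an assumption of this sketch): a hash requirement is met
     * only by a hash-partitioned input on the same columns with the same map.
     */
    static boolean satisfies(Partitioning delivered, Partitioning required) {
        if (required.kind == Kind.RANDOM) {
            return true; // any distribution of the data is acceptable
        }
        return delivered.kind == Kind.HASH
                && delivered.columns.equals(required.columns)
                && Arrays.deepEquals(delivered.partitionsMap, required.partitionsMap);
    }

    /** Names the exchange a planner would place between the two operators. */
    static String exchangeFor(Partitioning delivered, Partitioning required) {
        return satisfies(delivered, required) ? "ONE_TO_ONE_EXCHANGE" : "HASH_PARTITION_EXCHANGE";
    }

    public static void main(String[] args) {
        int[][] map = { {0}, {1}, {2}, {3} };
        Partitioning required = Partitioning.hash(List.of("$$33"), map);

        // Dataset delivers random partitioning: the requirement is not met.
        System.out.println(exchangeFor(Partitioning.random(), required));
        // Dataset delivers hash partitioning with the same map: nothing to repartition.
        System.out.println(exchangeFor(Partitioning.hash(List.of("$$33"), map), required));
    }
}
```

Under these assumptions, the first call prints HASH_PARTITION_EXCHANGE and the second prints ONE_TO_ONE_EXCHANGE, which mirrors the before and after behavior this issue describes.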
For example, the following query:
{code:java}
SELECT VALUE c1 FROM c1
WHERE c1.x NOT IN (
SELECT VALUE c2.x
FROM c2);
{code}
Its plan has a HASH_PARTITION_EXCHANGE [$$33] that is not needed:
{code:java}
distribute result [$$c1]
-- DISTRIBUTE_RESULT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
project ([$$c1])
-- STREAM_PROJECT |PARTITIONED|
select ($$32)
-- STREAM_SELECT |PARTITIONED|
project ([$$32, $$c1])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
group by ([$$37 := $$33]) decor ([$$c1]) {
aggregate [$$32] <- [empty-stream()]
-- AGGREGATE |LOCAL|
select (not(is-missing($$36)))
-- STREAM_SELECT |LOCAL|
nested tuple source
-- NESTED_TUPLE_SOURCE |LOCAL|
}
-- PRE_CLUSTERED_GROUP_BY[$$33] |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
order (ASC, $$33)
-- STABLE_SORT [$$33(ASC)] |PARTITIONED|
exchange
-- HASH_PARTITION_EXCHANGE [$$33] |PARTITIONED|
project ([$$c1, $$36, $$33])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
left outer join (not(if-missing-or-null(neq($$35, $$28), false)))
-- NESTED_LOOP |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
assign [$$35] <- [$$c1.getField("x")]
-- ASSIGN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
data-scan []<-[$$33, $$c1] <- Default.c1
-- DATASOURCE_SCAN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|
exchange
-- BROADCAST_EXCHANGE |PARTITIONED|
project ([$$36, $$28])
-- STREAM_PROJECT |PARTITIONED|
assign [$$36, $$28] <- [true, $$c2.getField("x")]
-- ASSIGN |PARTITIONED|
project ([$$c2])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
data-scan []<-[$$34, $$c2] <- Default.c2 project ({x:any})
-- DATASOURCE_SCAN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|
{code}
After the fix, the HASH_PARTITION_EXCHANGE [$$33] should become a
ONE_TO_ONE_EXCHANGE.