[
https://issues.apache.org/jira/browse/ASTERIXDB-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wail Y. Alkowaileet reassigned ASTERIXDB-2815:
----------------------------------------------
Assignee: Wail Y. Alkowaileet
> DISTINCT and multiway join in a subquery computes wrong result
> --------------------------------------------------------------
>
> Key: ASTERIXDB-2815
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-2815
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Wail Y. Alkowaileet
> Assignee: Wail Y. Alkowaileet
> Priority: Major
> Attachments: correct_plan, incorrect_plan
>
>
> Summary:
> The issue happens when you have multiway join queries in a subquery with
> DISTINCT. The DISTINCT values might be not computed correctly (see details
> below). However, changing the join order fixes the issue.
> DDL:
> {code:sql}
> DROP DATAVERSE checkedtweets_schemaless IF EXISTS;
> CREATE DATAVERSE checkedtweets_schemaless;
> USE checkedtweets_schemaless;
> // Create the data type for all our datasets.
> CREATE TYPE GenericType AS {
> _id: uuid
> };
> // Create our datasets.
> CREATE DATASET RawTweet(GenericType) PRIMARY KEY _id AUTOGENERATED;
> CREATE DATASET User(GenericType) PRIMARY KEY _id AUTOGENERATED;
> CREATE DATASET Evidence(GenericType) PRIMARY KEY _id AUTOGENERATED;
> CREATE DATASET EvidenceFrom(GenericType) PRIMARY KEY _id AUTOGENERATED;
> CREATE DATASET Verification(GenericType) PRIMARY KEY _id AUTOGENERATED;
> {code}
>
> Query:
> {code:sql}
> USE checkedtweets_schemaless;
> //To make the plan easier to read
> set `compiler.sort.parallel` "false";
> SELECT t.id, urls
> FROM RawTweet t
> LET urls = (SELECT DISTINCT VALUE e.url -- the issue is here
> FROM Verification v, v.evidence ve, Evidence e
> WHERE t.id = v.tweet_id
> AND ve = e.ev_id
> )
> WHERE array_count(urls) > 5;
> {code}
> Part of the plan that's not correct (see attached for the full plan):
> {code:java}
> distinct ([$$63])
> -- PRE_SORTED_DISTINCT_BY |PARTITIONED|
> exchange
> -- ONE_TO_ONE_EXCHANGE |PARTITIONED|
> order (ASC, $$63)
> -- STABLE_SORT [$$63(ASC)] |PARTITIONED|
> exchange
> {code}
> The issue is *DISTINCT url* is computed globally not locally for each 'group'
>
> Changing the order of the join would produce the correct plan:
> {code:sql}
> USE checkedtweets_schemaless;
> //To make the plan easier to read
> set `compiler.sort.parallel` "false";
> SELECT t.id, urls
> FROM RawTweet t
> LET urls = (SELECT DISTINCT VALUE e.url -- the issue is here
> FROM Evidence e, Verification v, v.evidence ve
> WHERE t.id = v.tweet_id
> AND ve = e.ev_id
> )
> WHERE array_count(urls) > 5;
> {code}
> Correct plan:
> {code:java}
> group by ([$$79 := $$69]) decor ([$$73]) {
> aggregate [$$63, $$72] <- [listify($$62),
> agg-sql-count($$62)]
> -- AGGREGATE |LOCAL|
> distinct ([$$62])
> -- MICRO_PRE_SORTED_DISTINCT_BY |LOCAL|
> order (ASC, $$62)
> -- MICRO_STABLE_SORT [$$62(ASC)] |LOCAL|
> select (not(is-missing($$78)))
> -- STREAM_SELECT |LOCAL|
> nested tuple source
> -- NESTED_TUPLE_SOURCE |LOCAL|
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)