Wail Y. Alkowaileet created ASTERIXDB-2815:
----------------------------------------------
Summary: DISTINCT and multiway join in a subquery computes wrong
result
Key: ASTERIXDB-2815
URL: https://issues.apache.org/jira/browse/ASTERIXDB-2815
Project: Apache AsterixDB
Issue Type: Bug
Reporter: Wail Y. Alkowaileet
Attachments: correct_plan, incorrect_plan
Summary:
The issue happens when you have multiway join queries in a subquery with
DISTINCT. The DISTINCT values might be not computed correctly (see details
below). However, changing the join order fixes the issue.
DDL:
{code:sql}
DROP DATAVERSE checkedtweets_schemaless IF EXISTS;
CREATE DATAVERSE checkedtweets_schemaless;
USE checkedtweets_schemaless;
// Create the data type for all our datasets.
CREATE TYPE GenericType AS {
_id: uuid
};
// Create our datasets.
CREATE DATASET RawTweet(GenericType) PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET User(GenericType) PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Evidence(GenericType) PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET EvidenceFrom(GenericType) PRIMARY KEY _id AUTOGENERATED;
CREATE DATASET Verification(GenericType) PRIMARY KEY _id AUTOGENERATED;
{code}
Query:
{code:sql}
USE checkedtweets_schemaless;
//To make the plan easier to read
set `compiler.sort.parallel` "false";
SELECT t.id, urls
FROM RawTweet t
LET urls = (SELECT DISTINCT VALUE e.url -- the issue is here
FROM Verification v, v.evidence ve, Evidence e
WHERE t.id = v.tweet_id
AND ve = e.ev_id
)
WHERE array_count(urls) > 5;
{code}
Part of the plan that's not correct (see attached for the full plan):
{code:java}
distinct ([$$63])
-- PRE_SORTED_DISTINCT_BY |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
order (ASC, $$63)
-- STABLE_SORT [$$63(ASC)] |PARTITIONED|
exchange
{code}
The issue is *DISTINCT url* is computed globally not locally for each 'group'
Changing the order of the join would produce the correct plan:
{code:sql}
USE checkedtweets_schemaless;
//To make the plan easier to read
set `compiler.sort.parallel` "false";
SELECT t.id, urls
FROM RawTweet t
LET urls = (SELECT DISTINCT VALUE e.url -- the issue is here
FROM Evidence e, Verification v, v.evidence ve
WHERE t.id = v.tweet_id
AND ve = e.ev_id
)
WHERE array_count(urls) > 5;
{code}
Correct plan:
{code:java}
group by ([$$79 := $$69]) decor ([$$73]) {
aggregate [$$63, $$72] <- [listify($$62),
agg-sql-count($$62)]
-- AGGREGATE |LOCAL|
distinct ([$$62])
-- MICRO_PRE_SORTED_DISTINCT_BY |LOCAL|
order (ASC, $$62)
-- MICRO_STABLE_SORT [$$62(ASC)] |LOCAL|
select (not(is-missing($$78)))
-- STREAM_SELECT |LOCAL|
nested tuple source
-- NESTED_TUPLE_SOURCE |LOCAL|
}
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)