Mehnaz Tabassum Mahin created ASTERIXDB-3470:
------------------------------------------------
Summary: CBO estimates incorrect join cardinality with IMDb
datasets
Key: ASTERIXDB-3470
URL: https://issues.apache.org/jira/browse/ASTERIXDB-3470
Project: Apache AsterixDB
Issue Type: Bug
Components: COMP - Compiler
Reporter: Mehnaz Tabassum Mahin
Attachments: join_queries.txt, load_and_analyze.txt, schema.txt
The join cardinalities estimated for the IMDb datasets are *much overestimated*,
and the inaccuracy grows as the sample size increases.
IMDb dataset: http://homepages.cwi.nl/~boncz/job/imdb.tgz
(Please change the corresponding CSV file name to "keywords.csv".)
The attachments contain two join queries and the relevant DDL statements.
Each query joins 10 IMDb datasets; the two queries differ in their selectivity
predicates.
# The actual cardinality of the first join query is {*}1298{*}, whereas the
estimated cardinalities are:
** with "low" sample: 7.489 x 10{^}10{^}
** with "medium" sample: 5.619 x 10{^}12{^}
** with "high" sample: 6.022 x 10{^}12{^}
# The actual cardinality of the second join query is {*}1062{*}, whereas the
estimated cardinalities are:
** with "low" sample: 4.33 x 10{^}8{^}
** with "medium" sample: 1.479 x 10{^}11{^}
** with "high" sample: 1.93 x 10{^}11{^}
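
To put the gap in perspective, each estimate above can be divided by the actual cardinality. The following is a minimal sketch and not part of the attached files: the helper names q_error and overestimation_factors and the dictionary layout are assumptions introduced here for illustration, and the figures are copied from the list above.

```python
# Actual and estimated cardinalities, copied from the report above.
ACTUAL = {"query 1": 1298, "query 2": 1062}
ESTIMATES = {
    "query 1": {"low": 7.489e10, "medium": 5.619e12, "high": 6.022e12},
    "query 2": {"low": 4.33e8, "medium": 1.479e11, "high": 1.93e11},
}


def q_error(estimated, actual):
    """Ratio between an estimate and the true value, always >= 1.

    Hypothetical helper for illustration; not an AsterixDB API.
    """
    return max(estimated / actual, actual / estimated)


def overestimation_factors():
    """Return the q-error for every (query, sample size) pair."""
    return {
        query: {
            sample: q_error(estimate, ACTUAL[query])
            for sample, estimate in by_sample.items()
        }
        for query, by_sample in ESTIMATES.items()
    }


if __name__ == "__main__":
    for query, by_sample in overestimation_factors().items():
        for sample, factor in by_sample.items():
            print(f"{query}, {sample!r} sample: off by a factor of {factor:.3e}")
```

With these figures the estimates are off by factors ranging from roughly 4 x 10^5 (second query, "low" sample) to more than 4 x 10^9 (first query, "high" sample), and for both queries the factor grows from "low" to "medium" to "high".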
--
This message was sent by Atlassian Jira
(v8.20.10#820010)