Mehnaz Tabassum Mahin created ASTERIXDB-3470:
------------------------------------------------

             Summary: CBO estimates incorrect join cardinality with IMDb datasets
                 Key: ASTERIXDB-3470
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-3470
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: COMP - Compiler
            Reporter: Mehnaz Tabassum Mahin
         Attachments: join_queries.txt, load_and_analyze.txt, schema.txt

The join cardinalities that the CBO estimates for the IMDb datasets are 
*heavily overestimated*, and the error grows as the sample size increases.

IMDb dataset: http://homepages.cwi.nl/~boncz/job/imdb.tgz
(Please change the corresponding CSV file name to "keywords.csv")

The attachments contain two join queries and the relevant DDL statements. 
Each query joins 10 IMDb datasets, and the two queries use different 
selectivity predicates.
 # The actual cardinality of the first join query is *1298*, whereas the 
estimated cardinalities are:
 ** with "low" sample: 7.489 x 10^10
 ** with "medium" sample: 5.619 x 10^12
 ** with "high" sample: 6.022 x 10^12
 # The actual cardinality of the second join query is *1062*, whereas the 
estimated cardinalities are:
 ** with "low" sample: 4.33 x 10^8
 ** with "medium" sample: 1.479 x 10^11
 ** with "high" sample: 1.93 x 10^11
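
To put the numbers above in perspective, the following is a small illustrative sketch that divides each estimate by the actual cardinality. The ratio is only a convenience measure chosen for this illustration, not something AsterixDB reports, and the names `actual`, `estimates`, and `overestimation_factor` are hypothetical.

```python
# Actual and estimated cardinalities copied from the report above.
actual = {"query 1": 1298, "query 2": 1062}
estimates = {
    "query 1": {"low": 7.489e10, "medium": 5.619e12, "high": 6.022e12},
    "query 2": {"low": 4.33e8, "medium": 1.479e11, "high": 1.93e11},
}


def overestimation_factor(estimated, actual_rows):
    """Return how many times larger the estimate is than the actual count."""
    return estimated / actual_rows


# Compute the factor for every query and sample size.
factors = {
    query: {
        sample: overestimation_factor(estimated, actual[query])
        for sample, estimated in by_sample.items()
    }
    for query, by_sample in estimates.items()
}

for query, by_sample in factors.items():
    for sample, factor in by_sample.items():
        print(f"{query}, {sample} sample: estimate is {factor:.3e}x the actual")
```

For both queries the factor rises from the "low" to the "medium" to the "high" sample, which matches the observation that the inaccuracy increases with sample size.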



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
