[
https://issues.apache.org/jira/browse/CALCITE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077184#comment-16077184
]
Julian Hyde commented on CALCITE-1616:
--------------------------------------
I spoke about this work at DataWorks Summit:
https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite. I
tried to explain the algorithm and motivating cases.
> Data profiler
> -------------
>
> Key: CALCITE-1616
> URL: https://issues.apache.org/jira/browse/CALCITE-1616
> Project: Calcite
> Issue Type: Bug
> Reporter: Julian Hyde
> Assignee: Julian Hyde
>
> Profiling looks at a data set and infers characteristics and constraints
> about the data.
> Some applications:
> * it helps users understand their data,
> * inferred constraints may allow additional optimizations (e.g. a foreign key
> allows semi-join removal),
> * column statistics help the optimizer estimate the selectivity of filters
> and joins,
> * joint cardinalities drive the algorithm that chooses which tiles of a
> lattice to materialize.
> Imagine you ran a profiler on a data set of 1 million rows and columns
> [orderId, gender, state, zipcode, productId, productName, brand]. Here is
> some sample output:
> {noformat}
> {"type": "distribution", "columns": [], "cardinality": 1000000)},
> {"type": "distribution", "columns": ["gender"], "cardinality": 2, nullCount:
> 0, values: ["F", "M"])},
> {"type": "distribution", "columns": ["state"], "cardinality": 50)},
> {"type": "distribution", "columns": ["zipcode"], "cardinality": 43000,
> "nullCount": 0)},
> {"type": "distribution", "columns": ["state", "zipcode"], "cardinality":
> 43419)},
> {"type": "unique", "columns": ["orderId"]},
> {"type": "fd", "columns": ["productId"], "depend": ["brand", "productName"]},
> {noformat}
> Note:
> * the cardinality of 0 columns is the count of the data set;
> * "nullCount" and "values" are only present for distributions of 1 column;
> * "nullCount" may be is omitted if 0;
> * "values" is only present if there are N or fewer values
> * "distribution" of 2 or more columns is only output if it is "interesting";
> in the case of ["state", "zipcode"] it is interesting because the joint
> cardinality 43,419 is fewer than the cardinality 999,982 that would be
> expected if they were uniform and independent;
> * "fd" and "unique" are minimal. For example, we don't output
> "unique(orderId, productId)" if we have "unique(orderId)".
> Other ideas:
> * Some measure of skewedness. Does one value occur many more times than
> others?
> * Don't compute joint distributions for the power set of columns. This
> requires memory exponential in the number of columns. In pass 1 compute
> single-column distributions. In pass N compute N-column distributions only
> for the combinations of columns that the previous pass indicates will be
> interesting.
> * Use HyperLogLog to compute cardinalities.
> * Add low-cardinality columns to joint distributions. Rather than computing
> cardinality(zipcode, state) compute cardinality(zipcode, state, gender="F")
> and cardinality(zipcode, state, gender="M"). Because HLL rolls up losslessly,
> with 2x the memory you can compute 3 results: cardinality(zipcode, state),
> cardinality(zipcode, gender), cardinality(state, gender).
> * Approximate histograms: approximate quartiles? Or buckets with exact counts
> in each range?
> * Allow passing previous results into the algorithm. If you know the previous
> histogram of the orderDate column it is easier to compute its new histogram
> than if you start from scratch.
> * HyperLogLog is inaccurate for low cardinalities, so keep all values until
> the number of values exceeds a threshold, then transition to buckets.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)