[ 
https://issues.apache.org/jira/browse/CALCITE-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077184#comment-16077184
 ] 

Julian Hyde commented on CALCITE-1616:
--------------------------------------

I spoke about this work at DataWorks Summit: 
https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite. I 
tried to explain the algorithm and motivating cases.

> Data profiler
> -------------
>
>                 Key: CALCITE-1616
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1616
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>
> Profiling looks at a data set and infers characteristics and constraints 
> about the data.
> Some applications:
> * it helps users understand their data,
> * inferred constraints may allow additional optimizations (e.g. a foreign key 
> allows semi-join removal),
> * column statistics help the optimizer estimate the selectivity of filters 
> and joins, 
> * joint cardinalities drive the algorithm that chooses which tiles of a 
> lattice to materialize.
> Imagine you ran a profiler on a data set of 1 million rows and columns 
> [orderId, gender, state, zipcode, productId, productName, brand]. Here is 
> some sample output:
> {noformat}
> {"type": "distribution", "columns": [], "cardinality": 1000000)},
> {"type": "distribution", "columns": ["gender"], "cardinality": 2, nullCount: 
> 0, values: ["F", "M"])},
> {"type": "distribution", "columns": ["state"], "cardinality": 50)},
> {"type": "distribution", "columns": ["zipcode"], "cardinality": 43000, 
> "nullCount": 0)},
> {"type": "distribution", "columns": ["state", "zipcode"], "cardinality": 
> 43419)},
> {"type": "unique", "columns": ["orderId"]},
> {"type": "fd", "columns": ["productId"], "depend": ["brand", "productName"]},
> {noformat}
> Note:
> * the cardinality of 0 columns is the count of the data set;
> * "nullCount" and "values" are only present for distributions of 1 column;
> * "nullCount" may be is omitted if 0;
> * "values" is only present if there are N or fewer values
> * "distribution" of 2 or more columns is only output if it is "interesting"; 
> in the case of ["state", "zipcode"] it is interesting because the joint 
> cardinality 43,419 is fewer than the cardinality 999,982 that would be 
> expected if they were uniform and independent;
> * "fd" and "unique" are minimal. For example, we don't output 
> "unique(orderId, productId)" if we have "unique(orderId)".
> Other ideas:
> * Some measure of skewedness. Does one value occur many more times than 
> others?
> * Don't compute joint distributions for the power set of columns. This 
> requires memory exponential in the number of columns. In pass 1 compute 
> single-column distributions. In pass N compute N-column distributions only 
> for the combinations of columns that the previous pass indicates will be 
> interesting.
> * Use HyperLogLog to compute cardinalities.
> * Add low-cardinality columns to joint distributions. Rather than computing 
> cardinality(zipcode, state) compute cardinality(zipcode, state, gender="F") 
> and cardinality(zipcode, state, gender="M"). Because HLL rolls up losslessly, 
> with 2x the memory you can compute 3 results: cardinality(zipcode, state), 
> cardinality(zipcode, gender), cardinality(state, gender).
> * Approximate histograms: approximate quartiles? Or buckets with exact counts 
> in each range?
> * Allow passing previous results into the algorithm. If you know the previous 
> histogram of the orderDate column it is easier to compute its new histogram 
> than if you start from scratch.
> * HyperLogLog is inaccurate for low cardinalities, so keep all values until 
> the number of values exceeds a threshold, then transition to buckets.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to