[jira] [Created] (CALCITE-1616) Data profiler

Julian Hyde (JIRA) Wed, 01 Feb 2017 11:02:20 -0800

Julian Hyde created CALCITE-1616:
------------------------------------

             Summary: Data profiler
                 Key: CALCITE-1616
                 URL: https://issues.apache.org/jira/browse/CALCITE-1616
             Project: Calcite
          Issue Type: Bug
            Reporter: Julian Hyde
            Assignee: Julian Hyde



Profiling looks at a data set and infers characteristics and constraints about 
the data.

Some applications:
* it helps users understand their data,
* inferred constraints may allow additional optimizations (e.g. a foreign key 
allows semi-join removal),
* column statistics help the optimizer estimate the selectivity of filters and 
joins, 
* joint cardinalities drive the algorithm that chooses which tiles of a lattice 
to materialize.

Here is some sample output:

{noformat}
{"type": "distribution", "columns": [], "cardinality": 1000000)},
{"type": "distribution", "columns": ["gender"], "cardinality": 2, nullCount: 0, 
values: ["F", "M"])},
{"type": "distribution", "columns": ["state"], "cardinality": 50)},
{"type": "distribution", "columns": ["zipcode"], "cardinality": 43000, 
"nullCount": 0)},
{"type": "distribution", "columns": ["state", "zipcode"], "cardinality": 
43000)},
{"type": "unique", "columns": ["orderId"]},
{"type": "fd", "columns": ["productId"], "depend": ["brand", "productName"]},
{noformat}

Note:
* the cardinality of 0 columns is the count of the data set;
* "nullCount" and "values" are only present for distributions of 1 column;
* "nullCount" may be is omitted if 0;
* "values" is only present if there are N or fewer values
* "distribution" of 2 or more columns is only output if it is "interesting"; in 
the case of ["state", "zipcode"] it is interesting because the joint 
cardinality 43,400 is fewer than the cardinality 999,982 that would be expected 
if they were uniform and independent;
* "fd" and "unique" are minimal. For example, we don't output unique(orderId, 
productId) if we have unique(orderId).

Other ideas:
* Some measure of skewedness. Does one value occur many more times than others?
* Don't compute joint distributions for the power set of columns. This requires 
memory exponential in the number of columns. In pass 1 compute single-column 
distributions. In pass N compute N-column distributions only for the 
combinations of columns that the previous pass indicates will be interesting.
* Use HyperLogLog to compute cardinalities.
* Add low-cardinality columns to joint distributions. Rather than computing 
cardinality(zipcode, state) compute cardinality(zipcode, state, gender="F") and 
cardinality(zipcode, state, gender="M"). Because HLL rolls up losslessly, with 
2x the memory you can compute 3 results: cardinality(zipcode, state), 
cardinality(zipcode, gender), cardinality(state, gender).
* Approximate histograms: approximate quartiles? Or buckets with exact counts 
in each range?
* Allow passing previous results into the algorithm. If you know the previous 
histogram of the orderDate column it is easier to compute its new histogram 
than if you start from scratch.
* HyperLogLog is inaccurate for low cardinalities, so keep all values until the 
number of values exceeds a threshold, then transition to buckets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Created] (CALCITE-1616) Data profiler

Reply via email to