Uday Babbar created SPARK-25911:
-----------------------------------
Summary: [spark-ml] Hypothesis testing module
Key: SPARK-25911
URL: https://issues.apache.org/jira/browse/SPARK-25911
Project: Spark
Issue Type: Improvement
Components: ML, MLlib
Affects Versions: 3.0.0
Reporter: Uday Babbar
h2. Why this ticket was created
This ticket is meant to gauge whether adding some subset of a hypothesis
testing module is feasible and worthwhile, mainly in terms of its value
proposition, and to get a preliminary opinion on the idea. I can work on a
more comprehensive proposal if it is generally agreed that a DataFrame API
for the t-test makes sense in the o.a.s.ml package.
h2. Current state
There are some streaming implementations in the o.a.s.mllib module, but there
are no DataFrame APIs for some standard tests (e.g. the t-test).
||Test||Current state||Proposed state||
|t-test (Welch's, Student's)|Streaming only|DataFrame API|
|Chi-squared|Streaming; DataFrame/RDD API present|-|
|ANOVA|-|DataFrame API|
|Mann-Whitney U test|-|RDD API (in maintenance mode, so it probably doesn't make sense to include this)|
h2. Rationale
Experimentation platforms are widely used, and most of those that operate at
scale (a large portion of which use Spark for offline computation) require
distributed implementations of hypothesis tests to calculate p-values for
different metrics/features. These APIs would enable distributed computation of
the relevant statistics and avoid the overhead of moving data (or some
downstream view of it) to a framework where such computation is available
(R, SciPy).
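
To make the "distributed computation of the relevant stats" point concrete, here is a minimal sketch in plain Python (not Spark code, and not a proposed API) of how Welch's t-test could be computed from per-partition summaries that are merged pairwise. The names `Summary`, `summarize` and `welch_t` are hypothetical and used for illustration only. The merge step uses the standard pairwise update for count, mean and sum of squared deviations, so only three numbers per partition need to move, not the raw rows.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Mergeable sufficient statistics for one sample (hypothetical helper)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @staticmethod
    def of(values):
        """Build a summary from raw values, e.g. the rows of one partition."""
        s = Summary()
        for v in values:
            s = s.merge(Summary(1, float(v), 0.0))
        return s

    def merge(self, other):
        """Combine two summaries without revisiting the underlying data."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self):
        """Unbiased sample variance; requires at least two observations."""
        return self.m2 / (self.n - 1)


def summarize(partitions):
    """Summarize each partition independently, then merge the results."""
    total = Summary()
    for part in partitions:
        total = total.merge(Summary.of(part))
    return total


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = a.variance / a.n, b.variance / b.n
    t = (a.mean - b.mean) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Toy data: each inner list stands in for one partition of a metric column.
control = summarize([[1, 2], [3, 4, 5]])
treatment = summarize([[2, 4, 6], [8, 10]])
t_stat, dof = welch_t(control, treatment)
```

The two-sided p-value would then come from the Student's t distribution with `dof` degrees of freedom, for example `2 * scipy.stats.t.sf(abs(t_stat), dof)` if SciPy is available. In a DataFrame API the per-partition summaries would presumably be produced by an aggregation, but that part is an assumption and is not specified in this ticket.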