Uday Babbar created SPARK-25911:
-----------------------------------

             Summary: [spark-ml] Hypothesis testing module
                 Key: SPARK-25911
                 URL: https://issues.apache.org/jira/browse/SPARK-25911
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Uday Babbar


h2. Why this ticket was created

This ticket is meant to gauge the feasibility of adding some subset of a 
hypothesis testing module, mainly in terms of its value proposition, and to get 
a preliminary opinion on how the idea sounds in general. I can work on a more 
comprehensive proposal if, say, it is generally agreed that including a 
DataFrame API for the t-test makes sense in the o.a.s.ml package. 
h2. Current state

There are some streaming implementations in the o.a.s.mllib module, but there 
are no DataFrame APIs for some standard tests (e.g. the t-test). 
||Test||Current state||Proposed state||
|t-test (Welch's, Student's)|streaming only|DataFrame API|
|chi-squared|streaming, DataFrame/RDD API present|-|
|ANOVA|-|DataFrame API|
|Mann-Whitney U test|-|RDD API (in maintenance mode so probably doesn't make sense to include this)|
h2. Rationale 

Experimentation platforms are widely used, and most of those that operate at 
scale (a large portion of which use Spark for offline computation) require 
distributed implementations of hypothesis tests to calculate p-values for 
different metrics/features. These APIs would enable distributed computation of 
the relevant statistics and avoid the overhead of moving the data (or some 
downstream view of it) to a framework where such computations are available 
(R, SciPy). 
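
As a rough illustration of why such tests distribute well, the sketch below 
computes Welch's t statistic from per-partition summaries (count, mean, sum of 
squared deviations) that can be merged in any order, so only a few numbers per 
partition ever need to leave the cluster. This is plain Python, not a proposed 
Spark API: the names Summary, summarize, merge and welch_t are hypothetical and 
chosen only for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Sufficient statistics for one sample (hypothetical helper type)."""
    n: int        # number of observations
    mean: float   # running mean
    m2: float     # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Unbiased sample variance; requires n >= 2.
        return self.m2 / (self.n - 1)


def summarize(values) -> Summary:
    """Summarize one partition in a single pass (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partition summaries (pairwise update of mean and m2)."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def welch_t(a: Summary, b: Summary):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = a.variance / a.n  # squared standard error of sample a
    vb = b.variance / b.n  # squared standard error of sample b
    t = (a.mean - b.mean) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Two partitions of a control group, merged, versus one treatment partition.
control = merge(summarize([1.0, 2.0, 3.0]), summarize([4.0, 5.0]))
treatment = summarize([2.0, 4.0, 6.0, 8.0, 10.0])
t_stat, dof = welch_t(control, treatment)
```

The two-sided p-value would then follow from the Student t distribution with 
dof degrees of freedom; this final step is cheap and can run on the driver 
once the merged summaries are available.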
