[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866714#comment-15866714 ]

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM:
-----------------------------------------------------------------

Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by Catalyst rules is supporting the case in which the user naturally requests multiple summaries:

{code}
val summaryDF = df.select(
  VectorSummary.min("features"),
  VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(
  col("vector_summary(features).min").as("min(features)"),
  col("vector_summary(features).variance").as("variance(features)"))
{code}
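
For concreteness, here is a skeleton of what such a rule could look like (the name {{MergeVectorSummaries}} and the matching logic are hypothetical; only the {{Rule[LogicalPlan]}} shape is actual Catalyst API):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical sketch: the real rewrite logic is elided; this only shows
// where it would live in the optimizer.
object MergeVectorSummaries extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case agg: Aggregate =>
      // Placeholder: detect multiple VectorSummary aggregates over the same
      // column here and rewrite them into a single combined aggregate
      // followed by field extraction, as in the example above.
      agg
  }
}
{code}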

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows a builder pattern, such as {{VectorSummarizer.metrics("min", "max").summary("features")}}
 - some simple wrappers that (inefficiently) compute their statistics independently: {{VectorSummarizer.min("features")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules. A rough sketch of the builder is below.
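
A minimal sketch of the builder semantics, for discussion (the class names, the RDD-based signature, and the {{Map}} return type are all assumptions; the real API would be a DataFrame UDAF that also avoids computing the unrequested statistics):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// Hypothetical sketch: one pass over the data, then expose only the
// requested metrics. Unknown metric names are simply ignored here.
class SummaryBuilder(requested: Seq[String]) {
  def summary(data: RDD[Vector]): Map[String, Vector] = {
    val agg = data.treeAggregate(new MultivariateOnlineSummarizer)(
      (summ, v) => summ.add(v),
      (s1, s2) => s1.merge(s2))
    requested.flatMap {
      case "min"      => Some("min" -> agg.min)
      case "max"      => Some("max" -> agg.max)
      case "mean"     => Some("mean" -> agg.mean)
      case "variance" => Some("variance" -> agg.variance)
      case _          => None
    }.toMap
  }
}

object VectorSummarizer {
  def metrics(requested: String*): SummaryBuilder = new SummaryBuilder(requested)
}
{code}

Note that this sketch still computes everything internally; the point of the eventual UDAF is to allocate only the buffers that the requested metrics actually need.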

What do you think?


> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
>                 Key: SPARK-19208
>                 URL: https://issues.apache.org/jira/browse/SPARK-19208
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>         Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Currently, {{MaxAbsScaler}} and {{MinMaxScaler}} use {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} also computes extra, unused statistics. This slows down the task and, moreover, makes it more prone to OOM.
> For example:
> env: --driver-memory 4G --executor-memory 1G --num-executors 4
> data: [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] (748,401 instances, 29,890,095 features)
> {{MaxAbsScaler.fit}} fails because of OOM.
> {{MultivariateOnlineSummarizer}} maintains 8 arrays (plus a few scalar accumulators):
> {code}
> private var currMean: Array[Double] = _
> private var currM2n: Array[Double] = _
> private var currM2: Array[Double] = _
> private var currL1: Array[Double] = _
> private var totalCnt: Long = 0
> private var totalWeightSum: Double = 0.0
> private var weightSquareSum: Double = 0.0
> private var weightSum: Array[Double] = _
> private var nnz: Array[Long] = _
> private var currMax: Array[Double] = _
> private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (the max of the absolute values).
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz).
> After the modification in the PR, the above example runs successfully.
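> For illustration, a hedged sketch (not the actual PR code) of maintaining only the max-abs statistic with a single array:
> {code}
> import org.apache.spark.mllib.linalg.Vector
> import org.apache.spark.rdd.RDD
>
> // Hypothetical sketch: aggregate a single Array[Double] of size numFeatures
> // instead of the 8 arrays above.
> def maxAbs(data: RDD[Vector], numFeatures: Int): Array[Double] =
>   data.treeAggregate(Array.fill(numFeatures)(0.0))(
>     seqOp = (acc, v) => {
>       v.foreachActive((i, x) => acc(i) = math.max(acc(i), math.abs(x)))
>       acc
>     },
>     combOp = (a, b) => {
>       var i = 0
>       while (i < a.length) { a(i) = math.max(a(i), b(i)); i += 1 }
>       a
>     })
> {code}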


