[
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15866714#comment-15866714
]
Timothy Hunter commented on SPARK-19208:
----------------------------------------
Thanks for the clarification [~mlnick]. I was a bit unclear in my previous
comment. What I meant by catalyst rules is supporting the case in which the
user would naturally request multiple summaries:
{code}
val summaryDF = df.select(VectorSummary.min("features"),
VectorSummary.variance("features"))
{code}
and have a simple rule that rewrites this logical tree to use a single UDAF
under the hood:
{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"),
col("VectorSummary(features).variance").as("variance(features)")
{code}
Of course this is more advanced, and we should probably start with:
- a UDAF that follows some builder pattern such as
VectorSummarizer.metrics("min", "max").summary("features")
- some simple wrappers that (inefficiently) compute independently their
statistics: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.
What do you think?
> MultivariateOnlineSummarizer performance optimization
> -----------------------------------------------------
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data:
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
> 748401 instances, and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
> private var currM2n: Array[Double] = _
> private var currM2: Array[Double] = _
> private var currL1: Array[Double] = _
> private var totalCnt: Long = 0
> private var totalWeightSum: Double = 0.0
> private var weightSquareSum: Double = 0.0
> private var weightSum: Array[Double] = _
> private var nnz: Array[Long] = _
> private var currMax: Array[Double] = _
> private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After modication in the pr, the above example run successfully.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]