[ 
https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-34160.
----------------------------------
    Resolution: Not A Problem

> pyspark.ml.stat.Summarizer should allow sparse vector results
> -------------------------------------------------------------
>
>                 Key: SPARK-34160
>                 URL: https://issues.apache.org/jira/browse/SPARK-34160
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 3.0.1
>            Reporter: Ophir Yoktan
>            Priority: Major
>
> Currently, pyspark.ml.stat.Summarizer returns a dense vector even if the 
> input is sparse.
> The Summarizer should either infer the result type from the input or accept 
> a parameter that forces sparse output.
> Code to reproduce:
> import pyspark
> from pyspark.sql.functions import col
> from pyspark.ml.stat import Summarizer
> from pyspark.ml.linalg import SparseVector, DenseVector
>
> sc = pyspark.SparkContext.getOrCreate()
> sql_context = pyspark.SQLContext(sc)
> df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])
> print(df.head())
> print(df.select(Summarizer.mean(col('v'))).head())
> Output:
> Row(v=SparseVector(100, {1: 1.0}))
> Row(mean(v)=DenseVector([0.0, 1.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
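
Since the issue was resolved as "Not A Problem", a caller who still wants a sparse result would presumably have to convert the dense output after the fact. The following is only a sketch of that idea, not anything Spark provides: the helper `dense_to_sparse_args` and its `tol` parameter are hypothetical names, and a plain Python list stands in for the dense mean so the snippet runs without a Spark installation.

```python
from typing import Dict, Sequence, Tuple


def dense_to_sparse_args(
    values: Sequence[float], tol: float = 0.0
) -> Tuple[int, Dict[int, float]]:
    """Hypothetical helper: return (size, {index: value}) for entries with |value| > tol."""
    nonzero = {i: float(v) for i, v in enumerate(values) if abs(v) > tol}
    return len(values), nonzero


# Stand-in for the dense mean in the report: 100 entries, only index 1 is non-zero.
dense_mean = [0.0] * 100
dense_mean[1] = 1.0

size, entries = dense_to_sparse_args(dense_mean)
print(size, entries)
```

The returned pair has the same shape as the arguments used in the report's `SparseVector(100, {1: 1.0})` call, so with pyspark available it could be passed on as `SparseVector(size, entries)`. The conversion runs on the driver and only makes sense when the summary is genuinely mostly zeros; a mean over many rows often is not.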



