Ophir Yoktan created SPARK-34160:
------------------------------------
Summary: pyspark.ml.stat.Summarizer should allow sparse vector results
Key: SPARK-34160
URL: https://issues.apache.org/jira/browse/SPARK-34160
Project: Spark
Issue Type: New Feature
Components: ML
Affects Versions: 3.0.1
Reporter: Ophir Yoktan
Currently, pyspark.ml.stat.Summarizer returns a dense vector even if the
input is sparse.
The Summarizer should either deduce the result type from the input or accept a
parameter that forces sparse output.
Code to reproduce:
{{import pyspark}}
{{from pyspark.sql.functions import col}}
{{from pyspark.ml.stat import Summarizer}}
{{from pyspark.ml.linalg import SparseVector, DenseVector}}
{{sc = pyspark.SparkContext.getOrCreate()}}
{{sql_context = pyspark.SQLContext(sc)}}
{{df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v'])}}
{{print(df.head())}}
{{print(df.select(Summarizer.mean(col('v'))).head())}}
Output:
{{Row(v=SparseVector(100, {1: 1.0}))}}
{{Row(mean(v)=DenseVector([0.0, 1.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0]))}}
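As a possible interim workaround (not part of Summarizer itself), the dense result can be converted back to a sparse form by hand after it has been collected. The sketch below is plain Python so that it runs without a Spark installation; `dense_to_sparse_parts` is a hypothetical helper name, and the input list only mimics the dense mean shown in the output above.

```python
from typing import List, Sequence, Tuple


def dense_to_sparse_parts(
    values: Sequence[float],
) -> Tuple[int, List[int], List[float]]:
    """Return (size, indices, nonzero_values) for a dense sequence.

    Hypothetical helper for illustration only; it is not a PySpark API.
    """
    # Keep only the positions whose value is not exactly zero.
    indices = [i for i, x in enumerate(values) if x != 0.0]
    nonzeros = [float(values[i]) for i in indices]
    return len(values), indices, nonzeros


# Stand-in for the dense mean from the report:
# size 100 with a single non-zero entry at index 1.
dense_mean = [0.0] * 100
dense_mean[1] = 1.0

size, indices, nonzeros = dense_to_sparse_parts(dense_mean)
print(size, indices, nonzeros)
```

With PySpark available, the three parts could presumably be passed on as `SparseVector(size, indices, nonzeros)`, and the input taken from the collected row via the vector's `toArray()`. This only helps once the result is on the driver; the dense vector is still built during aggregation, which is why the request asks for support in the Summarizer itself.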