Nicholas Chammas created SPARK-19217:
----------------------------------------
Summary: Offer easy cast from vector to array
Key: SPARK-19217
URL: https://issues.apache.org/jira/browse/SPARK-19217
Project: Spark
Issue Type: Improvement
Components: ML, PySpark, SQL
Affects Versions: 2.1.0
Reporter: Nicholas Chammas
Priority: Minor
Working with ML often means working with DataFrames with vector columns. You
can't save these DataFrames to storage without converting the vector columns to
array columns, and there doesn't appear to an easy way to make that conversion.
This is a common enough problem that it is [documented on Stack
Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions to
making the conversion from a vector column to an array column are:
# Convert the DataFrame to an RDD and back
# Use a UDF
Both approaches work fine, but it really seems like you should be able to do
something like this instead:
{code}
(le_data
.select(
col('features').cast('array').alias('features')
))
{code}
We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears that
{{cast()}} doesn't support this conversion.
Would this be an appropriate thing to add?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]