Mike Dusenberry created SPARK-19653:
---------------------------------------
Summary: `Vector` Type Should Be A First-Class Citizen In Spark SQL
Key: SPARK-19653
URL: https://issues.apache.org/jira/browse/SPARK-19653
Project: Spark
Issue Type: Improvement
Components: ML, MLlib, SQL
Affects Versions: 2.1.0, 2.2.0
Reporter: Mike Dusenberry
*Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally
"Spark ML") should be added as a first-class citizen to Spark SQL.
*Current Status*: Spark MLlib adds a [{{Vector}} SQL datatype |
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
to allow DataFrames/Datasets to use {{Vector}} columns, which MLlib algorithms
require. Although this allows a DataFrame/Dataset to contain vectors, it does
not allow one to make full use of the rich set of features that Spark SQL
provides. For example, SQL functions such as {{avg}} and {{sum}} cannot be
applied to a {{Vector}} column, nor can a DataFrame with a {{Vector}} column be
saved as a CSV file. In each of these cases, an error is returned with a note
that the operation is not supported on a {{Vector}} type.
*Benefit*: Allow users to make use of all Spark SQL features that can be
reasonably applied to a vector.
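As a rough illustration of what "reasonably applied" could mean for {{avg}} and {{sum}}, the sketch below computes element-wise aggregates over a column of equal-length dense vectors. It is plain Python with no Spark dependency; the names {{vector_sum}} and {{vector_avg}} are hypothetical and are not part of any Spark API, and element-wise semantics are an assumption rather than something this issue specifies.

```python
from typing import List, Sequence


def vector_sum(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of equal-length dense vectors (hypothetical helper)."""
    if not rows:
        raise ValueError("need at least one vector")
    size = len(rows[0])
    if any(len(row) != size for row in rows):
        raise ValueError("all vectors must have the same length")
    # zip(*rows) groups the i-th component of every vector together.
    return [sum(components) for components in zip(*rows)]


def vector_avg(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length dense vectors (hypothetical helper)."""
    total = vector_sum(rows)
    return [value / len(rows) for value in total]


# A stand-in for a Vector column: one dense vector per row.
column = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

print(vector_sum(column))  # [4.0, 6.0, 8.0]
print(vector_avg(column))  # [2.0, 3.0, 4.0]
```

Sparse vectors and vectors of differing lengths would need their own rules; this sketch simply rejects mismatched lengths.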
*Goal*: Move the {{Vector}} type from Spark MLlib into Spark SQL as a
first-class citizen.