Mike Dusenberry created SPARK-19653:
---------------------------------------

             Summary: `Vector` Type Should Be A First-Class Citizen In Spark SQL
                 Key: SPARK-19653
                 URL: https://issues.apache.org/jira/browse/SPARK-19653
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib, SQL
    Affects Versions: 2.1.0, 2.2.0
            Reporter: Mike Dusenberry


*Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
"Spark ML") should be added as a first-class citizen to Spark SQL.

*Current Status*:  Spark MLlib adds a [{{Vector}} SQL datatype | 
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
 so that DataFrames/Datasets can hold {{Vector}} columns, which MLlib 
algorithms require.  Although a DataFrame/Dataset can therefore contain 
vectors, the rich set of features offered by Spark SQL is not fully available 
for them.  For example, SQL functions such as {{avg}} and {{sum}} cannot be 
applied to a {{Vector}} column, and a DataFrame with a {{Vector}} column cannot 
be saved as a CSV file.  In each of these cases, an error is returned with a 
note that the operation is not supported on a {{Vector}} type.

*Benefit*: Allow users to make use of all Spark SQL features that can be 
reasonably applied to a vector.
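
As an illustration of what "reasonably applied" could mean for {{avg}} and {{sum}}, one plausible reading is element-wise aggregation over a column of equal-length vectors. The issue does not specify these semantics, so the sketch below is only an assumption, written in plain Python rather than against any Spark API; the names {{vector_sum}} and {{vector_avg}} are hypothetical.

```python
from typing import List, Sequence


def vector_sum(column: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum over a column of equal-length dense vectors.

    Assumed semantics only: the issue does not define how SUM should
    behave on a vector column.
    """
    if not column:
        raise ValueError("column is empty")
    size = len(column[0])
    totals = [0.0] * size
    for vec in column:
        if len(vec) != size:
            raise ValueError("all vectors must have the same size")
        for i, x in enumerate(vec):
            totals[i] += x
    return totals


def vector_avg(column: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean, built on the element-wise sum above."""
    n = len(column)
    return [t / n for t in vector_sum(column)]


# A stand-in for a Vector column with three rows of size-2 vectors.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(vector_sum(features))
print(vector_avg(features))
```

Under this reading, an aggregate over a {{Vector}} column would return a single vector of the same size, which is how such a function might stay consistent with its scalar counterpart.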

*Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
first-class citizen.



