[ https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Walsh updated SPARK-24258:
-------------------------------
    Description: 
h1. Background and Motivation:

In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can construct: {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and {{DenseMatrix}}.  In PySpark, you can construct one of these vector columns with {{VectorAssembler}}, run Python UDFs on them, and call {{toArray()}} to get numpy ndarrays to work with.  These types also have a small native API with {{dot()}}, {{norm()}}, and a few other operations (I believe these are computed in Scala, not Python, though I could be wrong).
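For concreteness, here is a minimal sketch of that workflow today (the DataFrame {{df}} and column names are illustrative):

{code:python}
import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# df is an illustrative input DataFrame with numeric columns x1, x2.
# Assemble scalar columns into a single vector column.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled = assembler.transform(df)

# A Python UDF that unwraps the vector with toArray() to use numpy; the
# vector also exposes the small native API, e.g. v.dot(v) and v.norm(1).
@udf(returnType=DoubleType())
def l1_norm(v):
    return float(np.abs(v.toArray()).sum())

assembled = assembled.withColumn("l1", l1_norm(assembled["features"]))
{code}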

For statistical applications, the ability to manipulate columns of matrix and vector values would be powerful (from here on, I use the term tensor for arrays of arbitrary dimensionality; matrices are 2-tensors and vectors are 1-tensors).  For example, you could use PySpark to reshape your data in parallel, assemble some matrices from your raw data, and then run statistical computations on them using UDFs that leverage Python libraries like statsmodels, numpy, tensorflow, and scikit-learn.

I propose enriching the {{pyspark.ml.linalg}} types in the following ways:

# Expand the set of column operations one can apply to tensor columns beyond the few functions currently available on these types.  Ideally, the API should aim to be as wide as the numpy ndarray API, but would wrap Breeze operations.  For example, we should provide {{DenseVector.outerProduct()}} so that a user could write something like {{df.withColumn("XtX", df["X"].outerProduct(df["X"]))}} (see the sketch after this list).
# Make sure all ser/de mechanisms (including Arrow) understand these types, and faithfully represent them as natural types in each language when applying UDFs or collecting with {{toPandas()}}: in Scala and Java, Breeze objects; in Python, numpy ndarrays rather than the pyspark.ml.linalg types that wrap them; in SparkR, I'm not sure what, but something natural.
# Improve the construction of these types from scalar columns.  The 
{{VectorAssembler}} API is not very ergonomic.  I propose something like 
{{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], 
df["feature3"]))}}.

h1. Target Personas:

Data scientists, machine learning practitioners, machine learning library 
developers.

h1. Goals:

This would let users do more statistical computation natively in Spark, and apply Python statistical libraries to data in Spark using UDFs.

h1. Non-Goals:

One non-goal is reimplementing something like statsmodels on top of Breeze data structures and computation.  That could be seen as an effort to enrich Spark ML itself, but it is out of scope here.  The goal is simply to make it possible and easy to apply existing Python libraries to tensor values in parallel.

h1. Proposed API Changes:

Add the above APIs to PySpark and the other language bindings.  I think the 
list is:

# {{pyspark.ml.linalg.Vector.of(*columns)}} (see the sketch after this list)
# {{pyspark.ml.linalg.Matrix.of(<not sure what goes here, maybe we don't 
provide this>)}}
# For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc.  The routines listed at https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html are a good reference.
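As a rough sketch of the intended semantics, {{Vector.of(*columns)}} could desugar to something like the following, built only from existing pieces ({{vector_of}} is a hypothetical helper, not a real API):

{code:python}
from pyspark.sql.functions import array, udf
from pyspark.ml.linalg import Vectors, VectorUDT

@udf(returnType=VectorUDT())
def _dense_from_array(values):
    # The array column arrives as a Python list; wrap it as a DenseVector.
    return Vectors.dense([float(x) for x in values])

def vector_of(*columns):
    # Hypothetical helper approximating the proposed Vector.of(*columns).
    return _dense_from_array(array(*columns))

df = df.withColumn("predictors",
                   vector_of(df["feature1"], df["feature2"], df["feature3"]))
{code}

The built-in version would presumably skip the UDF round trip and construct the vector natively.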

Also, change Python UDFs so that these tensor types are passed to the Python function not as \{Sparse,Dense\}\{Matrix,Vector\} wrappers around {{numpy.ndarray}}, but as {{numpy.ndarray}} objects themselves, and interpret {{numpy.ndarray}} return values back into the Spark types.
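To illustrate the difference, assuming a vector column and an illustrative function name:

{code:python}
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Today: the UDF receives the wrapper type and must unwrap it itself.
@udf(returnType=DoubleType())
def l2_norm(v):
    return float(np.linalg.norm(v.toArray()))

# Under this proposal, ser/de would do the unwrapping, and the body could
# be written directly against numpy.ndarray:
#
#     def l2_norm(arr):
#         return float(np.linalg.norm(arr))
{code}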

> SPIP: Improve PySpark support for ML Matrix and Vector types
> ------------------------------------------------------------
>
>                 Key: SPARK-24258
>                 URL: https://issues.apache.org/jira/browse/SPARK-24258
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Leif Walsh
>            Priority: Major


