[ https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Aberline updated SPARK-8510:
----------------------------------
Description:
Building on the DoubleArrayWritable example, I have added support for storing
NumPy double arrays and matrices, encoded as arrays of doubles and nested
arrays of doubles, as value elements of SequenceFiles.
Each value element is a discrete matrix or array. This is useful when you have
many matrices that you don't want to join into a single Spark DataFrame to
store in a Parquet file.
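As a rough illustration of the encoding described above (not the code from the
forthcoming PR), the sketch below turns a NumPy array or matrix into a plain or
nested list of doubles and back. The helper names to_double_lists and
from_double_lists, and the sample keys, are hypothetical; writing the records
to an actual SequenceFile is left out.

```python
import numpy as np


def to_double_lists(arr):
    """Hypothetical helper: 1-D array -> list of floats, 2-D matrix -> nested lists."""
    a = np.asarray(arr, dtype=np.float64)
    if a.ndim not in (1, 2):
        raise ValueError("expected a 1-D array or 2-D matrix, got ndim=%d" % a.ndim)
    return a.tolist()


def from_double_lists(values):
    """Hypothetical helper: rebuild a float64 NumPy array from (nested) lists."""
    return np.array(values, dtype=np.float64)


# Each (key, value) pair stands for one record; every value is a discrete
# matrix or array rather than a row of one large table.
records = [
    ("m1", to_double_lists(np.eye(2))),
    ("v1", to_double_lists(np.arange(3))),
]

# Reading the records back restores independent NumPy arrays.
restored = {key: from_double_lists(value) for key, value in records}
```

Keeping each matrix as its own value means matrices of different shapes can
sit side by side without being padded or joined.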
Pandas DataFrames can be easily converted to and from NumPy matrices, so I've
also added the ability to store the schema-less data from DataFrames and Series
that contain double data.
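To show what "schema-less" means in practice, here is a minimal sketch of the
pandas round trip, assuming a pandas version that provides
DataFrame.to_numpy(). The column names and sample values are made up for
illustration. Only the double values travel, so column labels have to be
supplied again by the reader.

```python
import numpy as np
import pandas as pd

# Sample frame with double data; the column names are illustrative only.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Only the values survive the conversion: the result is a plain 2-D float64
# matrix with no column labels or index attached.
matrix = df.to_numpy(dtype=np.float64)
nested = matrix.tolist()

# Rebuilding the frame requires supplying the labels again, because the
# stored data carries no schema.
rebuilt = pd.DataFrame(np.array(nested, dtype=np.float64), columns=["x", "y"])

# A Series of doubles flattens to a one-dimensional list of doubles.
series_values = pd.Series([0.5, 1.5]).to_numpy(dtype=np.float64).tolist()
```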
There seems to be demand for this functionality:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
I'll be issuing a PR for this shortly.
was:
I have extended the provided DoubleArrayWritable example code to store NumPy
double type arrays and matrices as arrays of doubles and nested arrays of
doubles.
Pandas DataFrames can be easily converted to NumPy matrices, so I've also added
the ability to store the schema-less data from DataFrames and Series that
contain double data.
Beyond my own use, there seems to be demand for this functionality:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
I'll be issuing a PR for this shortly.
> NumPy arrays and matrices as values in sequence files
> -----------------------------------------------------
>
> Key: SPARK-8510
> URL: https://issues.apache.org/jira/browse/SPARK-8510
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Peter Aberline
> Priority: Minor