Peter Aberline created SPARK-8510:
-------------------------------------
Summary: Store and read NumPy arrays and matrices as values in
sequence files
Key: SPARK-8510
URL: https://issues.apache.org/jira/browse/SPARK-8510
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Peter Aberline
Priority: Minor
I have extended the provided DoubleArrayWritable example code to store NumPy
arrays and matrices of doubles as arrays of doubles and nested arrays of
doubles, respectively.
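As a rough illustration of the Python-side conversion described above, the sketch below turns a 1-D or 2-D array-like into a list of doubles or a nested list of doubles. It is not the code from the forthcoming PR: the helper name `to_doubles` is hypothetical, and the sketch assumes only that the input is either a plain Python sequence or an object with a `tolist()` method, as NumPy arrays have. Writing the result out through a Writable converter is not shown.

```python
from numbers import Real


def to_doubles(values):
    """Return a list of floats (1-D input) or a list of lists of floats (2-D input).

    Hypothetical helper for illustration only. Accepts plain nested
    sequences or any object exposing tolist(), such as a NumPy array.
    """
    # NumPy arrays expose tolist(); plain lists and tuples are used as they are.
    data = values.tolist() if hasattr(values, "tolist") else values
    if isinstance(data, Real):
        raise TypeError("expected a 1-D or 2-D array-like, got a scalar")

    rows = list(data)
    if rows and all(isinstance(r, (list, tuple)) for r in rows):
        # 2-D case: every row must have the same length to form a matrix.
        width = len(rows[0])
        if any(len(r) != width for r in rows):
            raise ValueError("all rows must have the same length")
        return [[float(x) for x in r] for r in rows]

    # 1-D case: coerce every element to a Python float (a double).
    return [float(x) for x in rows]
```

With NumPy installed, a call such as `to_doubles(numpy.arange(6.0).reshape(2, 3))` should go through the same 2-D branch, since `tolist()` on a two-dimensional array yields a list of lists.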
Pandas DataFrames can be easily converted to NumPy matrices, so I've also added
the ability to store the schema-less data from DataFrames and Series that
contain double data.
Apart from my own use, there seems to be demand for this functionality:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E
I'll be issuing a PR for this shortly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)