[ 
https://issues.apache.org/jira/browse/SPARK-8510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Aberline updated SPARK-8510:
----------------------------------
    Description: 
Building on the DoubleArrayWritable example, I have added support for storing 
NumPy double arrays and matrices as the value elements of Sequence Files. 
Arrays are written as arrays of doubles and matrices as nested arrays of doubles.
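
As a rough illustration of the value layout described above (this is not the 
code from the forthcoming PR), a NumPy array can be flattened into plain 
(nested) lists of Python floats and rebuilt from them. The helper names 
to_nested_doubles and from_nested_doubles are hypothetical, and the sketch 
does not touch Spark or Hadoop at all:

```python
import numpy as np


def to_nested_doubles(arr):
    """Convert a NumPy array or matrix into (nested) lists of Python floats.

    Hypothetical helper for illustration only: a 1-D array becomes a flat
    list of doubles, a 2-D array becomes a list of lists of doubles.
    """
    return np.asarray(arr, dtype=np.float64).tolist()


def from_nested_doubles(values):
    """Rebuild a float64 NumPy array from (nested) lists of doubles."""
    return np.array(values, dtype=np.float64)


vector = np.array([1.0, 2.5, 3.0])
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

vector_value = to_nested_doubles(vector)  # flat list of floats
matrix_value = to_nested_doubles(matrix)  # list of lists of floats

# Round trip: the nested lists carry enough information to restore the matrix.
restored = from_nested_doubles(matrix_value)
```

Each such list would be one value element, paired with whatever key the 
caller chooses for the record.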

Each value element is a discrete matrix or array. This is useful when you have 
many matrices that you don't want to join into a single Spark DataFrame to 
store in a Parquet file.

Pandas DataFrames can be easily converted to and from NumPy matrices, so I've 
also added the ability to store the schema-less data from DataFrames and Series 
that contain double data. 
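
A minimal sketch of that conversion, using only pandas and NumPy calls and 
assuming that "schema-less" means the column names and index are not stored 
with the values (the column names below are made up for the example):

```python
import numpy as np
import pandas as pd

# A small DataFrame holding only double data (hypothetical column names).
frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Dropping the schema: only the double values remain, as a 2-D float64 array.
values = frame.to_numpy(dtype=np.float64)

# Converting back: the caller has to supply the column names again,
# because they were not kept alongside the values.
rebuilt = pd.DataFrame(values, columns=["a", "b"])

# A Series of doubles maps to a 1-D array in the same way.
series = pd.Series([0.5, 1.5, 2.5])
series_values = series.to_numpy(dtype=np.float64)
```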

There seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E

I'll be issuing a PR for this shortly.

  was:
I have extended the provided DoubleArrayWritable example code to store NumPy 
double type arrays and matrices as arrays of doubles and nested arrays of 
doubles.

Pandas DataFrames can be easily converted to NumPy matrices, so I've also added 
the ability to store the schema-less data from DataFrames and Series that 
contain double data. 

Beyond my own use, there seems to be demand for this functionality:

http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E

I'll be issuing a PR for this shortly.


> NumPy arrays and matrices as values in sequence files
> -----------------------------------------------------
>
>                 Key: SPARK-8510
>                 URL: https://issues.apache.org/jira/browse/SPARK-8510
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Peter Aberline
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
