Github user paberline commented on the pull request:
https://github.com/apache/spark/pull/6995#issuecomment-115004710
Building on the DoubleArrayWritable example, I have added support for storing
NumPy double arrays and matrices as the value elements of SequenceFiles: 1-D
arrays are stored as arrays of doubles, and matrices as nested arrays of
doubles.
Each value element is a discrete matrix or array. This is useful when you
have many matrices that you don't want to join into a single Spark DataFrame
for storage in a Parquet file.
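To illustrate the layout described above, here is a minimal sketch of preparing such records on the Python side: each value is one discrete matrix, converted to nested lists of Python floats so it can map to nested arrays of doubles. The matrix names, the `sc` SparkContext, and the output path are assumptions for the example, so the Spark calls are left as comments:

```python
import numpy as np

# Each record is a (key, value) pair; the value is one discrete matrix,
# represented as nested lists of floats (-> nested arrays of doubles).
matrices = {
    "m1": np.arange(6, dtype=np.float64).reshape(2, 3),
    "m2": np.eye(2),
}
records = [(key, m.tolist()) for key, m in matrices.items()]

# With a SparkContext `sc` (assumed) the records could then be written
# out as a SequenceFile, e.g.:
#   sc.parallelize(records).saveAsSequenceFile("/path/to/output")
```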
Pandas DataFrames and Series can be easily converted to and from NumPy
matrices, so I've also added the ability to store the schema-less data from
DataFrames and Series that contain double values.
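The round trip mentioned above is the standard pandas `.values` / constructor pair; a small illustration (the column labels here are invented, and note that `.values` drops the index and column labels, which is why the stored data is schema-less):

```python
import numpy as np
import pandas as pd

# A DataFrame of doubles; the column labels are hypothetical.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# DataFrame -> NumPy array: labels and index are dropped (schema-less).
arr = df.values  # shape (2, 2), dtype float64

# NumPy array -> DataFrame: a default integer index and default column
# labels are generated, since no schema was stored with the doubles.
df2 = pd.DataFrame(arr)
```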
There seems to be demand for this functionality:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAJQK-mg1PUCc_hkV=q3n-01ioq_pkwe1g-c39ximco3khqn...@mail.gmail.com%3E