GitHub user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/1338#issuecomment-49810291
Major changes in the updated patch:
1. Replaced doctests with standalone tests
2. Fixed the converter for BytesWritable and added read/write tests for
BytesWritable and byte arrays
3. Added HBase and Cassandra output format and converter examples
4. Previously I inspected array element types and converted Object[] to arrays
of primitive types whenever possible (primitive arrays get pickled to Python
arrays, whereas Object[] gets pickled to a Python tuple). I removed that code
because element types cannot be determined for empty arrays. Users who want
Java arrays to appear as Python arrays have to supply custom converters (and
they can, since they know their array types a priori); see the first sketch
after this list.
5. No out-of-the-box support for reading/writing arrays, since ArrayWritable
itself doesn't have a no-arg constructor for creating an empty instance during
deserialization. Users need to provide ArrayWritable subtypes instead. When
writing, custom converters are also needed to convert arrays to suitable
ArrayWritable subtypes. When reading, the default converter converts any custom
ArrayWritable subtype to Object[], which gets pickled to a Python tuple; see
the second sketch after this list.
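
For item 4, here is a minimal sketch of such a custom converter, assuming the
org.apache.spark.api.python.Converter trait (a single convert method) and
values known a priori to be Object[] holding boxed doubles; the class name is
hypothetical:

```scala
import org.apache.spark.api.python.Converter

// Hypothetical converter: the user knows a priori that values are Object[]
// holding boxed doubles, so it produces double[], which gets pickled to a
// Python array instead of a tuple.
class ObjectArrayToDoubleArrayConverter extends Converter[Any, Array[Double]] {
  override def convert(obj: Any): Array[Double] = obj match {
    case arr: Array[Object] => arr.map(_.asInstanceOf[Double])
    case other => throw new IllegalArgumentException("Expected Object[], got: " + other)
  }
}
```

A class like this would be referenced by its fully qualified name through the
keyConverter/valueConverter arguments of the PySpark read/write methods.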
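
For item 5, a sketch of an ArrayWritable subtype plus the write-side converter
that wraps arrays into it; both names are hypothetical, and the subtype's
implicit no-arg constructor is exactly what plain ArrayWritable lacks:

```scala
import org.apache.hadoop.io.{ArrayWritable, DoubleWritable, Writable}
import org.apache.spark.api.python.Converter

// Hypothetical subtype: fixes the element class and, crucially, gets the
// no-arg constructor Hadoop needs to instantiate it during deserialization.
class DoubleArrayWritable extends ArrayWritable(classOf[DoubleWritable])

// Hypothetical write-side converter: wraps a double[] coming from Python
// into the DoubleArrayWritable subtype so it can be written out.
class DoubleArrayToWritableConverter extends Converter[Any, DoubleArrayWritable] {
  override def convert(obj: Any): DoubleArrayWritable = obj match {
    case arr: Array[Double] =>
      val writable = new DoubleArrayWritable
      writable.set(arr.map(d => new DoubleWritable(d): Writable))
      writable
    case other => throw new IllegalArgumentException("Expected double[], got: " + other)
  }
}
```

On the read side nothing extra is needed: the default converter turns any such
subtype into Object[], which arrives in Python as a tuple.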