Github user kanzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1338#discussion_r15150914
  
    --- Diff: docs/programming-guide.md ---
    @@ -403,31 +403,30 @@ PySpark SequenceFile support loads an RDD within Java, and pickles the resulting
     <tr><td>BooleanWritable</td><td>bool</td></tr>
     <tr><td>BytesWritable</td><td>bytearray</td></tr>
     <tr><td>NullWritable</td><td>None</td></tr>
    -<tr><td>ArrayWritable</td><td>list of primitives, or tuple of objects</td></tr>
    --- End diff ---
    
    @mateiz we don't currently handle arrays, and the same is true of the Scala API. The reason is that ArrayWritable has no no-arg constructor for creating an empty instance upon reading, so users need to define their own subtypes. Although we could ship subtypes for the primitive array types, that would make Spark a dependency for users, which we probably don't want.
    
    For conversion between arrays and ArrayWritable subtypes, we can deserialize automatically upon reading as long as the subtype is on the classpath. However, our default converter can't convert arrays to ArrayWritable subtypes upon writing, since it doesn't know which subtype to use; users need to supply custom converters.
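    
    As a rough sketch of the write side (assuming Spark's `Converter` trait in `org.apache.spark.api.python`, the hypothetical `DoubleArrayWritable` above, and that the value arrives on the JVM side as an `Array[Double]`):
    
    ```scala
    import org.apache.hadoop.io.{DoubleWritable, Writable}
    import org.apache.spark.api.python.Converter
    
    // Example write-side converter: wraps an Array[Double] in the
    // DoubleArrayWritable subtype so it can be saved to a SequenceFile.
    class ToDoubleArrayWritableConverter extends Converter[Any, Writable] {
      override def convert(obj: Any): Writable = obj match {
        case arr: Array[Double] =>
          val aw = new DoubleArrayWritable
          aw.set(arr.map(d => new DoubleWritable(d): Writable))
          aw
        case other =>
          throw new IllegalArgumentException(s"Unexpected type: ${other.getClass}")
      }
    }
    ```
    
    The user would then pass this converter's class name as the value converter when saving from PySpark.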
    
    We should look into ArrayPrimitiveWritable, which is not available in 
Hadoop v1.0.4.

