Github user kanzhang commented on the pull request:
https://github.com/apache/spark/pull/1338#issuecomment-50183228
> it might mean that we're reusing the Bean object on the Java side when
> we read from the InputFormat. Hadoop's RecordReaders actually reuse the same
> object as you read data, so if you want to hold onto multiple data items, you
> need to clone each one. This may be a fair bit of trouble unfortunately.

That should explain it. I recall having to do a similar clone to fix
reading BytesWritable in this patch.
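
To illustrate the reuse pitfall described above (a minimal Python sketch only; the
real reuse happens inside Hadoop's Java RecordReaders, and the clone fix lives on
the JVM side):

```python
import copy

class ReusingReader:
    """Toy reader that, like a Hadoop RecordReader, reuses one record object."""
    def __init__(self, values):
        self.values = values
        self.record = {}  # single mutable record, overwritten on every read

    def __iter__(self):
        for v in self.values:
            self.record["value"] = v  # mutate in place instead of allocating
            yield self.record

# Holding onto the yielded objects without cloning leaves N references to the
# last record:
kept = [rec for rec in ReusingReader([1, 2, 3])]
print(kept)  # [{'value': 3}, {'value': 3}, {'value': 3}]

# Cloning each record before keeping it preserves the distinct values:
kept = [copy.deepcopy(rec) for rec in ReusingReader([1, 2, 3])]
print(kept)  # [{'value': 1}, {'value': 2}, {'value': 3}]
```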
There's an additional wrinkle to allowing users to set the batch size (which I
previously raised in #755). Currently, we ignore batch size when checking
whether two serializers are equal, and when they are considered equal,
re-serialization is a no-op. If we expose batch size to users, we have to honor
it. In that case, do we still want to keep the current behavior of ```union```,
where it doesn't proactively re-serialize when the two RDDs' serializers differ
only in batch size?
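
To make that concrete, here is a rough sketch (paraphrasing the behavior
described above, not the exact PySpark source):

```python
from pyspark.serializers import BatchedSerializer, PickleSerializer

a = BatchedSerializer(PickleSerializer(), 1000)
b = BatchedSerializer(PickleSerializer(), 10)

# Equality currently compares only the wrapped serializer, so the differing
# batch sizes are ignored and the two are considered equal.
print(a == b)  # True under the current behavior

# Consequently, union() sees matching deserializers on the two RDDs and skips
# the re-serialization path, even though their batch sizes differ.
```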
Adding to the confusion, the current default batch size set in the SparkContext
is 1000, which differs from the 10 that we pick for ```saveAsPickleFile``` and
```SchemaRDD.javaToPython```. I don't know why 1000 was chosen for
SparkContext. Was it meant for Spark's own objects rather than user objects?
Would a single default batch size work for both cases? As long as we can reach
a consensus on these, I'm happy to sort things out either in this patch or a
follow-up.
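
For reference, a sketch of where the two batch sizes surface to a user
(assuming a local master and a scratch output path; the numbers are the ones
cited above):

```python
from pyspark import SparkContext

# Batch size chosen at context creation (1000, per the discussion above).
sc = SparkContext("local", "batch-size-demo", batchSize=1000)

rdd = sc.parallelize(range(10000))

# saveAsPickleFile picks its own, much smaller default of 10.
rdd.saveAsPickleFile("/tmp/batch-size-demo", batchSize=10)
```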