Merge pull request #146 from JoshRosen/pyspark-custom-serializers

Custom Serializers for PySpark

This pull request adds support for custom serializers to PySpark. For now, all Python-transformed (or parallelize()d) RDDs are serialized with the same serializer, which is specified when creating the SparkContext. PySpark currently includes `PickleSerDe` and `MarshalSerDe` classes for using Python's `pickle` and `marshal` serializers. It's pretty easy to add support for other serializers, although I still need to add instructions on this.

A few notable changes:

- The Scala `PythonRDD` class no longer manipulates Pickled objects; data from `textFile` is written to Python as MUTF-8 strings. The Python code performs the appropriate bookkeeping to track which deserializer should be used when reading an underlying JavaRDD. This mechanism could also be used to support other data exchange formats, such as MsgPack.
- Several magic numbers were refactored into constants.
- Batching is implemented by wrapping / decorating an unbatched SerDe.

Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/fb6875dd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/fb6875dd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/fb6875dd

Branch: refs/heads/master
Commit: fb6875dd5c9334802580155464cef9ac4d4cc1f0
Parents: 330ada1 1b74a27
Author: Matei Zaharia <[email protected]>
Authored: Tue Nov 26 20:55:40 2013 -0800
Committer: Matei Zaharia <[email protected]>
Committed: Tue Nov 26 20:55:40 2013 -0800

----------------------------------------------------------------------
 .../org/apache/spark/api/python/PythonRDD.scala | 149 +++------
 python/epydoc.conf                              |   2 +-
 python/pyspark/accumulators.py                  |   6 +-
 python/pyspark/context.py                       |  71 +++--
 python/pyspark/rdd.py                           |  97 +++---
 python/pyspark/serializers.py                   | 301 ++++++++++++++++---
 python/pyspark/tests.py                         |   3 +-
 python/pyspark/worker.py                        |  44 ++-
 python/run-tests                                |   1 +
 9 files changed, 428 insertions(+), 246 deletions(-)
----------------------------------------------------------------------
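
For readers unfamiliar with the "wrapping / decorating an unbatched SerDe" idea from the notable changes, here is a minimal, standalone sketch of how such batching could work. It is not the code from this commit: the class names `SketchPickleSerDe` and `SketchBatchedSerDe`, the `dump_stream` / `load_stream` method names, and the 4-byte length-prefixed framing are all assumptions made for illustration.

```python
import io
import pickle
import struct
from itertools import islice


class SketchPickleSerDe:
    """Hypothetical unbatched SerDe: one length-prefixed pickle frame per object."""

    def dump_stream(self, iterator, stream):
        for obj in iterator:
            payload = pickle.dumps(obj, protocol=2)
            # Assumed framing: 4-byte big-endian length, then the payload.
            stream.write(struct.pack("!i", len(payload)))
            stream.write(payload)

    def load_stream(self, stream):
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return  # end of stream
            (length,) = struct.unpack("!i", header)
            yield pickle.loads(stream.read(length))


class SketchBatchedSerDe:
    """Hypothetical decorator: groups objects into lists before delegating."""

    def __init__(self, serde, batch_size):
        self.serde = serde
        self.batch_size = batch_size

    def _batched(self, iterator):
        iterator = iter(iterator)
        while True:
            batch = list(islice(iterator, self.batch_size))
            if not batch:
                return
            yield batch

    def dump_stream(self, iterator, stream):
        # The wrapped SerDe sees each batch (a list) as a single object.
        self.serde.dump_stream(self._batched(iterator), stream)

    def load_stream(self, stream):
        # Flatten the batches back into individual objects.
        for batch in self.serde.load_stream(stream):
            for obj in batch:
                yield obj


# Round-trip five integers through a batch size of two.
buf = io.BytesIO()
batched = SketchBatchedSerDe(SketchPickleSerDe(), batch_size=2)
batched.dump_stream(range(5), buf)

buf.seek(0)
roundtrip = list(batched.load_stream(buf))

# Reading the same bytes with the unbatched SerDe exposes the raw batches.
buf.seek(0)
frames = list(SketchPickleSerDe().load_stream(buf))
```

Because the wrapper only regroups the input and output iterators, the same decorator would work around any unbatched SerDe that exposes the same two methods, which is presumably what makes the wrapping approach attractive here.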
