Kryo won’t make a major impact on PySpark, because PySpark just stores data as byte[]
objects, which are fast to serialize even with Java serialization. But it may be worth a try:
you would just set spark.serializer and not try to register any classes. What
might make more impact is storing data as MEMORY_ONLY_SER
I'm looking at the Tuning Guide suggestion to use Kryo instead of default
serialization. My questions:
Does PySpark use Java serialization by default, as Scala Spark does? If
so, then...
Can I use Kryo with PySpark instead? The instructions say I should
register my classes with the Kryo Seriali