Mike Kaplinskiy created BEAM-5775:
-------------------------------------
Summary: Make the spark runner not serialize data unless spark is
spilling to disk
Key: BEAM-5775
URL: https://issues.apache.org/jira/browse/BEAM-5775
Project: Beam
Issue Type: Improvement
Components: runner-spark
Reporter: Mike Kaplinskiy
Assignee: Amit Sela
Currently for storage level MEMORY_ONLY, Beam does not coder-ify the data. This
lets Spark keep the data in memory avoiding the serialization round trip.
Unfortunately the logic is fairly coarse - as soon as you switch to
MEMORY_AND_DISK, Beam coder-ifys the data even though Spark might have chosen
to keep the data in memory, incurring the serialization overhead.
Ideally Beam would serialize the data lazily - as Spark chooses to spill to
disk. This would be a change in behavior when using beam, but luckily Spark has
a solution for folks that want data serialized in memory - MEMORY_AND_DISK_SER
will keep the data serialized.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)