[
https://issues.apache.org/jira/browse/BEAM-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17122233#comment-17122233
]
Kenneth Knowles commented on BEAM-5775:
---------------------------------------
This issue is assigned but has not received an update in 30 days so it has been
labeled "stale-assigned". If you are still working on the issue, please give an
update and remove the label. If you are no longer working on the issue, please
unassign so someone else may work on it. In 7 days the issue will be
automatically unassigned.
> Make the spark runner not serialize data unless spark is spilling to disk
> -------------------------------------------------------------------------
>
> Key: BEAM-5775
> URL: https://issues.apache.org/jira/browse/BEAM-5775
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Mike Kaplinskiy
> Assignee: Mike Kaplinskiy
> Priority: P3
> Labels: stale-assigned
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data.
> This lets Spark keep the data in memory, avoiding the serialization round
> trip. Unfortunately the logic is fairly coarse: as soon as you switch to
> MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have
> chosen to keep it in memory, incurring the serialization overhead.
>
> Ideally Beam would serialize the data lazily, only when Spark chooses to
> spill to disk. This would be a change in behavior when using Beam, but
> luckily Spark has a solution for folks who want data serialized in memory:
> MEMORY_AND_DISK_SER will keep the data serialized.
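
The difference between eager and lazy serialization described above can be
illustrated with a toy model. This is only a sketch and is not Beam or Spark
code: ToyCache, memory_slots, eager_serialize and encode_count are
hypothetical names, pickle stands in for a Beam coder, and a dict stands in
for disk. It counts how many values get encoded when only some of them have
to spill.

```python
import pickle


class ToyCache:
    """Toy stand-in for a block store; not Spark or Beam code."""

    def __init__(self, memory_slots, eager_serialize):
        self.memory_slots = memory_slots        # how many values fit in memory
        self.eager_serialize = eager_serialize  # True: encode everything up front
        self.in_memory = {}                     # key -> live object (or bytes if eager)
        self.on_disk = {}                       # key -> encoded bytes (simulated disk)
        self.encode_calls = 0

    def _encode(self, value):
        # pickle is an assumption here; it plays the role of a coder.
        self.encode_calls += 1
        return pickle.dumps(value)

    def put(self, key, value):
        if len(self.in_memory) < self.memory_slots:
            # Eager mode pays the encoding cost even though the value stays in memory.
            self.in_memory[key] = self._encode(value) if self.eager_serialize else value
        else:
            # No room left: spill. Disk always needs bytes, so encode now.
            self.on_disk[key] = self._encode(value)

    def get(self, key):
        if key in self.in_memory:
            stored = self.in_memory[key]
            return pickle.loads(stored) if self.eager_serialize else stored
        return pickle.loads(self.on_disk[key])


def encode_count(eager_serialize, n_values=5, memory_slots=3):
    """Store n_values items and report how many had to be encoded."""
    cache = ToyCache(memory_slots, eager_serialize)
    for i in range(n_values):
        cache.put(i, {"id": i})
    # Every value must still be readable, whichever path it took.
    assert all(cache.get(i) == {"id": i} for i in range(n_values))
    return cache.encode_calls
```

With five values and room for three in memory, the eager variant encodes all
five, while the lazy variant encodes only the two that spill.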
--
This message was sent by Atlassian Jira
(v8.3.4#803005)