[
https://issues.apache.org/jira/browse/BEAM-5775?focusedWorklogId=230558&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-230558
]
ASF GitHub Bot logged work on BEAM-5775:
----------------------------------------
Author: ASF GitHub Bot
Created on: 22/Apr/19 00:19
Start Date: 22/Apr/19 00:19
Worklog Time Spent: 10m
Work Description: mikekap commented on pull request #8371: [BEAM-5775]
Move (most) of the batch spark pipelines' transformations to using lazy
serialization.
URL: https://github.com/apache/beam/pull/8371
This avoids unnecessary serialization. For example, if a groupByKey is
happening and part of the shuffle ends up on the current worker, we skip the
pointless serialize/deserialize cycle for that portion of the data.
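To illustrate the idea, here is a simplified sketch of a value that carries its coder but only runs it when Java serialization is actually triggered. This is not the wrapper added in this PR; the class and method names below are made up for illustration.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.util.CoderUtils;

// Illustrative sketch only: keep the decoded value in memory and run the Beam
// coder only when Spark actually serializes the object (a shuffle that leaves
// the worker, or a spill to disk).
public class LazyCodedValue<T> implements Serializable {
  private transient T value;        // kept as-is while the object stays in memory
  private transient Coder<T> coder; // not serialized; callers re-supply it on read
  private byte[] encoded;           // populated only when serialization happens

  public LazyCodedValue(T value, Coder<T> coder) {
    this.value = value;
    this.coder = coder;
  }

  public T get(Coder<T> coder) throws IOException {
    if (value == null && encoded != null) {
      value = CoderUtils.decodeFromByteArray(coder, encoded); // decode lazily, on first access
    }
    return value;
  }

  private void writeObject(ObjectOutputStream out) throws IOException {
    if (encoded == null) {
      // First time this object crosses a serialization boundary: encode with the coder.
      encoded = CoderUtils.encodeToByteArray(coder, value);
    }
    out.defaultWriteObject(); // writes only the non-transient 'encoded' field
  }

  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    in.defaultReadObject(); // restores 'encoded'; decoding is deferred until get()
  }
}
```

As long as the object never leaves the worker, `writeObject` never runs and the value stays un-encoded.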
The main substantive change in this PR (other than replacing `byte[]` with
`ValueAndCoderSerializable`) is in `GroupNonMergingWindowsFunctions`. The
semantics are slightly different in that we defer to Spark's serializer to
serialize the values. This allows the previous optimization to keep working in
a lazy way: if there are a lot of windows for a single value, Spark *should*
serialize the value only once since it's the same reference. If Kryo is being
used, the option `spark.kryo.referenceTracking` controls this behavior and
defaults to true. For Java serialization, reference sharing is the only behavior available.
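For completeness, here is how those settings could be pinned explicitly when building the Spark config. This is illustrative only; `spark.kryo.referenceTracking` already defaults to `true`, so on a stock configuration this is a no-op.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.KryoSerializer;

public class KryoReferenceTrackingConfig {
  public static void main(String[] args) {
    // Illustrative only: pin the settings the reference-sharing optimization relies on.
    SparkConf conf = new SparkConf()
        .setAppName("beam-on-spark")
        .set("spark.serializer", KryoSerializer.class.getName())
        // Default is already true; repeated references within one stream are written once.
        .set("spark.kryo.referenceTracking", "true");
    System.out.println(conf.toDebugString());
  }
}
```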
I didn't touch spark streaming in this PR because I'm not sure how to
address the backwards compatibility problem. Any thoughts there?
R: @iemejia
------------------------
Thank you for your contribution! Follow this checklist to help us
incorporate your contribution quickly and easily:
- [x] [**Choose
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[Jenkins post-commit status badges for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners.]
Pre-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
[Jenkins pre-commit status badges (portable and non-portable) for Java, Python, Go, and the Website.]
See
[.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md)
for the trigger phrases, status, and links of all Jenkins jobs.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 230558)
Time Spent: 9h 40m (was: 9.5h)
> Make the spark runner not serialize data unless spark is spilling to disk
> -------------------------------------------------------------------------
>
> Key: BEAM-5775
> URL: https://issues.apache.org/jira/browse/BEAM-5775
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Mike Kaplinskiy
> Assignee: Mike Kaplinskiy
> Priority: Minor
> Labels: triaged
> Fix For: 2.13.0
>
> Time Spent: 9h 40m
> Remaining Estimate: 0h
>
> Currently for storage level MEMORY_ONLY, Beam does not coder-ify the data.
> This lets Spark keep the data in memory, avoiding the serialization round
> trip. Unfortunately the logic is fairly coarse: as soon as you switch to
> MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen
> to keep the data in memory, incurring the serialization overhead anyway.
>
> Ideally Beam would serialize the data lazily, as Spark chooses to spill to
> disk. This would be a change in behavior when using Beam, but luckily Spark
> has a solution for folks who want data serialized in memory:
> MEMORY_AND_DISK_SER will keep the data serialized.
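A minimal sketch of how a pipeline author could opt back into serialized in-memory caching under that behavior, assuming the Spark runner's `--storageLevel` pipeline option:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class StorageLevelExample {
  public static void main(String[] args) {
    // Assumes the Spark runner exposes a storageLevel option; pick
    // MEMORY_AND_DISK_SER to keep cached data serialized in memory.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setStorageLevel("MEMORY_AND_DISK_SER");

    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(Create.of("a", "b", "c")); // placeholder transform for the sketch
    pipeline.run().waitUntilFinish();
  }
}
```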
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)