[
https://issues.apache.org/jira/browse/BEAM-5775?focusedWorklogId=155233&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-155233
]
ASF GitHub Bot logged work on BEAM-5775:
----------------------------------------
Author: ASF GitHub Bot
Created on: 17/Oct/18 00:51
Start Date: 17/Oct/18 00:51
Worklog Time Spent: 10m
Work Description: mikekap opened a new pull request #6714: [BEAM-5775]
Spark: implement a custom class to lazily encode values for persistence.
URL: https://github.com/apache/beam/pull/6714
Spark's `StorageLevel` is the preferred mechanism for deciding what gets
serialized, when, and where. With this change, Beam respects Spark's choice to keep
data deserialized in memory, even if the storage level *may* spill to disk (e.g.
MEMORY_AND_DISK).
This PR also includes a drive-by fix for the `MEMORY_ONLY_2` storage level. The
code previously assumed that no serialization was necessary, which isn't
strictly true: the `_2` suffix means "replicate to other nodes", i.e. the data is
serialized over the network.
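The lazy-encoding idea can be sketched with a small self-contained holder class. This is an illustrative stand-in, not the PR's actual implementation: the class name `LazyEncodedValue` is hypothetical, and a plain UTF-8 round trip stands in for a Beam `Coder`. The point is that the wrapped value stays as a live Java object while cached in memory, and only gets encoded when Java serialization actually runs, i.e. when Spark spills or replicates:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

/**
 * Illustrative lazily-encoding holder: the wrapped value is kept as a plain
 * object in memory and is only encoded when Java serialization actually
 * happens (e.g. when Spark spills to disk or replicates over the network).
 */
final class LazyEncodedValue implements Serializable {
  // In-memory form; never written to the stream directly.
  private transient String value;

  LazyEncodedValue(String value) {
    this.value = value;
  }

  String get() {
    return value;
  }

  // Called only when the object is really serialized: encode now, not earlier.
  private void writeObject(ObjectOutputStream out) throws IOException {
    byte[] encoded = value.getBytes(StandardCharsets.UTF_8); // stand-in for a Beam Coder
    out.writeInt(encoded.length);
    out.write(encoded);
  }

  // Called on deserialization: decode back to the in-memory form.
  private void readObject(ObjectInputStream in) throws IOException {
    byte[] encoded = new byte[in.readInt()];
    in.readFully(encoded);
    value = new String(encoded, StandardCharsets.UTF_8);
  }
}

public class LazyEncodeDemo {
  /** Round-trips a value through Java serialization, as Spark would when spilling. */
  static String roundTrip(String s) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(new LazyEncodedValue(s));
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return ((LazyEncodedValue) in.readObject()).get();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip("hello spark")); // prints "hello spark"
  }
}
```

With a wrapper like this, the encoding cost is paid only on the code paths where Spark genuinely needs bytes, instead of eagerly for every element as soon as the storage level is anything other than MEMORY_ONLY.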
------------------------
Follow this checklist to help us incorporate your contribution quickly and
easily:
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
It will help us expedite review of your Pull Request if you tag someone
(e.g. `@username`) to look at it.
Post-Commit Tests Status (on master branch)
------------------------------------------------------------------------------------------------
Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
--- | --- | --- | --- | --- | --- | --- | ---
Go | [Build Status](https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/lastCompletedBuild/) | --- | --- | --- | --- | --- | ---
Java | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex_Gradle/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Gradle/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump_Gradle/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza_Gradle/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark_Gradle/lastCompletedBuild/)
Python | [Build Status](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/) | --- | [Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/) <br/> [Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [Build Status](https://builds.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/) | --- | --- | ---
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 155233)
Time Spent: 10m
Remaining Estimate: 0h
> Make the spark runner not serialize data unless spark is spilling to disk
> -------------------------------------------------------------------------
>
> Key: BEAM-5775
> URL: https://issues.apache.org/jira/browse/BEAM-5775
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Reporter: Mike Kaplinskiy
> Assignee: Amit Sela
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently for storage level MEMORY_ONLY, Beam does not coder-ify the data.
> This lets Spark keep the data in memory, avoiding the serialization round
> trip. Unfortunately the logic is fairly coarse: as soon as you switch to
> MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen
> to keep the data in memory, incurring the serialization overhead.
>
> Ideally Beam would serialize the data lazily, as Spark chooses to spill to
> disk. This would be a change in behavior when using Beam, but luckily Spark
> has a solution for folks who want data serialized in memory:
> MEMORY_AND_DISK_SER keeps the data serialized.
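The storage-level distinctions above can be sketched with a minimal stand-in for Spark's `StorageLevel` flags. This is illustrative only: the `Level` class and `mayNeedSerialization` predicate are hypothetical names, not Beam or Spark API; real code would consult `org.apache.spark.storage.StorageLevel`, whose levels carry the same use-disk / deserialized / replication flags:

```java
/** Minimal illustrative stand-in for Spark's StorageLevel flags. */
final class Level {
  final boolean useDisk;       // may spill to disk
  final boolean deserialized;  // kept as live objects in memory (vs. bytes)
  final int replication;       // number of replicas across executors

  Level(boolean useDisk, boolean deserialized, int replication) {
    this.useDisk = useDisk;
    this.deserialized = deserialized;
    this.replication = replication;
  }
}

public class StorageLevelCheck {
  /**
   * A cached value may need to be serialized at some point if the level can
   * hit disk, stores data serialized in memory, or replicates to other
   * executors (the `_2` levels), which sends it over the network.
   */
  static boolean mayNeedSerialization(Level level) {
    return level.useDisk || !level.deserialized || level.replication > 1;
  }

  public static void main(String[] args) {
    Level memoryOnly = new Level(false, true, 1);       // MEMORY_ONLY
    Level memoryOnly2 = new Level(false, true, 2);      // MEMORY_ONLY_2
    Level memoryAndDisk = new Level(true, true, 1);     // MEMORY_AND_DISK
    Level memoryAndDiskSer = new Level(true, false, 1); // MEMORY_AND_DISK_SER

    System.out.println(mayNeedSerialization(memoryOnly));       // false
    System.out.println(mayNeedSerialization(memoryOnly2));      // true: replicated over network
    System.out.println(mayNeedSerialization(memoryAndDisk));    // true: may spill to disk
    System.out.println(mayNeedSerialization(memoryAndDiskSer)); // true: stored as bytes
  }
}
```

Only MEMORY_ONLY never needs serialization at all; every other level may need it, which is exactly why deferring the encoding until Spark actually serializes (rather than encoding eagerly for any non-MEMORY_ONLY level) avoids the overhead in the common in-memory case.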
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)