[
https://issues.apache.org/jira/browse/BEAM-5439?focusedWorklogId=158135&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-158135
]
ASF GitHub Bot logged work on BEAM-5439:
----------------------------------------
Author: ASF GitHub Bot
Created on: 24/Oct/18 13:19
Start Date: 24/Oct/18 13:19
Worklog Time Spent: 10m
Work Description: jto opened a new pull request #6812: [BEAM-5439] fixes
performance issue in StringUtf8Coder
URL: https://github.com/apache/beam/pull/6812
fixes https://issues.apache.org/jira/browse/BEAM-5439
ping @lukecwik
I searched the source code for similar uses of `DataInputStream`, but it
seems that `ByteStreams.readFully` is already used everywhere else.
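
For context on why a `readFully`-style call matters once `DataInputStream` is out of the picture: a bare `InputStream.read(byte[], int, int)` is allowed to return before the buffer is full. Below is a minimal JDK-only sketch of the loop. The names `ReadFullySketch`, `readFully`, and `oneByteAtATime` are hypothetical and used for illustration only; the helper stands in for Guava's `ByteStreams.readFully` and is not code from this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ReadFullySketch {
  // Hypothetical helper: keeps calling read() until the buffer is full,
  // and fails loudly if the stream ends first.
  static void readFully(InputStream in, byte[] buf) throws IOException {
    int off = 0;
    while (off < buf.length) {
      int n = in.read(buf, off, buf.length - off);
      if (n < 0) {
        throw new EOFException(
            "stream ended after " + off + " of " + buf.length + " bytes");
      }
      off += n;
    }
  }

  // Test stream that hands out at most one byte per read call, which the
  // InputStream contract permits (sockets and pipes often behave this way).
  static InputStream oneByteAtATime(byte[] data) {
    return new ByteArrayInputStream(data) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 1));
      }
    };
  }
}
```

With such a stream, a single `read` call fills only part of the buffer, while the loop fills all of it.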
------------------------
Follow this checklist to help us incorporate your contribution quickly and
easily:
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA
issue, if applicable. This will automatically link the pull request to the
issue.
- [ ] If this contribution is large, please file an Apache [Individual
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
It will help us expedite review of your Pull Request if you tag someone
(e.g. `@username`) to look at it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 158135)
Time Spent: 10m
Remaining Estimate: 0h
> StringUtf8Coder is slower than expected
> ---------------------------------------
>
> Key: BEAM-5439
> URL: https://issues.apache.org/jira/browse/BEAM-5439
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Affects Versions: 2.6.0
> Reporter: Julien Tournay
> Assignee: Julien Tournay
> Priority: Major
> Labels: perfomance
> Time Spent: 10m
> Remaining Estimate: 0h
>
> While working on Scio's next version, I noticed that {{StringUtf8Coder}} is
> slower than expected.
> I wrote a small micro-benchmark using {{jmh}} that serialises a Scala List
> of 1000 Strings using a custom {{Coder[List[_]]}}. While profiling it, I
> noticed that a lot of time is spent in
> {{java.io.DataInputStream.<init>(java.io.InputStream)}}.
> In {{StringUtf8Coder}}, the {{readString}} method reads bytes directly, so a
> {{DataInputStream}} does not seem to be necessary.
> I replaced {{StringUtf8Coder}} with a {{Coder[String]}} implementation (in
> Scala), that is essentially the same as {{StringUtf8Coder}} but is not using
> {{DataInputStream}}.
>
> {code:scala}
> import java.io.{InputStream, OutputStream}
> import java.nio.charset.StandardCharsets
> import com.google.common.base.Utf8
> import com.google.common.io.ByteStreams
> import org.apache.beam.sdk.coders.{AtomicCoder, CoderException}
> import org.apache.beam.sdk.util.VarInt
> import org.apache.beam.sdk.values.TypeDescriptor
>
> private final object ScioStringCoder extends AtomicCoder[String] {
>   // Reads a VarInt length prefix, then exactly that many UTF-8 bytes.
>   def decode(dis: InputStream): String = {
>     val len = VarInt.decodeInt(dis)
>     if (len < 0) {
>       throw new CoderException("Invalid encoded string length: " + len)
>     }
>     val bytes = new Array[Byte](len)
>     // InputStream.read may stop short of len bytes; readFully does not.
>     ByteStreams.readFully(dis, bytes)
>     new String(bytes, StandardCharsets.UTF_8)
>   }
>
>   // Writes the UTF-8 byte count as a VarInt, followed by the bytes.
>   def encode(value: String, outStream: OutputStream): Unit = {
>     val bytes = value.getBytes(StandardCharsets.UTF_8)
>     VarInt.encode(bytes.length, outStream)
>     outStream.write(bytes)
>   }
>
>   override def verifyDeterministic(): Unit = ()
>   override def consistentWithEquals(): Boolean = true
>
>   private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
>   override def getEncodedTypeDescriptor(): TypeDescriptor[String] = TYPE_DESCRIPTOR
>
>   // Size of the VarInt prefix plus the UTF-8 payload, without encoding.
>   override def getEncodedElementByteSize(value: String): Long = {
>     if (value == null) {
>       throw new CoderException("cannot encode a null String")
>     }
>     val size = Utf8.encodedLength(value)
>     VarInt.getLength(size) + size
>   }
> }
> {code}
>
> Using that {{Coder}} is about 27% faster than {{StringUtf8Coder}}. I've added
> the jmh output in "Docs Text".
> Is there any particular reason to use {{DataInputStream}}?
> Do you think we can remove it to make {{StringUtf8Coder}} more efficient?
>
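
The length-prefixed layout described in the issue (a variable-length integer byte count followed by the UTF-8 bytes) can be illustrated without Beam or `DataInputStream`. This is a minimal sketch under stated assumptions: `LengthPrefixedUtf8Sketch` and all of its methods are hypothetical names, and the generic base-128 varint used here is an assumption that may not match Beam's `VarInt` byte for byte.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class LengthPrefixedUtf8Sketch {
  // Generic base-128 varint (assumption, not Beam's VarInt): 7 payload bits
  // per byte, low-order group first, high bit set when more bytes follow.
  static void writeVarInt(int v, OutputStream out) throws IOException {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  static int readVarInt(InputStream in) throws IOException {
    int result = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) throw new EOFException("truncated varint");
      if (shift >= 32) throw new IOException("varint too long");
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) return result;
      shift += 7;
    }
  }

  // Writes the UTF-8 byte count, then the bytes themselves.
  static void encode(String value, OutputStream out) throws IOException {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    writeVarInt(bytes.length, out);
    out.write(bytes);
  }

  // Reads straight from the InputStream; no DataInputStream wrapper is built.
  static String decode(InputStream in) throws IOException {
    int len = readVarInt(in);
    if (len < 0) throw new IOException("Invalid encoded string length: " + len);
    byte[] bytes = new byte[len];
    int off = 0;
    while (off < len) { // loop because read() may return fewer bytes than asked
      int n = in.read(bytes, off, len - off);
      if (n < 0) throw new EOFException("truncated string payload");
      off += n;
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

Because each value carries its own length, several strings can be written back to back on one stream and decoded in order, which is the situation a list coder is in when it delegates to a string coder per element.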
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)