[ 
https://issues.apache.org/jira/browse/BEAM-5439?focusedWorklogId=158135&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-158135
 ]

ASF GitHub Bot logged work on BEAM-5439:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Oct/18 13:19
            Start Date: 24/Oct/18 13:19
    Worklog Time Spent: 10m 
      Work Description: jto opened a new pull request #6812: [BEAM-5439] fixes 
performance issue in StringUtf8Coder
URL: https://github.com/apache/beam/pull/6812
 
 
   fixes https://issues.apache.org/jira/browse/BEAM-5439
   ping @lukecwik 
   
   I looked through the source code for similar uses of `DataInputStream`, but 
it seems that `ByteStreams.readFully` is used everywhere else.
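
   For readers unfamiliar with why a read-fully helper matters here, the sketch 
below shows that a single `InputStream.read(byte[])` call may return before the 
buffer is full, while a read-fully loop keeps reading until it is. It is a 
minimal illustration, not code from this pull request: the class 
`ShortReadDemo` and its methods `chunked` and `readFully` are hypothetical 
names, and the hand-written loop only stands in for a helper such as 
`ByteStreams.readFully`.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from Beam.
class ShortReadDemo {

  /** Returns a stream over `data` whose bulk reads hand back at most `chunk` bytes per call. */
  static InputStream chunked(byte[] data, int chunk) {
    return new ByteArrayInputStream(data) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, chunk));
      }
    };
  }

  /** Keeps calling read until `buf` is full; throws EOFException if the stream ends first. */
  static void readFully(InputStream in, byte[] buf) throws IOException {
    int filled = 0;
    while (filled < buf.length) {
      int n = in.read(buf, filled, buf.length - filled);
      if (n < 0) {
        throw new EOFException("stream ended after " + filled + " of " + buf.length + " bytes");
      }
      filled += n;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "hello, beam".getBytes(StandardCharsets.UTF_8);

    // A single read call is allowed to stop early.
    byte[] single = new byte[payload.length];
    int n = chunked(payload, 4).read(single);
    System.out.println("single read filled " + n + " of " + single.length + " bytes");

    // The loop keeps reading until the whole buffer is populated.
    byte[] full = new byte[payload.length];
    readFully(chunked(payload, 4), full);
    System.out.println("readFully decoded: " + new String(full, StandardCharsets.UTF_8));
  }
}
```

   With the 4-byte chunking assumed above, the single `read` call fills only 
the first 4 bytes of the 11-byte payload, whereas the loop recovers the whole 
string. A decoder that reads a length-prefixed payload should therefore not 
rely on one bare `read` call.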
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 158135)
            Time Spent: 10m
    Remaining Estimate: 0h

> StringUtf8Coder is slower than expected
> ---------------------------------------
>
>                 Key: BEAM-5439
>                 URL: https://issues.apache.org/jira/browse/BEAM-5439
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>    Affects Versions: 2.6.0
>            Reporter: Julien Tournay
>            Assignee: Julien Tournay
>            Priority: Major
>              Labels: perfomance
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on Scio's next version, I noticed that {{StringUtf8Coder}} is 
> slower than expected.
> I wrote a small micro-benchmark using {{jmh}} that serialises a (Scala) List 
> of 1000 Strings using a custom {{Coder[List[_]]}}. While profiling it, I 
> noticed that a lot of time is spent in 
> {{java.io.DataInputStream.<init>(java.io.InputStream)}}.
> Looking at the code for {{StringUtf8Coder}}, the {{readString}} method reads 
> bytes directly, so a {{DataInputStream}} does not seem to be necessary.
> I replaced {{StringUtf8Coder}} with a {{Coder[String]}} implementation (in 
> Scala) that is essentially the same as {{StringUtf8Coder}} but does not use 
> {{DataInputStream}}.
>  
> {code:scala}
> import java.io.{InputStream, OutputStream}
> import java.nio.charset.StandardCharsets
> import com.google.common.base.Utf8
> import com.google.common.io.ByteStreams
> import org.apache.beam.sdk.coders.{AtomicCoder, CoderException}
> import org.apache.beam.sdk.util.VarInt
> import org.apache.beam.sdk.values.TypeDescriptor
>
> private object ScioStringCoder extends AtomicCoder[String] {
>   // Reads a VarInt length prefix, then exactly that many UTF-8 bytes.
>   def decode(dis: InputStream): String = {
>     val len = VarInt.decodeInt(dis)
>     if (len < 0) {
>       throw new CoderException("Invalid encoded string length: " + len)
>     }
>     val bytes = new Array[Byte](len)
>     // A bare dis.read(bytes) may return before the array is full.
>     ByteStreams.readFully(dis, bytes)
>     new String(bytes, StandardCharsets.UTF_8)
>   }
>   // Writes the UTF-8 byte count as a VarInt, followed by the bytes.
>   def encode(value: String, outStream: OutputStream): Unit = {
>     val bytes = value.getBytes(StandardCharsets.UTF_8)
>     VarInt.encode(bytes.length, outStream)
>     outStream.write(bytes)
>   }
>   override def verifyDeterministic(): Unit = ()
>   override def consistentWithEquals(): Boolean = true
>   private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
>   override def getEncodedTypeDescriptor(): TypeDescriptor[String] = TYPE_DESCRIPTOR
>   // Size of the length prefix plus the UTF-8 payload, without encoding the value.
>   override def getEncodedElementByteSize(value: String): Long = {
>     if (value == null) {
>       throw new CoderException("cannot encode a null String")
>     }
>     val size = Utf8.encodedLength(value)
>     VarInt.getLength(size) + size
>   }
> }
> {code}
>  
> Using that {{Coder}} is about 27% faster than {{StringUtf8Coder}}. I've added 
> the jmh output in "Docs Text".
> Is there any particular reason to use {{DataInputStream}}? 
> Do you think we can remove it to make {{StringUtf8Coder}} more efficient?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
