Julien Tournay created BEAM-5439:
------------------------------------

             Summary: StringUtf8Coder is slower than expected
                 Key: BEAM-5439
                 URL: https://issues.apache.org/jira/browse/BEAM-5439
             Project: Beam
          Issue Type: Bug
          Components: sdk-java-core
    Affects Versions: 2.6.0
            Reporter: Julien Tournay
            Assignee: Kenneth Knowles


While working on Scio's next version, I noticed that {{StringUtf8Coder}} is 
slower than expected.

I wrote a small micro-benchmark using {{jmh}} that serialises a List of 1000 
Strings using a custom {{Coder[List[_]]}}. While profiling it, I noticed that a 
lot of time is spent in {{java.io.DataInputStream.<init>(java.io.InputStream)}}.

Looking at the code for {{StringUtf8Coder}}, the {{readString}} method reads 
bytes directly from the stream, so a {{DataInputStream}} does not seem to be 
necessary.
I replaced {{StringUtf8Coder}} with a {{Coder[String]}} implementation (in 
Scala) that is essentially the same as {{StringUtf8Coder}} but does not use 
{{DataInputStream}}.
 
{code:scala}
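import java.io.{InputStream, OutputStream}
import org.apache.beam.sdk.coders.{AtomicCoder, CoderException}

private final object ScioStringCoder extends AtomicCoder[String] {
  import org.apache.beam.sdk.util.VarInt
  import java.nio.charset.StandardCharsets
  import org.apache.beam.sdk.values.TypeDescriptor
  import com.google.common.base.Utf8
  import com.google.common.io.ByteStreams

  // Reads the varint length prefix, then the UTF-8 payload, straight from the stream.
  def decode(dis: InputStream): String = {
    val len = VarInt.decodeInt(dis)
    if (len < 0) {
      throw new CoderException("Invalid encoded string length: " + len)
    }
    val bytes = new Array[Byte](len)
    // InputStream.read may return fewer bytes than requested;
    // readFully keeps reading until the array is full.
    ByteStreams.readFully(dis, bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }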

  def encode(value: String, outStream: OutputStream): Unit = {
    val bytes = value.getBytes(StandardCharsets.UTF_8)
    VarInt.encode(bytes.length, outStream)
    outStream.write(bytes)
  }

  override def verifyDeterministic() = ()
  override def consistentWithEquals() = true
  private val TYPE_DESCRIPTOR = new TypeDescriptor[String] {}
  override def getEncodedTypeDescriptor() = TYPE_DESCRIPTOR
  override def getEncodedElementByteSize(value: String) = {
    if (value == null) {
      throw new CoderException("cannot encode a null String")
    }
    val size = Utf8.encodedLength(value)
    VarInt.getLength(size) + size
  }
}
{code}
 
In this benchmark, that {{Coder}} is about 27% faster than {{StringUtf8Coder}}. 
I've added the jmh output in "Docs Text".
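
For anyone who wants to try the decode path without pulling in Beam, here is a 
minimal, self-contained sketch of the same idea: a length prefix followed by 
UTF-8 bytes, read straight from the {{InputStream}} with no 
{{DataInputStream}} wrapper. The names {{LengthPrefixedUtf8}}, 
{{writeVarInt}} and {{readVarInt}} are hypothetical, and the simple base-128 
varint only stands in for Beam's {{VarInt}}; this is not the Beam 
implementation.

{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException, IOException, InputStream, OutputStream}
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the coder's wire format; not Beam code.
object LengthPrefixedUtf8 {

  // Writes a non-negative Int as a base-128 varint, low 7 bits first.
  def writeVarInt(value: Int, out: OutputStream): Unit = {
    var v = value
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80)
      v = v >>> 7
    }
    out.write(v)
  }

  // Reads a varint written by writeVarInt, one byte at a time.
  def readVarInt(in: InputStream): Int = {
    var result = 0
    var shift = 0
    var more = true
    while (more) {
      if (shift > 28) throw new IOException("varint is longer than 5 bytes")
      val b = in.read()
      if (b < 0) throw new EOFException("stream ended inside a varint")
      result = result | ((b & 0x7F) << shift)
      shift += 7
      more = (b & 0x80) != 0
    }
    result
  }

  def encode(value: String, out: OutputStream): Unit = {
    val bytes = value.getBytes(StandardCharsets.UTF_8)
    writeVarInt(bytes.length, out)
    out.write(bytes)
  }

  // Decodes directly from the InputStream, looping because a single
  // read call may return fewer bytes than requested.
  def decode(in: InputStream): String = {
    val len = readVarInt(in)
    if (len < 0) throw new IOException("Invalid encoded string length: " + len)
    val bytes = new Array[Byte](len)
    var off = 0
    while (off < len) {
      val n = in.read(bytes, off, len - off)
      if (n < 0) throw new EOFException("expected " + len + " bytes, got " + off)
      off += n
    }
    new String(bytes, StandardCharsets.UTF_8)
  }
}

object RoundTrip {
  def main(args: Array[String]): Unit = {
    val words = List("beam", "scio", "caf\u00e9")
    val out = new ByteArrayOutputStream()
    words.foreach(w => LengthPrefixedUtf8.encode(w, out))

    // Decode the strings back, in order, from the same byte buffer.
    val in = new ByteArrayInputStream(out.toByteArray)
    val decoded = words.map(_ => LengthPrefixedUtf8.decode(in))
    assert(decoded == words)
  }
}
{code}

The explicit read loop does the job that {{DataInputStream.readFully}} would 
otherwise do, so nothing has to be allocated per decoded element beyond the 
byte array and the resulting String.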

Is there any particular reason to use {{DataInputStream}}? 
Do you think we can remove it to make {{StringUtf8Coder}} more efficient?
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
