[GitHub] [flink] shuttie opened a new pull request #10529: [FLINK-15171] [serialization] fix performance regression caused by too many buffer allocations on string serialization

GitBox Wed, 11 Dec 2019 03:07:49 -0800

shuttie opened a new pull request #10529: [FLINK-15171] [serialization] fix 
performance regression caused by too many buffer allocations on string 
serialization
URL: https://github.com/apache/flink/pull/10529
 
 
   ## What is the purpose of the change
   
   [FLINK-14346](https://issues.apache.org/jira/browse/FLINK-14346) Introduced 
a faster implementation for string [de]serialization. But while running the 
flink-benchmarks suite, there was a [performance regression 
found](https://issues.apache.org/jira/browse/FLINK-15171) for almost all 
serialization tests: a significant 10% drop-down for the total job throughput.
   
   Flame graph before the FLINK-14346 was applied:
   
![flink-gc-dec05](https://user-images.githubusercontent.com/999061/70614387-5cc2bd00-1c02-11ea-85b4-27ec8ece72db.png)
   
   Flame graph after the FLINK-14346 was applied:
   
![flink-gc-dec11](https://user-images.githubusercontent.com/999061/70614391-5f251700-1c02-11ea-9a8d-5c5d894692bf.png)
   
   From these almost identical graphs we may notice that the GC spends much 
more time cleaning up the heap with the FLINK-14346 applied.
   
   Running the new and old code with the allocation profiling proved the theory 
with higher allocation rate:
   
   Top allocations, with FLINK-14346:
   ```
          bytes  percent  samples  top
     ----------  -------  -------  ---
     8222540128   32.45%    40779  byte[]
     7509810768   29.64%    37258  char[]
     4320201040   17.05%    21491  java.lang.String
     1667513984    6.58%     8247  org.apache.flink.api.java.tuple.Tuple2
      749432744    2.96%     3711  org.apache.flink.api.java.tuple.Tuple8
      589192264    2.33%     2897  java.lang.String[]
      497193120    1.96%     2458  
org.apache.flink.streaming.runtime.streamrecord.StreamRecord
      478790376    1.89%     2372  org.apache.flink.api.java.tuple.Tuple2[]
      404943784    1.60%     2007  java.lang.ThreadLocal$ThreadLocalMap
      156780240    0.62%      564  java.nio.DirectByteBuffer
   ```
   
   Top allocations, no FLINK-14346:
   ```
          bytes  percent  samples  top
     ----------  -------  -------  ---
     7591122240   29.43%     3271  char[]
     5360582240   20.78%     2243  java.lang.ThreadLocal$ThreadLocalMap
     5147640184   19.96%     2231  java.lang.String
     1758207472    6.82%      765  org.apache.flink.api.java.tuple.Tuple2
     1717572128    6.66%      758  
java.util.concurrent.locks.AbstractQueuedSynchronizer$Node
      891013696    3.45%      380  org.apache.flink.api.java.tuple.Tuple8
      598698832    2.32%      266  java.lang.String[]
      440182240    1.71%      202  
org.apache.flink.streaming.runtime.streamrecord.StreamRecord
      364959680    1.41%      141  org.apache.flink.api.java.tuple.Tuple2[]
   ```
   So almost third of all the allocations made were done for these intermediate 
array buffers.
   
   All the benchmarks posted in the [original 
PR](https://github.com/apache/flink/pull/10358) were done on Ryzen 7 2700 (8 
physical cores), and the CPU used for the `flink-benchmarks` is i7 7700 (4 
physical cores). Also note that almost all the flink-benchmarks use 
parallelism=4, so:
   * new code generated +30% more garbage.
   * as originally performance was measured on a 8 core CPU with only 4 
threads, GC threads were scheduled on the idle cores and didn't interfere with 
the benchmark.
   * on the i7 7700 flink-benchmarks used 4 threads on 4 core CPU with 
additional active GC threads, heavily interfering with the main benchmark.
   * this is the reason why lowering the parallelism for [the 
stringHeavyBenchmark](src/main/java/org/apache/flink/benchmark/SerializationFrameworkMiniBenchmarks.java)
 improved the throughput: it added more space for the GC threads to run.
   
   With this PR we did the following:
   * added a static `ThreadLocal<byte[]>` buffer for short strings smaller than 
1024 characters
   * when the string is short enough, we do not allocate the buffer, but reuse 
the static one, eliminating the allocation completely.
   * for this case we need to always preallocate a small buffer for each worker 
thread, even if there is almost no string serialization at all.
   * for long strings we do a regular allocation as before.
   
   ## Brief change log
     - Add ThreadLocal byte buffer for write and read path for short strings 
instead of allocating it on each invocation.
   
   
   ## Verifying this change
   
   This change is already covered by existing tests, such as 
StringSerializationTest.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: ( **yes** / no / don't know)
     - The runtime per-record code paths (performance sensitive): (**yes** / no 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [flink] shuttie opened a new pull request #10529: [FLINK-15171] [serialization] fix performance regression caused by too many buffer allocations on string serialization

Reply via email to