Abacn commented on PR #24633:
URL: https://github.com/apache/beam/pull/24633#issuecomment-1364132421

   Multiple Issues of SDF read tests in both Java and Python xlang when bumping 
up the number of records:
   
   * Java Read -> (without ReShuffle) Count element have duplicates because a 
single kafka read can fail intermittently. Result in count > expected number 
(https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3484/consoleFull).
   <img width="1088" alt="image" 
src="https://user-images.githubusercontent.com/8010435/209372205-16f4d2c7-32a2-4b0a-bc69-a6abc932992e.png";>
   
   This is flaky rather than perma-red (streaming test succeeded in 
https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3482/consoleFull)
   This was not observed previously because the test was run on small dataset.
   
   Note that run 3482 failed because of another flake in batch test. The write 
had intermittent fail and retried, and then cause read hash check fail.
   
   * Java Read -> ReShuffle -> Count. The performance degrades significantly 
with a reshuffle inserted. Previously throughput ~ 100k/s (see above 
screenshot) add a ReShuffle it becomes ~20k/s 
(https://ci-beam.apache.org/job/beam_PerformanceTests_Kafka_IO/3488/console)
   <img width="1085" alt="image" 
src="https://user-images.githubusercontent.com/8010435/209373106-5d8b2469-9c64-45e8-9650-f1d571779a86.png";>
   
   * Python xlang Read -> (without ReShuffle) The pipeline does not scale, only 
one worker active 
(https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/5/console)
 and throughput is ~50k/s
   <img width="812" alt="image" 
src="https://user-images.githubusercontent.com/8010435/209373443-5a2927e3-4eb2-436c-a4d2-7b7c40a15127.png";>
   <img width="784" alt="image" 
src="https://user-images.githubusercontent.com/8010435/209373341-0c168c80-9560-43bc-b598-5a56c0669332.png";>
   
   * Python xlang Read -> ReShuffle -> Count. The performance has further 
reduced to only 1k/s throughput 
(https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/7/console)
   <img width="986" alt="image" 
src="https://user-images.githubusercontent.com/8010435/209373894-095653f6-b3e0-4294-a833-78c9fa1aa613.png";>
   (note to myself: still need to cancel the streaming pipeline after waituntil 
reached)
   
   * There is still a case that Python read can work, that is do not set 
`--streaming` pipeline option, then the job runs on batch mode (which is wierd 
to me either, ReadFromKafka transform itself does not induce a streaming 
pipeline). The succeeded job is 
https://ci-beam.apache.org/view/PerformanceTests/job/beam_PerformanceTests_xlang_KafkaIO_Python/3/console
 with a pretty high throughput.
   
   ---------
   
   Based on these attempts, Summary of change to make test work after bumping 
num_records
   
   * Prune kafka topic after test done. Otherwise the second test fail with 
kafka server run out of source.
   * For write pipeline, add a ReShuffle after generate records. Otherwise 
intermittent fail in kafka write will cause retry generate records and the hash 
value will change.
   * For Python pipeline, do not add `--streaming` for now.
   
   Issues revealed and need investigation:
   
   * For Java, investigate the performance degration when ReShuffle used 
downstream of Kafka read. Note that Java ReShuffle is marked as "deprecated".
   * For Python xlang, why CountMetrics transform downstream causing no 
parallelism when bundle is fused (this does not happen in Java pipeline).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to