[ https://issues.apache.org/jira/browse/GEODE-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291981#comment-17291981 ]

Hale Bales commented on GEODE-8950:
-----------------------------------

- the first known CI failure was on 02/04/2021
- we do not have CI history before 02/01/2021
- these failures are occurring both in CI and when run manually with the benchmark scripts
- the test that is failing was added in November of 2020
- running develop against 1.13.0 does not produce consistent benchmark results
- running with a baseline of 1.13.1 does not improve the failure rate
- running 1.13.0 against itself does not produce consistently passing results
- running develop against itself does not produce consistently passing results
- there have been no changes to benchmarks this year (as of Feb 26, 2021)
- there do not appear to be any suspect changes to geode core made this year
  - Jake Barrett, Donal Evans, and I have all looked at the commits
  - no commits are in the right area of the code
  - I have tested every code change that had even the slightest chance of
    affecting the performance of P2pPartitionedPutLongBenchmark
  - the dependency changes do not appear to have affected performance
- profiling the test for the following did not produce any useful information (see the profiling sketch after this list):
  - cpu usage
  - allocations
  - locks
- inspecting the gfs statistics files showed that (on a failing run):
  - develop did fewer puts than 1.13.0
  - develop had less cpu activity
  - develop received fewer bytes
  - these results are expected for a run where develop had lower throughput than
    1.13.0
- this benchmark has a very small payload size (see the put-loop sketch after this list)
  - in the past, the performance team saw a high degree of sensitivity in tests
    with small payloads
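
For reference, the kind of profiling data described above can be gathered with JDK Flight Recorder. The event names and recording shape below are an illustrative sketch under assumed tooling, not the exact setup used in this investigation:

{code:java}
import java.nio.file.Path;
import jdk.jfr.Recording;

// Illustrative sketch only: record CPU samples, allocations, and lock
// contention around a benchmark run, then dump the recording for analysis.
public class ProfileBenchmarkSketch {
  public static void main(String[] args) throws Exception {
    try (Recording recording = new Recording()) {
      recording.enable("jdk.ExecutionSample");             // cpu usage
      recording.enable("jdk.ObjectAllocationInNewTLAB");   // allocations
      recording.enable("jdk.ObjectAllocationOutsideTLAB"); // allocations
      recording.enable("jdk.JavaMonitorEnter");            // locks
      recording.start();

      runBenchmarkWorkload(); // hypothetical stand-in for the put workload

      recording.stop();
      recording.dump(Path.of("p2p-put-long.jfr"));
    }
  }

  private static void runBenchmarkWorkload() {
    // placeholder; the real workload is driven by geode-benchmarks
  }
}
{code}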

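On the payload-sensitivity point: the operation under test is essentially a put of an 8-byte long into a partitioned region on a peer member. The sketch below is assumed from the benchmark's name rather than copied from geode-benchmarks, and it shows why per-operation overhead, not payload transfer, dominates the cost:

{code:java}
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

// Rough, single-member sketch of the measured operation (the real benchmark
// spans multiple peers). Each put carries only a long key and a long value,
// so per-operation messaging and ack overhead far outweighs the payload bytes.
public class P2pPutLongSketch {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create(); // peer member, not a client
    Region<Long, Long> region = cache
        .<Long, Long>createRegionFactory(RegionShortcut.PARTITION)
        .create("benchmarkRegion");

    for (long i = 0; i < 1_000_000; i++) {
      region.put(i % 10_000, i); // tiny (8-byte) values: overhead-bound
    }

    cache.close();
  }
}
{code}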

conclusions:
- these failures do not appear to be caused by any code change
- these failures do not appear to be caused by any benchmarking change
- these failures do not appear to be caused by any dependency change
- the instability when running the same version/commit against itself points to
  the per-operation overhead dominating for such a small payload (illustrated in
  the sketch just below)
- there is no data to support that this failure is occurring more often than
  previously
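
To make the instability conclusion concrete: a run is flagged when the branch's throughput drops more than some tolerance below the baseline's, so for an overhead-bound, small-payload test the normal run-to-run variance can cross that tolerance even when both sides run identical code. The threshold and throughput numbers below are assumptions for illustration, not values from the geode-benchmarks analyzer:

{code:java}
// Hypothetical illustration of a noise-only "degradation". The 5% tolerance
// and the sample throughputs are assumed, not taken from the real analyzer.
public class DegradationFlagSketch {
  static final double TOLERANCE = 0.05; // assumed: flag if >5% below baseline

  static boolean isFlagged(double baselineOpsPerSec, double branchOpsPerSec) {
    return branchOpsPerSec < baselineOpsPerSec * (1.0 - TOLERANCE);
  }

  public static void main(String[] args) {
    // Same commit on both sides, but per-operation overhead makes throughput
    // noisy; a ~6% swing between runs is enough to trip the flag.
    double baselineRun = 250_000; // ops/sec, assumed
    double branchRun = 234_000;   // ops/sec, assumed
    System.out.println("flagged: " + isFlagged(baselineRun, branchRun));
  }
}
{code}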

proposed next steps:
- keep running this test and keep track of the failure rate
- if the failure rate increases, investigate the peer-to-peer code
- if the failure rate stays the same, comment out the test
- long term, invest time in a significant refactor of the peer-to-peer code

> Benchmark failure - P2pPartitionedPutLongBenchmark
> --------------------------------------------------
>
>                 Key: GEODE-8950
>                 URL: https://issues.apache.org/jira/browse/GEODE-8950
>             Project: Geode
>          Issue Type: Bug
>          Components: benchmarks
>    Affects Versions: 1.15.0
>            Reporter: Donal Evans
>            Assignee: Hale Bales
>            Priority: Major
>
> Multiple benchmark failures due to P2pPartitionedPutLongBenchmark have been 
> seen recently.
> This run saw 3 out of the 5 repeats fail due to flagged degradations in 
> P2pPartitionedPutLongBenchmark: 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/Benchmark_base/builds/16|https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/Benchmark_base/builds/16#L601ed52d:5552]
> This run saw 1 out of the 5 repeats fail due to flagged degradations in 
> P2pPartitionedPutLongBenchmark: 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/Benchmark_base/builds/20]
> This run saw 4 out of the 5 repeats fail due to flagged degradations in 
> P2pPartitionedPutLongBenchmark: 
> [https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/Benchmark_base/builds/27]
> In all the above benchmarks, the other failed runs were due to EOFExceptions 
> rather than flagged degradations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
