hlteoh37 commented on PR #22901:
URL: https://github.com/apache/flink/pull/22901#issuecomment-1617830035

   ## Performance test of the `/checkpoints` API
   
   We ran the following three tests:
   1. Control test with no changes to the Flink REST API (ExecutionGraph cache of 30s)
   2. Checkpoint cache with a 3s timeout
   3. No checkpoint cache
   
   There are no significant differences in the access pattern or latency pattern of the REST API for `/checkpoints` across the three tests.
   
   We also found the following nuances:
   - `/checkpoints` periodically returns `500` on the first call after a job restart (likely because no "Quantile" statistics are available after only 1 checkpoint). **This is consistent across all 3 tests, meaning it's an independent issue.**
   - The p95 latency is spiky. It averages around 15ms but occasionally goes up to 40-50ms. **This pattern is consistent across all tests, so it is not a problem introduced by the latest change.**
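
For reference, a minimal sketch of how a per-window p95 like the one in the graphs could be derived from raw latency samples. This is not the tooling used for the test: the function names, the nearest-rank method, and the 60s window are assumptions made for illustration.

```python
import math
from collections import defaultdict


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # q is an integer percentage (e.g. 95); clamp the rank to at least 1.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p95_per_window(timed_latencies, window_s=60):
    """Group (timestamp_s, latency_ms) pairs into fixed windows; return {window_start: p95}."""
    buckets = defaultdict(list)
    for ts, latency_ms in timed_latencies:
        buckets[int(ts // window_s) * window_s].append(latency_ms)
    return {start: percentile(vals, 95) for start, vals in sorted(buckets.items())}
```

With nearest-rank, a window of 20 samples needs two slow requests before its p95 moves, so at low request volumes a handful of slow calls is enough to produce a visible spike.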
   
   
   ### Test configuration
   - Cluster setup
     - Single JM with Zookeeper HA
     - Running on K8s
     - JM resources (vCPU: 200m, Memory: 6083Mi)
     - ResourceManager, JobMaster, BlobServer all running on the same JM container.
   - Test setup
     - TPS: 10 req/s
     - Endpoint `/jobs/:jobid/checkpoints`
     - Simple Flink job reading from KDS writing to KDS
     - Checkpoint interval 1s, min pause between checkpoints 1s
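
A rough sketch of a load generator matching the test setup above (10 req/s against the checkpoints endpoint). The actual load tool used is not stated in this comment, so everything here is an assumption: the helper names `http_get` and `run_load`, the sequential request loop, and the example URL.

```python
import time
import urllib.error
import urllib.request


def http_get(url, timeout_s=5.0):
    """Return the HTTP status code of a GET; 4xx/5xx codes are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_load(url, rate_per_s=10, duration_s=60, fetch=http_get,
             clock=time.monotonic, sleep=time.sleep):
    """Issue requests one at a time at a fixed rate; return [(status, latency_ms)]."""
    interval = 1.0 / rate_per_s
    total = round(rate_per_s * duration_s)
    start = clock()
    results = []
    for i in range(total):
        # Schedule against the start time so a slow response does not shift later requests.
        delay = start + i * interval - clock()
        if delay > 0:
            sleep(delay)
        t0 = clock()
        status = fetch(url)
        results.append((status, (clock() - t0) * 1000.0))
    return results
```

A hypothetical invocation would be `run_load("http://localhost:8081/jobs/<jobid>/checkpoints")`, with the host, port, and job ID replaced by real values. Because requests are sequential, the effective rate drops below the target whenever a response takes longer than the request interval.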
   
   ### Control Test 1 (no changes to Flink REST API): ExecutionGraph cache of 3s, Job restarting constantly
   
   ![Screenshot 2023-07-03 at 11 21 01](https://github.com/apache/flink/assets/35062175/ec954690-d48e-4073-9e53-05caec210b72)
   
   
   JM CPU/memory use
   
   ```
   NAME                                         CPU(cores)   MEMORY(bytes)
   flink-jobmanager-dc5d8b5df-875x6             51m          1271Mi
   ```
   ![Screenshot 2023-07-03 at 11 20 54](https://github.com/apache/flink/assets/35062175/908a6714-ea15-4aac-a7fc-332589da2582)
   
   
   ### Test 2: 3s checkpoint cache, Job restarting constantly
   
   
   ![Screenshot 2023-07-03 at 10 50 44](https://github.com/apache/flink/assets/35062175/d49e3abf-bd80-424a-8732-4c11b8c99fa7)
   
   JM CPU/memory use
   
   ```
   NAME                                         CPU(cores)   MEMORY(bytes)
   flink-jobmanager-6797f4cdbd-xll86            70m          1237Mi
   ```
   
   
   ![Screenshot 2023-07-03 at 10 47 53](https://github.com/apache/flink/assets/35062175/ef0d1ad4-7d92-4afb-bd26-b3ed4c4c543b)
   
   
   ### Test 3: No checkpoint cache, Job restarting constantly
   
   ![Screenshot 2023-07-03 at 11 09 43](https://github.com/apache/flink/assets/35062175/be8b6f93-9310-4571-b7fa-7fbe317ad693)
   
   JM CPU/memory use
   
   ```
   NAME                                         CPU(cores)   MEMORY(bytes)
   flink-jobmanager-79f5768fdc-2t9hd            50m          1272Mi
   ```
    
   ![Screenshot 2023-07-03 at 11 09 10](https://github.com/apache/flink/assets/35062175/7aa4047e-65bf-4034-99a3-4a3a534828da)
   
   
   

