[ 
https://issues.apache.org/jira/browse/IGNITE-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-22878:
-----------------------------------
    Description: 
h1. Results (summary)

Putting the findings right here at the top, because comments are easily missed.
 * The main cause of the performance dips is that the Raft log and the table data are located on the same storage device. We should test a configuration where they are separated.
 * {{rocksdb}}-based log storage adds minor issues during its flushes and compactions, which may cause 10-20% dips. It's not too critical, but it once again shows the downsides of the current implementation.
Reducing the number of threads that write SST files and compact them doesn't seem to change anything, although it's hard to say precisely. This part is not configurable; whether it would make sense to set those values to 1 should be investigated separately (see the sketch after this list).
 * Nothing really changes when fsync is disabled.
 * Table data checkpoints and compaction have the most impact. For some reason, the first checkpoint hurts performance the most, maybe due to some kind of warmup.
Making checkpoints more frequent helps smooth out the graph a little.
Reducing the number of checkpoint threads and compaction threads also helps smooth out the graph, with a more visible effect. Checkpoints become longer, obviously, but they still don't overlap in single-put KV tests, even under high load.
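
For illustration, a minimal RocksJava sketch of what pinning background flush (SST writing) and compaction work to a single thread could look like. This is not the actual Ignite log-storage code; {{setMaxBackgroundJobs}} is standard RocksDB Java API, but the wiring into Ignite is the hypothetical part:

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class SingleThreadedLogStorage {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        try (Options options = new Options()
                .setCreateIfMissing(true)
                // One shared background thread for flushes (SST file writes)
                // and compactions, instead of the default thread pool.
                .setMaxBackgroundJobs(1);
             RocksDB db = RocksDB.open(options, "/tmp/raft-log-test")) {
            db.put("key".getBytes(), "value".getBytes());
        }
    }
}
{code}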

What's implemented in the current JIRA:
 * Basic logging of {{rocksdb}} compaction.
 * Basic logging of aipersist compaction, to be expanded in https://issues.apache.org/jira/browse/IGNITE-23056.
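
For reference, a sketch of how flush/compaction events can be surfaced through the RocksJava listener API; the actual logging added in this JIRA lives inside Ignite's log storage and may differ:

{code:java}
import java.util.List;
import org.rocksdb.AbstractEventListener;
import org.rocksdb.CompactionJobInfo;
import org.rocksdb.FlushJobInfo;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class CompactionLogging {
    public static Options optionsWithCompactionLogging() {
        RocksDB.loadLibrary();

        return new Options()
                .setCreateIfMissing(true)
                .setListeners(List.of(new AbstractEventListener() {
                    @Override
                    public void onFlushCompleted(RocksDB db, FlushJobInfo info) {
                        // A flush writes a memtable out as a new SST file.
                        System.out.println("rocksdb flush completed: " + info.getFilePath());
                    }

                    @Override
                    public void onCompactionCompleted(RocksDB db, CompactionJobInfo info) {
                        // Fired after each background compaction job finishes.
                        System.out.println("rocksdb compaction completed, output level "
                                + info.outputLevel());
                    }
                }));
    }
}
{code}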

h1. Description

Build under test: Ignite 3, rev. 1e8959c0a000f0901085eb0b11b37db4299fa72a

Benchmark: 
[https://github.com/gridgain/YCSB/blob/ycsb-2024.14/ignite3/src/main/java/site/ycsb/db/ignite3/IgniteClient.java]
 
h1. Test environment

6 AWS VMs of type c5d.4xlarge:
 * vCPU: 16
 * Memory: 32 GiB
 * Storage: 400 GB NVMe SSD
 * Network: up to 10 Gbps

h1. Test

Start 3 Ignite nodes (one node per host). Configuration:
 * raft.fsync=false
 * partitions=16
 * replicas=1
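
For context, {{partitions}} and {{replicas}} map to distribution-zone settings. A minimal sketch of the equivalent setup through the Ignite 3 Java client (the zone name is hypothetical, and exact DDL/API details vary between Ignite 3 builds):

{code:java}
import org.apache.ignite.client.IgniteClient;

public class CreateZone {
    public static void main(String[] args) {
        try (IgniteClient client = IgniteClient.builder()
                .addresses("192.168.208.221:10800")
                .build()) {
            // 16 partitions, 1 replica, matching the benchmark configuration.
            client.sql().execute(null,
                    "CREATE ZONE IF NOT EXISTS ycsb WITH PARTITIONS=16, REPLICAS=1");
        }
    }
}
{code}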

Start 3 YCSB clients (one client per host). Each YCSB client spawns 32 load threads and works with its own key range. Parameters:
 * Client 1: {{-db site.ycsb.db.ignite3.IgniteClient -load -P 
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p 
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p 
status.interval=1 -p partitions=16 -p insertstart=5100000 -p 
insertcount=5000000 -s}}
 * Client 2: {{-db site.ycsb.db.ignite3.IgniteClient -load -P 
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p 
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p 
status.interval=1 -p partitions=16 -p insertstart=0 -p insertcount=5000000 -s}}
 * Client 3: {{-db site.ycsb.db.ignite3.IgniteClient -load -P /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p status.interval=1 -p partitions=16 -p insertstart=10200000 -p insertcount=5000000 -s}}
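
Under the hood, each YCSB insert in this workload is a single put on a key-value view, which is the operation the latency sinks were observed on. A rough sketch (assuming YCSB's default {{usertable}}; column names are illustrative):

{code:java}
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.table.KeyValueView;
import org.apache.ignite.table.Tuple;

public class SinglePut {
    public static void main(String[] args) {
        try (IgniteClient client = IgniteClient.builder()
                .addresses("192.168.208.221:10800",
                           "192.168.210.120:10800",
                           "192.168.211.201:10800")
                .build()) {
            KeyValueView<Tuple, Tuple> kv =
                    client.tables().table("usertable").keyValueView();

            // The hot path measured by the benchmark: one implicit-transaction put.
            kv.put(null,
                    Tuple.create().set("ycsb_key", "user5100000"),
                    Tuple.create().set("field0", "value0"));
        }
    }
}
{code}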

h1. Results

Results from each client are in separate files (attached).

From these files we can plot transactions-per-second graphs:

!cl1.png!!cl2.png!!cl3.png!

Take a look at these sinks; we need to investigate their cause.

> Periodic latency sinks on key-value KeyValueView#put
> ----------------------------------------------------
>
>                 Key: IGNITE-22878
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22878
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 3.0.0-beta2
>            Reporter: Ivan Artiukhov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: 2024-08-01-11-36-02_192.168.208.148_kv_load.txt, 
> 2024-08-01-11-36-02_192.168.209.141_kv_load.txt, 
> 2024-08-01-11-36-02_192.168.209.191_kv_load.txt, cl1.png, cl2.png, cl3.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
