[
https://issues.apache.org/jira/browse/IGNITE-22878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-22878:
-----------------------------------
Description:
h1. Results
I'm putting the conclusions right here at the top, because comments are easy to miss.
* The main reason for the performance dips is that the Raft log and the table data are located on the same storage device. We should test a configuration where they are separated.
* The {{rocksdb}}-based log storage adds minor issues during its flushes and compactions; they can cause 10-20% dips. This is not critical, but it once again shows the downsides of the current implementation.
Reducing the number of threads that write SST files and run compactions doesn't seem to change anything, although it's hard to say precisely. This part is not configurable at the moment, but I would investigate separately whether it makes sense to set those values to 1 (see the RocksDB options sketch after this list).
* Nothing really changes when you disable fsync.
* Table data checkpoints and compaction have the most impact. For some reason the first checkpoint hurts performance the most, possibly due to some kind of warm-up effect.
Making checkpoints more frequent smooths the graph out a little.
Reducing the number of checkpoint threads and compaction threads also smooths the graph out, and the effect is more visible. Checkpoints obviously become longer, but they still don't overlap in single-put KV tests, even under high load.
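For reference, below is a minimal sketch of the RocksDB options that control the background flush/compaction parallelism mentioned above. It only illustrates the knobs involved; it is not how the Ignite raft-log storage actually wires them, and the database path and key are placeholders.
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class LogStorageThreadsSketch {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        // Illustration only: limit background flush/compaction parallelism to a single thread.
        try (Options options = new Options()
                .setCreateIfMissing(true)
                .setMaxBackgroundJobs(1)     // one background thread shared by flushes and compactions
                .setMaxSubcompactions(1);    // no parallel sub-compactions within a single compaction
             RocksDB db = RocksDB.open(options, "/tmp/raft-log-sketch")) {
            db.put("key".getBytes(), "value".getBytes());
        }
    }
}
{code}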
h1. Description
Build under test: Ignite 3, rev. 1e8959c0a000f0901085eb0b11b37db4299fa72a
Benchmark:
[https://github.com/gridgain/YCSB/blob/ycsb-2024.14/ignite3/src/main/java/site/ycsb/db/ignite3/IgniteClient.java]
h1. Test environment
6 AWS VMs of type c5d.4xlarge:
* 16 vCPUs
* 32 GiB memory
* 1 x 400 GB NVMe SSD instance storage
* Network: up to 10 Gbps
h1. Test
Start 3 Ignite nodes (one node per host) with the following configuration (a sketch of how the partition/replica settings map onto a distribution zone follows this list):
* raft.fsync=false
* partitions=16
* replicas=1
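As an illustration, the {{partitions}}/{{replicas}} pair corresponds to a distribution zone in Ignite 3. The sketch below shows how such a zone could be declared through the client SQL API; the zone name, the exact DDL options, and the {{client.sql().execute(...)}} entry point are assumptions about recent Ignite 3 builds, not commands taken from the test scripts.
{code:java}
import org.apache.ignite.client.IgniteClient;

public class ZoneSetupSketch {
    public static void main(String[] args) throws Exception {
        // Sketch only: declare a zone with the tested partition/replica counts.
        try (IgniteClient client = IgniteClient.builder()
                .addresses("192.168.208.221:10800")
                .build()) {
            client.sql().execute(null,
                    "CREATE ZONE IF NOT EXISTS ycsb_zone WITH PARTITIONS=16, REPLICAS=1");
        }
    }
}
{code}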
Start 3 YCSB clients (one client per host). Each YCSB client spawns 32 load threads and works with its own key range: the 15,300,000-key space is split into three disjoint 5,100,000-key slices, and each client loads 5,000,000 records from its slice (see {{insertstart}} and {{insertcount}} below). Parameters (a minimal sketch of the single-put path these clients exercise follows the parameter list):
* Client 1: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
status.interval=1 -p partitions=16 -p insertstart=5100000 -p
insertcount=5000000 -s}}
* Client 2: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
status.interval=1 -p partitions=16 -p insertstart=0 -p insertcount=5000000 -s}}
* Client 3: {{-db site.ycsb.db.ignite3.IgniteClient -load -P /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p status.interval=1 -p partitions=16 -p insertstart=10200000 -p insertcount=5000000 -s}}
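For context, each load thread boils down to single-put KV operations against the cluster. Below is a minimal sketch of that path through the Ignite 3 thin client; the table name and the key/field column names follow the usual YCSB conventions and are assumptions here, not taken from the binding's source.
{code:java}
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.table.KeyValueView;
import org.apache.ignite.table.Tuple;

public class SinglePutSketch {
    public static void main(String[] args) throws Exception {
        try (IgniteClient client = IgniteClient.builder()
                .addresses("192.168.208.221:10800", "192.168.210.120:10800", "192.168.211.201:10800")
                .build()) {
            // Tuple-based key-value view over the (assumed) YCSB table.
            KeyValueView<Tuple, Tuple> kv = client.tables().table("usertable").keyValueView();

            Tuple key = Tuple.create().set("ycsb_key", "user5100000");
            Tuple value = Tuple.create().set("field0", "payload");

            // The operation whose latency periodically sinks: a single KV put in an implicit transaction.
            kv.put(null, key, value);
        }
    }
}
{code}
In the actual benchmark, the YCSB binding drives this path from 32 threads on each of the 3 client hosts.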
h1. Results
Results from each client are in separate files (attached).
From these files we can draw transactions-per-second graphs:
!cl1.png!!cl2.png!!cl3.png!
Note the periodic sinks in these graphs; their cause needs to be investigated.
was:
h1. Results
I'm putting the conclusions right here at the top, because comments are easy to miss.
* The main reason for the performance dips is that the Raft log and the table data are located on the same storage device. We should test a configuration where they are separated.
* The {{rocksdb}}-based log storage adds minor issues during its flushes and compactions; they can cause 10-20% dips. This is not critical, but it once again shows the downsides of the current implementation.
Reducing the number of threads that write SST files and run compactions doesn't seem to change anything, although it's hard to say precisely. This part is not configurable at the moment, but I would investigate separately whether it makes sense to set those values to 1.
* Table data checkpoints and compaction have the most impact. For some reason the first checkpoint hurts performance the most, possibly due to some kind of warm-up effect.
Making checkpoints more frequent smooths the graph out a little.
Reducing the number of checkpoint threads and compaction threads also smooths the graph out, and the effect is more visible. Checkpoints obviously become longer, but they still don't overlap in single-put KV tests, even under high load.
h1. Description
Build under test: Ignite 3, rev. 1e8959c0a000f0901085eb0b11b37db4299fa72a
Benchmark:
[https://github.com/gridgain/YCSB/blob/ycsb-2024.14/ignite3/src/main/java/site/ycsb/db/ignite3/IgniteClient.java]
h1. Test environment
6 AWS VMs of type c5d.4xlarge:
* 16 vCPUs
* 32 GiB memory
* 1 x 400 GB NVMe SSD instance storage
* Network: up to 10 Gbps
h1. Test
Start 3 Ignite nodes (one node per host). Configuration:
* raft.fsync=false
* partitions=16
* replicas=1
Start 3 YCSB clients (one client per host). Each YCSB client spawns 32 load threads and works with its own key range. Parameters:
* Client 1: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
status.interval=1 -p partitions=16 -p insertstart=5100000 -p
insertcount=5000000 -s}}
* Client 2: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
/opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
-p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
status.interval=1 -p partitions=16 -p insertstart=0 -p insertcount=5000000 -s}}
* Client 3: {{-db site.ycsb.db.ignite3.IgniteClient -load -P /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p status.interval=1 -p partitions=16 -p insertstart=10200000 -p insertcount=5000000 -s}}
h1. Results
Results from each client are in separate files (attached).
From these files we can draw transactions-per-second graphs:
!cl1.png!!cl2.png!!cl3.png!
Note the periodic sinks in these graphs; their cause needs to be investigated.
> Periodic latency sinks on key-value KeyValueView#put
> ----------------------------------------------------
>
> Key: IGNITE-22878
> URL: https://issues.apache.org/jira/browse/IGNITE-22878
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 3.0.0-beta2
> Reporter: Ivan Artiukhov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Attachments: 2024-08-01-11-36-02_192.168.208.148_kv_load.txt,
> 2024-08-01-11-36-02_192.168.209.141_kv_load.txt,
> 2024-08-01-11-36-02_192.168.209.191_kv_load.txt, cl1.png, cl2.png, cl3.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> h1. Results
> I'm putting the conclusions right here at the top, because comments are easy to miss.
> * The main reason for the performance dips is that the Raft log and the table data are located on the same storage device. We should test a configuration where they are separated.
> * The {{rocksdb}}-based log storage adds minor issues during its flushes and compactions; they can cause 10-20% dips. This is not critical, but it once again shows the downsides of the current implementation.
> Reducing the number of threads that write SST files and run compactions doesn't seem to change anything, although it's hard to say precisely. This part is not configurable at the moment, but I would investigate separately whether it makes sense to set those values to 1.
> * Nothing really changes when you disable fsync.
> * Table data checkpoints and compaction have the most impact. For some reason the first checkpoint hurts performance the most, possibly due to some kind of warm-up effect.
> Making checkpoints more frequent smooths the graph out a little.
> Reducing the number of checkpoint threads and compaction threads also smooths the graph out, and the effect is more visible. Checkpoints obviously become longer, but they still don't overlap in single-put KV tests, even under high load.
> h1. Description
> Build under test: Ignite 3, rev. 1e8959c0a000f0901085eb0b11b37db4299fa72a
> Benchmark:
> [https://github.com/gridgain/YCSB/blob/ycsb-2024.14/ignite3/src/main/java/site/ycsb/db/ignite3/IgniteClient.java]
>
> h1. Test environment
> 6 AWS VMs of type c5d.4xlarge:
> * 16 vCPUs
> * 32 GiB memory
> * 1 x 400 GB NVMe SSD instance storage
> * Network: up to 10 Gbps
> h1. Test
> Start 3 Ignite nodes (one node per host). Configuration:
> * raft.fsync=false
> * partitions=16
> * replicas=1
> Start 3 YCSB clients (one client per host). Each YCSB client spawns 32 load threads and works with its own key range. Parameters:
> * Client 1: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
> /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
> hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
> -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
> status.interval=1 -p partitions=16 -p insertstart=5100000 -p
> insertcount=5000000 -s}}
> * Client 2: {{-db site.ycsb.db.ignite3.IgniteClient -load -P
> /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p
> hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000
> -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p
> status.interval=1 -p partitions=16 -p insertstart=0 -p insertcount=5000000
> -s}}
> * Client 3: {{-db site.ycsb.db.ignite3.IgniteClient -load -P /opt/pubagent/poc/config/ycsb/workloads/workloadc -threads 32 -p hosts=192.168.208.221,192.168.210.120,192.168.211.201 -p recordcount=15300000 -p warmupops=100000 -p dataintegrity=true -p measurementtype=timeseries -p status.interval=1 -p partitions=16 -p insertstart=10200000 -p insertcount=5000000 -s}}
> h1. Results
> Results from each client are in separate files (attached).
> From these files we can draw transactions-per-second graphs:
> !cl1.png!!cl2.png!!cl3.png!
> Note the periodic sinks in these graphs; their cause needs to be investigated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)