[
https://issues.apache.org/jira/browse/IGNITE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-23240:
-----------------------------------
Description:
h1. Preface
The current implementation, based on {{{}RocksDB{}}}, is known to be way slower
than it should be. There are multiple obvious reasons for that:
* Writing into WAL +and+ memtable
* Creating unique keys for every record
* Inability to serialize data efficiently: we must build an intermediate
representation before passing data into {{{}RocksDB{}}}'s API.
h1. Benchmarks
h3. Local benchmarks
Local benchmarks ({{{}LogStorageBenchmarks{}}}) have been performed on my local
environment with fsync disabled. I got the following results:
* {{{}Logit{}}}:
{noformat}
Test write:
Log number : 1024000
Log Size : 16384
Batch Size : 100
Cost time(s) : 23.541
Total size : 16777216000
Throughput(bps) : 712680684
Throughput(rps) : 43498
Test read:
Log number : 1024000
Log Size : 16384
Batch Size : 100
Cost time(s) : 3.808
Total size : 16777216000
Throughput(bps) : 4405781512
Throughput(rps) : 268907
Test done!{noformat}
* {{{}RocksDB{}}}:
{noformat}
Test write:
Log number : 1024000
Log Size : 16384
Batch Size : 100
Cost time(s) : 178.785
Total size : 16777216000
Throughput(bps) : 93840176
Throughput(rps) : 5727
Test read:
Log number : 1024000
Log Size : 16384
Batch Size : 100
Cost time(s) : 13.572
Total size : 16777216000
Throughput(bps) : 1236163866
Throughput(rps) : 75449
Test done!{noformat}
While testing in a local environment is not optimal, it still shows a huge
improvement in writing speed (7.5x) and reading speed (3.5x). Enabling
{{fsync}} more or less equalizes writing speed, but we still expect that a
simpler log implementation would be faster due to smaller overall overhead.
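The derived figures in the benchmark output are internally consistent: Throughput(bps) is Total size divided by Cost time, and Throughput(rps) is Log number divided by Cost time. A quick sanity check of the Logit write numbers (variable names here are illustrative, not from the benchmark code):

```python
# Recompute the benchmark's derived figures from its raw inputs.
log_number = 1024000        # records written
log_size = 16384            # bytes per record
cost_write_logit = 23.541   # seconds, Logit write test
cost_write_rocks = 178.785  # seconds, RocksDB write test

total_size = log_number * log_size          # 16777216000 bytes, as reported
bps_logit = total_size / cost_write_logit   # ~712.7 MB/s, matches 712680684
rps_logit = log_number / cost_write_logit   # ~43498 records/s

# Relative write speedup quoted in the text (~7.5x):
write_speedup = cost_write_rocks / cost_write_logit
print(total_size, round(bps_logit), round(rps_logit), round(write_speedup, 1))
```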
h3. Integration testing
A benchmark with 3 servers and 1 client writing data in multiple threads shows
a throughput improvement: 34438 vs 30299.
{{{}RocksDB{}}}:
!Screenshot from 2024-09-20 10-38-53.png!
{{{}Logit{}}}:
!Screenshot from 2024-09-20 10-38-57.png!
A benchmark with single-thread insertions in embedded mode shows a throughput
improvement: 4072 vs 3739.
{{{}RocksDB{}}}:
!Screenshot from 2024-09-20 10-42-49.png!
{{{}Logit{}}}:
!Screenshot from 2024-09-20 10-43-09.png!
h1. Observations
Despite a drastic difference in log throughput, the increase in user operation
throughput is only about 10%. This means that we lose a lot of time elsewhere,
and optimizing those parts could significantly increase performance too. The
log optimizations would become more evident after that.
h1. Unsolved issues
There are multiple issues with the new log implementation; some of them have
been mentioned in IGNITE-22843:
* {{Logit}} pre-allocates _a lot_ of data on the drive. Considering that we use
the "log per partition" paradigm, it's too wasteful.
* Storing a separate log file per partition is not scalable anyway; it's too
difficult to optimize batching and {{fsync}} in this approach.
* Using the same log for all tables in a distribution zone won't really solve
the issue; the best it could do is make it {_}manageable{_}, in some sense.
h1. How Logit works, in short
Each log consists of 3 sets of files:
* "segment" files with data.
* "configuration" files with raft configuration.
* "index" files with pointers to segment and configuration files.
"segment" and "configuration" files contain chunks of data in the following
format:
|Magic header|Payload size|Payload itself|
"index" files contain the following pieces of data:
|Magic header|Log entry type (data/cfg)|offset|position|
It's a fixed-length tuple that contains a "link" to one of the data files. Each
"index" file is basically an offset table, used to resolve a "logIndex" into
real log data.
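Because index entries are fixed-length, resolving a "logIndex" reduces to plain offset arithmetic: no search is needed. A minimal sketch of the idea (the entry layout, field widths, and magic value below are illustrative assumptions, not Logit's actual on-disk format):

```python
import struct

# Hypothetical fixed-length index entry: magic (2 bytes), entry type
# (1 byte: 0 = data, 1 = configuration), offset into the data file
# (8 bytes), position (4 bytes). Big-endian, 15 bytes per entry.
ENTRY_FORMAT = ">HBQI"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)
MAGIC = 0x57AB  # placeholder value for illustration

def encode_entry(entry_type: int, offset: int, position: int) -> bytes:
    """Serialize one fixed-length index entry."""
    return struct.pack(ENTRY_FORMAT, MAGIC, entry_type, offset, position)

def lookup(index_bytes: bytes, first_log_index: int, log_index: int):
    """Resolve a logIndex into (entry_type, offset, position) by jumping
    straight to its fixed-size slot in the offset table."""
    slot = log_index - first_log_index
    magic, entry_type, offset, position = struct.unpack_from(
        ENTRY_FORMAT, index_bytes, slot * ENTRY_SIZE)
    assert magic == MAGIC, "corrupted index entry"
    return entry_type, offset, position

# Build a tiny index of three entries and resolve the middle one.
index = b"".join([
    encode_entry(0, 0, 0),
    encode_entry(0, 4096, 1),
    encode_entry(1, 8192, 2),
])
print(lookup(index, first_log_index=100, log_index=101))  # (0, 4096, 1)
```

The same principle is what makes appends cheap: writing one record appends a chunk to a "segment" file and one fixed-size entry to the "index" file.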
h1. What we should change
The list of actions we need to take to make this log fit the required
criteria includes:
*
> Ignite 3 new log storage
> ------------------------
>
> Key: IGNITE-23240
> URL: https://issues.apache.org/jira/browse/IGNITE-23240
> Project: Ignite
> Issue Type: Epic
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
> Attachments: Screenshot from 2024-09-20 10-38-53.png, Screenshot from
> 2024-09-20 10-38-57.png, Screenshot from 2024-09-20 10-42-49.png, Screenshot
> from 2024-09-20 10-43-09.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)