RE: [DISCUSS] Improve Commitlog write path

Pawar, Amit Tue, 26 Jul 2022 05:36:42 -0700

[Public]

Hi Bowen,

Thanks for your reply. Now it is clear that what are some benefits of this 
patch. I will send it for review once it is ready and hopefully it gets 
accepted.

Thanks,
Amit

From: Bowen Song via dev <[email protected]>
Sent: Tuesday, July 26, 2022 5:36 PM
To: [email protected]
Subject: Re: [DISCUSS] Improve Commitlog write path

[CAUTION: External Email]

Hi Amit,

That's some brilliant tests you have done there. It shows that the compaction 
throughput not only can be a bottleneck on the speed of insert operations, but 
it can also stress the JVM garbage collector. As a result of GC pressure, it 
can cause other things, such as insert, to fail.

Your last statement is correct. The commit log change can be beneficial for 
atypical workloads where large volume of data is getting inserted and then 
expired soon, for example when using the TimeWindowCompactionStrategy with 
short TTL. But I must point out that this kind of atypical usage is often an 
anti-pattern in Cassandra, as Cassandra is a database, not a queue or cache 
system.

This, however, is not saying the commit log change should not be introduced. As 
others have pointed out, it's down to a balancing act between the cost and 
benefit, and it will depend on the code complexity and the effect it has on 
typical workload, such as CPU and JVM heap usage. After all, we should 
prioritise the performance and reliability of typical usage before optimising 
for atypical use cases.

Best,
Bowen
On 26/07/2022 12:41, Pawar, Amit wrote:

[Public]

Hi Bowen,

Thanks for the reply and it helped to identify the failure point. Tested 
compaction throughput with different values and threads active in compaction 
reports "java.lang.OutOfMemoryError: Map failed" error with 1024 MB/s earlier 
compared to other values. This shows with lower throughput such issues are 
going to come up not immediately but in days or weeks. Test results are given 
below.

+------------+-----------------------+-----------------------+-----------------+
| Records    | Compaction Throughput | 5 large files In GB   | Disk usage (GB) |
+------------+-----------------------+-----------------------+-----------------+
| 2000000000 | 8                     | Not collected         | 500             |
+------------+-----------------------+-----------------------+-----------------+
| 2000000000 | 16                    | Not collected         | 500             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 64                    | 3.5,3.5,3.5,3.5,3.5   | 273             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 128                   | 3.5, 3.9,4.9,8.0, 15  | 287             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 256                   | 11,11,12,16,20        | 359             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 512                   | 14,19,23,27,28        | 469             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 1024                  | 14,18,23,27,28        | 458             |
+------------+-----------------------+-----------------------+-----------------+
| 900000000  | 0                     | 6.9,6.9,7.0,28,28     | 223             |
+------------+-----------------------+-----------------------+-----------------+
|            |                       |                       |                 |
+------------+-----------------------+-----------------------+-----------------+

Issues observed with increasing compaction throughput.

  1.  Out of memory errors
  2.  Scores reduces as throughput increased
  3.  Files size grows as throughput increased
  4.  Insert failures are noticed

After this testing, I feel that this change is beneficial for workloads where 
data is not kept/left on nodes for too long. With lower throughput large system 
can ingest more data. Does it make sense ?

Thanks,
Amit

From: Bowen Song via dev 
<[email protected]><mailto:[email protected]>
Sent: Friday, July 22, 2022 4:37 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: [DISCUSS] Improve Commitlog write path

[CAUTION: External Email]

Hi Amit,

The compaction bottleneck is not an instantly visible limitation. It in effect 
limits the total size of writes over a fairly long period of time, because 
compaction is asynchronous and can be queued. That means if compaction can't 
keep up with the writes, they will be queued, and Cassandra remains fully 
functional until hitting the "too many open files" error or the filesystem runs 
out of free inodes. This can happen over many days or even weeks.
For the purpose of benchmarking, you may prefer to measure the max concurrent 
compaction throughput, instead of actually waiting for that breaking moment. 
The max write throughput is a fraction of the max concurrent compaction 
throughput, usually by a factor of 5 or more for a non-trivial sized table, 
depending on the table size in bytes. Search for "STCS write amplification" to 
understand why that's the case. That means if you've measured the max 
concurrent compaction throughput is 1GB/s, your average max insertion speed 
over a period of time is probably less than 200MB/s.

If you really decide to test the compaction bottleneck in action, it's better 
to measure the table size in bytes on disk, rather than the number of records. 
That's because not only the record count, but also the size of partitions and 
compression ratio, all have meaningful effect on the compaction workload. It's 
also worth mentioning that if using the STCS strategy, which is more suitable 
for write heavy workload, you may want to keep an eye on the SSTable data file 
size distribution. Initially the compaction may not involve any large SSTable 
data file, so it won't be a bottleneck at all. As more bigger SSTable data 
files are created over time, they will get involved in compactions more and 
more frequently. The bottleneck will only shows up (i.e. become problematic) 
when there's sufficient number of large SSTable data files involved in multiple 
concurrent compactions, occupying all available compactors and blocks (queuing) 
a larger number of compactions involving smaller SSTable data files.

Regards,

Bowen

On 22/07/2022 11:19, Pawar, Amit wrote:

[Public]

Thank you Bowen for your reply. Took some time to respond due to testing issue.

I tested again multi-threaded feature with number of records from 260 million 
to 2 billion and still improvement is seen around 80% of Ramdisk score. It is 
still possible that compaction can become new bottleneck and could be new 
opportunity to fix it. I am newbie here and possible that I failed to 
understand your suggestion completely.  At-least with this testing 
multi-threading benefit is reflecting in score.

Do you think multi-threading is good to have now ? else please suggest if I 
need to test further.

Thanks,
Amit

From: Bowen Song via dev 
<[email protected]><mailto:[email protected]>
Sent: Wednesday, July 20, 2022 4:13 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: [DISCUSS] Improve Commitlog write path

[CAUTION: External Email]

>From my past experience, the bottleneck for insert heavy workload is likely to 
>be compaction, not commit log. You initially may see commit log as the 
>bottleneck when the table size is relatively small, but as the table size 
>increases, compaction will likely take its place and become the new bottleneck.
On 20/07/2022 11:11, Pawar, Amit wrote:

[Public]

Hi all,

(My previous mail is not appearing in mailing list and resending again after 2 
days)

Myself Amit and working at AMD Bangalore, India. I am new to Cassandra and need 
to do Cassandra testing on large core systems. Usually should test on 
multi-nodes Cassandra but started with Single node testing to understand how 
Cassandra scales with increasing core counts.

Test details:
Operation: Insert > 90% (insert heavy)
Operation: Scan < 10%
Cassandra: 3.11.10 and trunk
Benchmark: TPCx-IOT (similar to YCSB)

Results shows scaling is poor beyond 16 cores and it is almost linear. 
Following settings are the common settings helped to get the better scores.

1.       Memtable heap allocation: offheap_objects

2.       memtable_flush_writers > 4

3.       Java heap: 8-32GB with survivor ratio tuning

4.       Separate storage space for Commitlog and Data.

Many online blogs suggest to add new Cassandra node when unable to take high 
writes. But with large systems, high writes should be easily taken due to many 
cores. Need was to improve the scaling with more cores so this suggestion 
didn't help. After many rounds of testing it was observed that current 
implementation uses single thread for Commitlog syncing activity. Commitlog 
files are mapped using mmap system call and changes are written with msync. 
Periodic syncing with JVisualvm tool shows

1.       thread is not 100% busy with Ramdisk usage for Commitlog storage and 
scaling improved on large systems. Ramdisk scores > 2 X NVME score.

2.       thread becomes 100% busy with NVME usage for Commiglog and score does 
not improve much beyond 16 cores.

Linux kernel uses 4K pages for mapped memory with mmap system call. So, to 
understand this further, disk I/O testing was done using FIO tool and results 
shows

1.       NVME 4K random R/W throughput is very less with single thread and it 
improves with multi-threaded.

2.       Ramdisk 4K random R/W throughput is good with single thread only and 
also better with multi-threaded

Based on the FIO test results following two ideas were tested for Commitlog 
files with Cassandra-3.1.10 sources.

1.       Enable Direct IO feature for Commitlog files (similar to  
[CASSANDRA-14466] Enable Direct I/O - ASF JIRA 
(apache.org)<https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FCASSANDRA-14466&data=05%7C01%7CAmit.Pawar%40amd.com%7C9a32196e4eca4910d76008da6eff3695%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637944339660848831%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=2DQ65QUQa0grzbndirszS1U9NSIrt12JE2AiGM20lsw%3D&reserved=0>
 )

2.       Enable Multi-threaded syncing for Commitlog files.

First one need to retest. Interestingly second one helped to improve the score 
with "NVME" disk. NVME disk configuration score is almost within 80-90% of 
ramdisk and 2 times of single threaded implementation. Multithreading enabled 
by adding new thread pool in "AbstractCommitLogSegmentManager" class and 
changed syncing thread as manager thread for this new thread pool to take care 
synchronization. Only tested with Cassandra-3.11.10 and needs complete testing 
but this change is working in my test environment. Tried these few experiments 
so that I could discuss here and seek your valuable suggestions to identify the 
right fix for insert heavy workloads.

1.       Is it good idea to convert single threaded syncing to multi-threading 
implementation to improve the disk IO?

2.       Direct I/O throughput is high with single thread and best fit for 
Commitlog case due to file size. This will improve writes on small to large 
systems. Good to bring this support for Commitlog files?

Please suggest.

Thanks,
Amit Pawar

RE: [DISCUSS] Improve Commitlog write path

Reply via email to