Hi Amit,
The compaction bottleneck is not an instantly visible limitation. In
effect it limits the total size of writes over a fairly long period of
time, because compaction is asynchronous and can be queued. That means
if compaction can't keep up with the writes, the compactions will queue
up, and Cassandra remains fully functional until it hits the "too many
open files" error or the filesystem runs out of free inodes. This can
take many days or even weeks to happen.
For the purpose of benchmarking, you may prefer to measure the max
concurrent compaction throughput instead of actually waiting for that
breaking moment. The max write throughput is a fraction of the max
concurrent compaction throughput, usually by a factor of 5 or more for a
table of non-trivial size, depending on the table size in bytes. Search
for "STCS write amplification" to understand why that's the case. That
means if you've measured a max concurrent compaction throughput of
1 GB/s, your average max insertion speed over a period of time is
probably less than 200 MB/s.
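As a back-of-the-envelope sketch of that estimate (the 1 GB/s and the
factor of 5 below are illustrative assumptions, not measurements):

```java
// Rough estimate: sustainable insert throughput is the measured
// compaction throughput divided by the write amplification factor.
// Both input numbers here are assumptions for illustration only.
public class InsertThroughputEstimate {
    static double maxInsertMBps(double compactionMBps, double writeAmplification) {
        return compactionMBps / writeAmplification;
    }

    public static void main(String[] args) {
        double compactionMBps = 1024.0; // e.g. a measured ~1 GB/s
        double amplification = 5.0;     // assumed STCS write amplification
        System.out.printf("max sustainable insert rate: ~%.0f MB/s%n",
                maxInsertMBps(compactionMBps, amplification));
    }
}
```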
If you do decide to test the compaction bottleneck in action, it's
better to measure the table size in bytes on disk rather than the
number of records. That's because not only the record count, but also
the size of partitions and the compression ratio, all have a meaningful
effect on the compaction workload. It's also worth mentioning that if
you are using the STCS strategy, which is more suitable for write-heavy
workloads, you may want to keep an eye on the SSTable data file size
distribution. Initially the compactions may not involve any large
SSTable data file, so compaction won't be a bottleneck at all. As bigger
SSTable data files are created over time, they get involved in
compactions more and more frequently. The bottleneck will only show up
(i.e. become problematic) when a sufficient number of large SSTable data
files are involved in multiple concurrent compactions, occupying all
available compactors and blocking (queuing) a larger number of
compactions involving smaller SSTable data files.
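To make it concrete how write amplification grows with table size, here
is a toy simulation (my own sketch, not Cassandra code): every memtable
flush writes one SSTable of unit size, and whenever 4 SSTables of the
same tier exist they are compacted into one SSTable of the next tier,
rewriting all their bytes. The ratio of bytes written to bytes inserted
grows with the number of tiers, i.e. with total table size.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of STCS write amplification. Each flush writes one unit;
// 4 SSTables of tier t (size 4^t each) compact into one of tier t+1,
// rewriting 4^(t+1) bytes. Amplification = total bytes written / inserted.
public class StcsAmplification {
    public static double amplification(int flushes) {
        long bytesInserted = 0, bytesWritten = 0;
        Map<Integer, Integer> tiers = new HashMap<>(); // tier -> SSTable count
        for (int i = 0; i < flushes; i++) {
            bytesInserted += 1;
            bytesWritten += 1;                 // the flush itself
            int tier = 0;
            tiers.merge(tier, 1, Integer::sum);
            // Cascade compactions up the tiers while any tier holds 4 tables.
            while (tiers.getOrDefault(tier, 0) >= 4) {
                tiers.merge(tier, -4, Integer::sum);
                long mergedSize = 4L * (long) Math.pow(4, tier);
                bytesWritten += mergedSize;    // compaction rewrites all data
                tier++;
                tiers.merge(tier, 1, Integer::sum);
            }
        }
        return (double) bytesWritten / bytesInserted;
    }

    public static void main(String[] args) {
        System.out.printf("amplification after 64 flushes:   %.2f%n", amplification(64));
        System.out.printf("amplification after 4096 flushes: %.2f%n", amplification(4096));
    }
}
```

In this model the amplification is roughly 1 + log4(N) for N flushes, so
the deeper the tier hierarchy grows, the more each inserted byte is
rewritten, which is why a small table hides the bottleneck.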
Regards,
Bowen
On 22/07/2022 11:19, Pawar, Amit wrote:
Thank you Bowen for your reply. It took me some time to respond due to
a testing issue.

I tested the multi-threaded feature again with the number of records
ranging from 260 million to 2 billion, and the improvement is still
around 80% of the Ramdisk score. It is still possible that compaction
becomes the new bottleneck, and that could be a new opportunity to fix
it. I am a newbie here, and it is possible that I failed to understand
your suggestion completely. At least this testing shows that the
multi-threading benefit is reflected in the score.

Do you think multi-threading is good to have now? Otherwise, please
suggest whether I need to test further.
Thanks,
Amit
*From:* Bowen Song via dev <dev@cassandra.apache.org>
*Sent:* Wednesday, July 20, 2022 4:13 PM
*To:* dev@cassandra.apache.org
*Subject:* Re: [DISCUSS] Improve Commitlog write path
From my past experience, the bottleneck for an insert-heavy workload is
likely to be compaction, not the commit log. You may initially see the
commit log as the bottleneck when the table size is relatively small,
but as the table size increases, compaction will likely take its place
and become the new bottleneck.
On 20/07/2022 11:11, Pawar, Amit wrote:
Hi all,
(My previous mail did not appear on the mailing list, so I am
resending it after 2 days)

I am Amit, working at AMD Bangalore, India. I am new to Cassandra
and need to do Cassandra testing on large-core systems. Usually one
should test on multi-node Cassandra, but I started with single-node
testing to understand how Cassandra scales with increasing core
counts.
Test details:
Operation: Insert > 90% (insert heavy)
Operation: Scan < 10%
Cassandra: 3.11.10 and trunk
Benchmark: TPCx-IOT (similar to YCSB)
Results show that scaling is almost linear up to 16 cores but poor
beyond that. The following common settings helped to get better
scores.
1. Memtable heap allocation: offheap_objects
2. memtable_flush_writers > 4
3. Java heap: 8-32GB with survivor ratio tuning
4. Separate storage space for Commitlog and Data.
Many online blogs suggest adding a new Cassandra node when a cluster
cannot take more writes. But with large systems, high write rates
should be easily handled thanks to the many cores, so the need was to
improve scaling with more cores, and this suggestion didn't help.
After many rounds of testing it was observed that the current
implementation uses a single thread for the commitlog syncing
activity. Commitlog files are mapped using the mmap system call and
changes are written back with msync. Observing the periodic syncing
with the JVisualVM tool shows
1. The thread is not 100% busy when Ramdisk is used for commitlog
storage, and scaling improves on large systems. Ramdisk scores are
more than 2x the NVMe score.
2. The thread becomes 100% busy when NVMe is used for the commitlog,
and the score does not improve much beyond 16 cores.
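For readers unfamiliar with this write path, here is a minimal sketch
of the mmap + msync pattern described above, using Java NIO (my own
illustration, not Cassandra's actual commitlog code): map a region of
the file, write into the mapping, and call MappedByteBuffer.force(),
which performs the msync. The force() call is where a single syncing
thread spends its time.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal mmap + msync sketch: write bytes through a memory mapping,
// then force() blocks until the dirty page reaches the disk.
public class MmapSyncSketch {
    static long writeAndSync(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping 4096 bytes extends the file to one page.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.put("mutation bytes".getBytes(StandardCharsets.UTF_8));
            map.force(); // msync: blocks until dirty pages are written back
        }
        return Files.size(path);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("commitlog-sketch", ".log");
        System.out.println("synced file size: " + writeAndSync(path) + " bytes");
        Files.deleteIfExists(path);
    }
}
```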
The Linux kernel uses 4K pages for memory mapped with the mmap system
call. So, to understand this further, disk I/O testing was done using
the FIO tool, and the results show

1. NVMe 4K random R/W throughput is very low with a single thread,
and it improves with multiple threads.
2. Ramdisk 4K random R/W throughput is good even with a single
thread, and is also better with multiple threads.
Based on the FIO test results, the following two ideas were tested for
commitlog files with the Cassandra-3.11.10 sources.

1. Enable the Direct I/O feature for commitlog files (similar to
[CASSANDRA-14466] Enable Direct I/O - ASF JIRA
<https://issues.apache.org/jira/browse/CASSANDRA-14466>)
2. Enable multi-threaded syncing for commitlog files.
The first one needs retesting. Interestingly, the second one helped
improve the score with the NVMe disk. The NVMe disk configuration
score is within 80-90% of the Ramdisk score, and 2 times that of the
single-threaded implementation. Multithreading was enabled by adding a
new thread pool in the "AbstractCommitLogSegmentManager" class and
changing the syncing thread into a manager thread for this new thread
pool, which takes care of synchronization. This was only tested with
Cassandra-3.11.10 and needs complete testing, but the change is
working in my test environment. I tried these few experiments so that
I could discuss them here and seek your valuable suggestions to
identify the right fix for insert-heavy workloads.
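A hypothetical sketch of that manager/worker idea (my own illustration
of the concept, not the actual patch; the class and method names below
are made up): the manager splits a mapped segment into ranges and hands
each range's force() (msync) to a worker pool, waiting for all workers
before the segment counts as synced.

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parallelize msync by splitting a segment into
// chunks, forcing each chunk's mapping on a separate worker thread.
public class ParallelSegmentSync {
    private final ExecutorService pool;

    public ParallelSegmentSync(int workers) {
        this.pool = Executors.newFixedThreadPool(workers);
    }

    /** Syncs [0, length) of the file as `chunks` parallel msync calls. */
    public void sync(FileChannel ch, long length, int chunks) throws Exception {
        long chunkSize = (length + chunks - 1) / chunks;
        List<Future<?>> pending = new ArrayList<>();
        for (long off = 0; off < length; off += chunkSize) {
            long size = Math.min(chunkSize, length - off);
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, off, size);
            pending.add(pool.submit(() -> map.force())); // worker msyncs its range
        }
        for (Future<?> f : pending) f.get();             // manager waits for all workers
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("segment", ".log");
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            ParallelSegmentSync syncer = new ParallelSegmentSync(4);
            syncer.sync(ch, 1 << 20, 4);  // 1 MiB segment, 4 parallel msyncs
            syncer.shutdown();
        }
        System.out.println("segment synced");
        Files.deleteIfExists(path);
    }
}
```

Whether this helps depends on the storage: as the FIO results above
suggest, NVMe benefits from queue depth, while a single thread can
already saturate Ramdisk.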
1. Is it a good idea to convert the single-threaded syncing to a
multi-threaded implementation to improve disk I/O?
2. Direct I/O throughput is high with a single thread and is a good
fit for the commitlog case due to the file sizes. This would improve
writes on small to large systems. Would it be good to bring this
support to commitlog files?
Please suggest.
Thanks,
Amit Pawar