Hello Amit,

This paper may be of interest to you: https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf
We did a range of tests that are similar to your scenario and realized several things early on:

- Memory-mapping the commit log in combination with memory-mapped data or index files causes long msync delays. This can be solved by switching to a compressed log.
- For smaller mutation sizes, using a large segment size practically removes the commit log bottleneck, even with compression and even though compression is currently single-threaded.
- Write performance scaling with available CPU threads is limited by memtable congestion. Scaling can be improved by using a sharded memtable (introduced with CASSANDRA-17034).

Needing a compressed log to achieve the best write performance is not ideal, and implementing a non-compressed non-memory-mapped option is fairly easy, with or without Direct IO. If you are looking for a simple performance improvement for the commit log, this is where I would start.

Regards,
Branimir

On Tue, Jul 26, 2022 at 3:36 PM Pawar, Amit <amit.pa...@amd.com> wrote:

> Hi Bowen,
>
> Thanks for your reply. It is now clear what the benefits of this patch are. I will send it for review once it is ready, and hopefully it gets accepted.
>
> Thanks,
> Amit
>
> *From:* Bowen Song via dev <dev@cassandra.apache.org>
> *Sent:* Tuesday, July 26, 2022 5:36 PM
> *To:* dev@cassandra.apache.org
> *Subject:* Re: [DISCUSS] Improve Commitlog write path
>
> Hi Amit,
>
> That's some brilliant testing you have done there. It shows that the compaction throughput not only can be a bottleneck on the speed of insert operations, but can also stress the JVM garbage collector. As a result of GC pressure, it can cause other things, such as inserts, to fail.
>
> Your last statement is correct. The commit log change can be beneficial for atypical workloads where a large volume of data is inserted and then expired soon, for example when using TimeWindowCompactionStrategy with a short TTL. But I must point out that this kind of atypical usage is often an anti-pattern in Cassandra, as Cassandra is a database, not a queue or cache system.
>
> This, however, is not saying the commit log change should not be introduced. As others have pointed out, it comes down to a balancing act between cost and benefit, and it will depend on the code complexity and the effect the change has on typical workloads, such as CPU and JVM heap usage. After all, we should prioritise the performance and reliability of typical usage before optimising for atypical use cases.
>
> Best,
> Bowen
>
> On 26/07/2022 12:41, Pawar, Amit wrote:
>
> Hi Bowen,
>
> Thanks for the reply; it helped to identify the failure point. I tested compaction throughput with different values, and the threads active in compaction report a "java.lang.OutOfMemoryError: Map failed" error earlier with 1024 MB/s than with the other values. This shows that with lower throughput such issues are still going to come up, not immediately but over days or weeks. Test results are given below.
> > > > > +------------+-----------------------+-----------------------+-----------------+ > > | Records | Compaction Throughput | 5 large files In GB | Disk usage > (GB) | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 2000000000 | 8 | Not collected | > 500 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 2000000000 | 16 | Not collected | > 500 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 64 | 3.5,3.5,3.5,3.5,3.5 | > 273 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 128 | 3.5, 3.9,4.9,8.0, 15 | > 287 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 256 | 11,11,12,16,20 | > 359 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 512 | 14,19,23,27,28 | > 469 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 1024 | 14,18,23,27,28 | > 458 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 0 | 6.9,6.9,7.0,28,28 | > 223 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | | | > | | > > > +------------+-----------------------+-----------------------+-----------------+ > > > > Issues observed with increasing compaction throughput. > > 1. Out of memory errors > 2. Scores reduces as throughput increased > 3. Files size grows as throughput increased > 4. Insert failures are noticed > > > > After this testing, I feel that this change is beneficial for workloads > where data is not kept/left on nodes for too long. With lower throughput > large system can ingest more data. Does it make sense ? > > > > Thanks, > > Amit > > > > *From:* Bowen Song via dev <dev@cassandra.apache.org> > <dev@cassandra.apache.org> > *Sent:* Friday, July 22, 2022 4:37 PM > *To:* dev@cassandra.apache.org > *Subject:* Re: [DISCUSS] Improve Commitlog write path > > > > [CAUTION: External Email] > > Hi Amit, > > > > The compaction bottleneck is not an instantly visible limitation. It in > effect limits the total size of writes over a fairly long period of time, > because compaction is asynchronous and can be queued. That means if > compaction can't keep up with the writes, they will be queued, and > Cassandra remains fully functional until hitting the "too many open files" > error or the filesystem runs out of free inodes. This can happen over many > days or even weeks. > > For the purpose of benchmarking, you may prefer to measure the max > concurrent compaction throughput, instead of actually waiting for that > breaking moment. The max write throughput is a fraction of the max > concurrent compaction throughput, usually by a factor of 5 or more for a > non-trivial sized table, depending on the table size in bytes. Search for > "STCS write amplification" to understand why that's the case. That means if > you've measured the max concurrent compaction throughput is 1GB/s, your > average max insertion speed over a period of time is probably less than > 200MB/s. > > If you really decide to test the compaction bottleneck in action, it's > better to measure the table size in bytes on disk, rather than the number > of records. 
> That's because not only the record count, but also the partition sizes and the compression ratio, all have a meaningful effect on the compaction workload. It's also worth mentioning that if you are using the STCS strategy, which is more suitable for write-heavy workloads, you may want to keep an eye on the SSTable data file size distribution. Initially the compaction may not involve any large SSTable data files, so it won't be a bottleneck at all. As more large SSTable data files are created over time, they will get involved in compactions more and more frequently. The bottleneck will only show up (i.e. become problematic) when there is a sufficient number of large SSTable data files involved in multiple concurrent compactions, occupying all available compactors and blocking (queuing) a larger number of compactions involving smaller SSTable data files.
>
> Regards,
> Bowen
>
> On 22/07/2022 11:19, Pawar, Amit wrote:
>
> Thank you Bowen for your reply. It took some time to respond due to a testing issue.
>
> I tested the multi-threaded feature again with record counts from 260 million to 2 billion, and the improvement seen is still around 80% of the ramdisk score. It is still possible that compaction becomes the new bottleneck, which could be a new opportunity to fix it. I am a newbie here, and it is possible that I failed to understand your suggestion completely. At least with this testing, the multi-threading benefit is reflected in the score.
>
> Do you think multi-threading is good to have now? If not, please suggest whether I need to test further.
>
> Thanks,
> Amit
>
> *From:* Bowen Song via dev <dev@cassandra.apache.org>
> *Sent:* Wednesday, July 20, 2022 4:13 PM
> *To:* dev@cassandra.apache.org
> *Subject:* Re: [DISCUSS] Improve Commitlog write path
>
> From my past experience, the bottleneck for an insert-heavy workload is likely to be compaction, not the commit log. You may initially see the commit log as the bottleneck when the table size is relatively small, but as the table size increases, compaction will likely take its place and become the new bottleneck.
>
> On 20/07/2022 11:11, Pawar, Amit wrote:
>
> Hi all,
>
> (My previous mail is not appearing in the mailing list, so I am resending it after 2 days.)
>
> I am Amit, working at AMD Bangalore, India. I am new to Cassandra and need to do Cassandra testing on large-core systems. Usually one should test on multi-node Cassandra, but I started with single-node testing to understand how Cassandra scales with increasing core counts.
>
> Test details:
> Operation: Insert > 90% (insert heavy)
> Operation: Scan < 10%
> Cassandra: 3.11.10 and trunk
> Benchmark: TPCx-IoT (similar to YCSB)
>
> Results show that scaling is poor beyond 16 cores; up to that point it is almost linear. The following common settings helped to get better scores:
>
> 1. Memtable heap allocation: offheap_objects
> 2. memtable_flush_writers > 4
> 3. Java heap: 8-32GB with survivor ratio tuning
> 4. Separate storage space for commitlog and data
>
> Many online blogs suggest adding a new Cassandra node when a node is unable to take more writes. But with large systems, high write rates should be handled easily thanks to the many cores. The need was to improve scaling with more cores, so this suggestion didn't help. After many rounds of testing, it was observed that the current implementation uses a single thread for the commitlog syncing activity.
> Commitlog files are mapped using the mmap system call and changes are written with msync. Monitoring the periodic syncing with the JVisualVM tool shows:
>
> 1. The thread is not 100% busy when a ramdisk is used for commitlog storage, and scaling improved on large systems. Ramdisk scores > 2x the NVMe score.
> 2. The thread becomes 100% busy when NVMe is used for the commitlog, and the score does not improve much beyond 16 cores.
>
> The Linux kernel uses 4K pages for memory mapped with the mmap system call. So, to understand this further, disk I/O testing was done using the FIO tool, and the results show:
>
> 1. NVMe 4K random R/W throughput is very low with a single thread, and it improves with multiple threads.
> 2. Ramdisk 4K random R/W throughput is good with a single thread already, and it is also better with multiple threads.
>
> Based on the FIO test results, the following two ideas were tested for commitlog files with Cassandra-3.11.10 sources:
>
> 1. Enable the Direct IO feature for commitlog files (similar to [CASSANDRA-14466] Enable Direct I/O - ASF JIRA, https://issues.apache.org/jira/browse/CASSANDRA-14466).
> 2. Enable multi-threaded syncing for commitlog files.
>
> The first one needs to be retested. Interestingly, the second one helped to improve the score with the NVMe disk. The NVMe disk configuration score is within 80-90% of the ramdisk score and 2 times that of the single-threaded implementation. Multithreading was enabled by adding a new thread pool in the "AbstractCommitLogSegmentManager" class and changing the syncing thread into a manager thread for this new pool to take care of synchronization. This has only been tested with Cassandra-3.11.10 and needs complete testing, but the change is working in my test environment. I tried these few experiments so that I could discuss them here and seek your valuable suggestions to identify the right fix for insert-heavy workloads.
>
> 1. Is it a good idea to convert the single-threaded syncing to a multi-threaded implementation to improve the disk IO?
> 2. Direct I/O throughput is high with a single thread and is a good fit for the commitlog case due to the file sizes. This would improve writes on small to large systems. Is it good to bring this support to commitlog files?
>
> Please suggest.
>
> Thanks,
> Amit Pawar
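
For illustration, here is a minimal, self-contained sketch of the multi-threaded segment sync idea described in Amit's mail above: a small java.util.concurrent pool forces several mapped commit log segments to disk in parallel while the original sync thread only coordinates. The class and method names are illustrative only; this is not the actual patch against AbstractCommitLogSegmentManager, and a real change would also have to preserve the existing sync ordering and error handling.

    import java.nio.MappedByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Illustrative only: syncs several mapped commit log segments in parallel,
    // while the caller (the former single sync thread) just coordinates.
    public class ParallelSegmentSyncer
    {
        private final ExecutorService syncPool;

        public ParallelSegmentSyncer(int threads)
        {
            // Pool size would come from a config knob or the available core count.
            this.syncPool = Executors.newFixedThreadPool(threads);
        }

        // Force each dirty mapped segment to disk in parallel (one msync per segment)
        // and wait for all of them, so the durability guarantee of the periodic sync
        // stays the same as in the single-threaded implementation.
        public void syncAll(List<MappedByteBuffer> dirtySegments) throws Exception
        {
            List<Future<?>> pending = new ArrayList<>();
            for (MappedByteBuffer segment : dirtySegments)
            {
                pending.add(syncPool.submit(() -> { segment.force(); }));
            }
            for (Future<?> f : pending)
            {
                f.get(); // coordinating thread blocks until every segment is durable
            }
        }

        public void shutdown()
        {
            syncPool.shutdown();
        }
    }

Whether parallel syncing actually helps depends on the device, which matches the FIO observation in the thread: NVMe benefits from the higher queue depth of several concurrent msync calls, whereas a ramdisk is already fast enough with a single thread.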