Hello Amit,

This paper may be of interest to you: https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf
We did a range of tests that are similar to your scenario and realized several things early on:

- Memory-mapping the commit log in combination with memory-mapped data or index files causes long msync delays. This can be solved by switching to a compressed log.
- For smaller mutation sizes, using a large segment size practically removes the commit log bottleneck, even with compression and even though compression is currently single-threaded.
- Write performance scaling with available CPU threads is limited by memtable congestion. Scaling can be improved by using a sharded memtable (introduced with CASSANDRA-17034).

Needing a compressed log to achieve the best write performance is not ideal, and implementing a non-compressed non-memory-mapped option is fairly easy, with or without Direct IO. If you are looking for a simple performance improvement for the commit log, this is where I would start.

Regards,
Branimir

On Tue, Jul 26, 2022 at 3:36 PM Pawar, Amit <amit.pa...@amd.com> wrote:

> Hi Bowen,
>
> Thanks for your reply. It is now clear what the benefits of this patch are. I will send it for review once it is ready, and hopefully it gets accepted.
>
> Thanks,
> Amit
>
> *From:* Bowen Song via dev <dev@cassandra.apache.org>
> *Sent:* Tuesday, July 26, 2022 5:36 PM
> *To:* dev@cassandra.apache.org
> *Subject:* Re: [DISCUSS] Improve Commitlog write path
>
> Hi Amit,
>
> That's some brilliant testing you have done there. It shows that the compaction throughput not only can be a bottleneck on the speed of insert operations, but can also stress the JVM garbage collector. As a result of GC pressure, it can cause other things, such as inserts, to fail.
>
> Your last statement is correct. The commit log change can be beneficial for atypical workloads where a large volume of data is inserted and then expired soon, for example when using TimeWindowCompactionStrategy with a short TTL. But I must point out that this kind of atypical usage is often an anti-pattern in Cassandra, as Cassandra is a database, not a queue or cache system.
>
> This, however, is not saying the commit log change should not be introduced. As others have pointed out, it comes down to a balancing act between cost and benefit, and it will depend on the code complexity and the effect the change has on typical workloads, such as CPU and JVM heap usage. After all, we should prioritise the performance and reliability of typical usage before optimising for atypical use cases.
>
> Best,
> Bowen
>
> On 26/07/2022 12:41, Pawar, Amit wrote:
>
> Hi Bowen,
>
> Thanks for the reply; it helped to identify the failure point. I tested compaction throughput with different values, and the threads active in compaction report a "java.lang.OutOfMemoryError: Map failed" error earlier with 1024 MB/s than with the other values. This shows that with lower throughput such issues are still going to come up, not immediately but over days or weeks. Test results are given below.
> > > > > +------------+-----------------------+-----------------------+-----------------+ > > | Records | Compaction Throughput | 5 large files In GB | Disk usage > (GB) | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 2000000000 | 8 | Not collected | > 500 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 2000000000 | 16 | Not collected | > 500 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 64 | 3.5,3.5,3.5,3.5,3.5 | > 273 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 128 | 3.5, 3.9,4.9,8.0, 15 | > 287 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 256 | 11,11,12,16,20 | > 359 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 512 | 14,19,23,27,28 | > 469 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 1024 | 14,18,23,27,28 | > 458 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | 900000000 | 0 | 6.9,6.9,7.0,28,28 | > 223 | > > > +------------+-----------------------+-----------------------+-----------------+ > > | | | > | | > > > +------------+-----------------------+-----------------------+-----------------+ > > > > Issues observed with increasing compaction throughput. > > 1. Out of memory errors > 2. Scores reduces as throughput increased > 3. Files size grows as throughput increased > 4. Insert failures are noticed > > > > After this testing, I feel that this change is beneficial for workloads > where data is not kept/left on nodes for too long. With lower throughput > large system can ingest more data. Does it make sense ? > > > > Thanks, > > Amit > > > > *From:* Bowen Song via dev <dev@cassandra.apache.org> > <dev@cassandra.apache.org> > *Sent:* Friday, July 22, 2022 4:37 PM > *To:* dev@cassandra.apache.org > *Subject:* Re: [DISCUSS] Improve Commitlog write path > > > > [CAUTION: External Email] > > Hi Amit, > > > > The compaction bottleneck is not an instantly visible limitation. It in > effect limits the total size of writes over a fairly long period of time, > because compaction is asynchronous and can be queued. That means if > compaction can't keep up with the writes, they will be queued, and > Cassandra remains fully functional until hitting the "too many open files" > error or the filesystem runs out of free inodes. This can happen over many > days or even weeks. > > For the purpose of benchmarking, you may prefer to measure the max > concurrent compaction throughput, instead of actually waiting for that > breaking moment. The max write throughput is a fraction of the max > concurrent compaction throughput, usually by a factor of 5 or more for a > non-trivial sized table, depending on the table size in bytes. Search for > "STCS write amplification" to understand why that's the case. That means if > you've measured the max concurrent compaction throughput is 1GB/s, your > average max insertion speed over a period of time is probably less than > 200MB/s. > > If you really decide to test the compaction bottleneck in action, it's > better to measure the table size in bytes on disk, rather than the number > of records. 
> That's because not only the record count, but also the partition sizes and the compression ratio, all have a meaningful effect on the compaction workload. It's also worth mentioning that if you are using the STCS strategy, which is more suitable for write-heavy workloads, you may want to keep an eye on the SSTable data file size distribution. Initially the compaction may not involve any large SSTable data files, so it won't be a bottleneck at all. As more large SSTable data files are created over time, they will get involved in compactions more and more frequently. The bottleneck will only show up (i.e. become problematic) when there is a sufficient number of large SSTable data files involved in multiple concurrent compactions, occupying all available compactors and blocking (queuing) a larger number of compactions involving smaller SSTable data files.
>
> Regards,
> Bowen
>
> On 22/07/2022 11:19, Pawar, Amit wrote:
>
> Thank you Bowen for your reply. It took some time to respond due to a testing issue.
>
> I tested the multi-threaded feature again with record counts from 260 million to 2 billion, and the improvement seen is still around 80% of the ramdisk score. It is still possible that compaction becomes the new bottleneck, which could be a new opportunity to fix it. I am a newbie here, and it is possible that I failed to understand your suggestion completely. At least with this testing, the multi-threading benefit is reflected in the score.
>
> Do you think multi-threading is good to have now? If not, please suggest whether I need to test further.
>
> Thanks,
> Amit
>
> *From:* Bowen Song via dev <dev@cassandra.apache.org>
> *Sent:* Wednesday, July 20, 2022 4:13 PM
> *To:* dev@cassandra.apache.org
> *Subject:* Re: [DISCUSS] Improve Commitlog write path
>
> From my past experience, the bottleneck for an insert-heavy workload is likely to be compaction, not the commit log. You may initially see the commit log as the bottleneck when the table size is relatively small, but as the table size increases, compaction will likely take its place and become the new bottleneck.
>
> On 20/07/2022 11:11, Pawar, Amit wrote:
>
> Hi all,
>
> (My previous mail is not appearing in the mailing list, so I am resending it after 2 days.)
>
> I am Amit, working at AMD Bangalore, India. I am new to Cassandra and need to do Cassandra testing on large-core systems. Usually one should test on multi-node Cassandra, but I started with single-node testing to understand how Cassandra scales with increasing core counts.
>
> Test details:
> Operation: Insert > 90% (insert heavy)
> Operation: Scan < 10%
> Cassandra: 3.11.10 and trunk
> Benchmark: TPCx-IoT (similar to YCSB)
>
> Results show that scaling is poor beyond 16 cores; up to that point it is almost linear. The following common settings helped to get better scores:
>
> 1. Memtable heap allocation: offheap_objects
> 2. memtable_flush_writers > 4
> 3. Java heap: 8-32GB with survivor ratio tuning
> 4. Separate storage space for commitlog and data
>
> Many online blogs suggest adding a new Cassandra node when a node is unable to take more writes. But with large systems, high write rates should be handled easily thanks to the many cores. The need was to improve scaling with more cores, so this suggestion didn't help. After many rounds of testing, it was observed that the current implementation uses a single thread for the commitlog syncing activity.
> Commitlog files are mapped using the mmap system call and changes are written with msync. Monitoring the periodic syncing with the JVisualVM tool shows:
>
> 1. The thread is not 100% busy when a ramdisk is used for commitlog storage, and scaling improved on large systems. Ramdisk scores > 2x the NVMe score.
> 2. The thread becomes 100% busy when NVMe is used for the commitlog, and the score does not improve much beyond 16 cores.
>
> The Linux kernel uses 4K pages for memory mapped with the mmap system call. So, to understand this further, disk I/O testing was done using the FIO tool, and the results show:
>
> 1. NVMe 4K random R/W throughput is very low with a single thread, and it improves with multiple threads.
> 2. Ramdisk 4K random R/W throughput is good with a single thread already, and it is also better with multiple threads.
>
> Based on the FIO test results, the following two ideas were tested for commitlog files with Cassandra-3.11.10 sources:
>
> 1. Enable the Direct IO feature for commitlog files (similar to [CASSANDRA-14466] Enable Direct I/O - ASF JIRA, https://issues.apache.org/jira/browse/CASSANDRA-14466).
> 2. Enable multi-threaded syncing for commitlog files.
>
> The first one needs to be retested. Interestingly, the second one helped to improve the score with the NVMe disk. The NVMe disk configuration score is within 80-90% of the ramdisk score and 2 times that of the single-threaded implementation. Multithreading was enabled by adding a new thread pool in the "AbstractCommitLogSegmentManager" class and changing the syncing thread into a manager thread for this new pool to take care of synchronization. This has only been tested with Cassandra-3.11.10 and needs complete testing, but the change is working in my test environment. I tried these few experiments so that I could discuss them here and seek your valuable suggestions to identify the right fix for insert-heavy workloads.
>
> 1. Is it a good idea to convert the single-threaded syncing to a multi-threaded implementation to improve the disk IO?
> 2. Direct I/O throughput is high with a single thread and is a good fit for the commitlog case due to the file sizes. This would improve writes on small to large systems. Is it good to bring this support to commitlog files?
>
> Please suggest.
>
> Thanks,
> Amit Pawar
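
For illustration, here is a minimal, self-contained sketch of the multi-threaded segment sync idea described in Amit's mail above: a small java.util.concurrent pool forces several mapped commit log segments to disk in parallel while the original sync thread only coordinates. The class and method names are illustrative only; this is not the actual patch against AbstractCommitLogSegmentManager, and a real change would also have to preserve the existing sync ordering and error handling.

    import java.nio.MappedByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Illustrative only: syncs several mapped commit log segments in parallel,
    // while the caller (the former single sync thread) just coordinates.
    public class ParallelSegmentSyncer
    {
        private final ExecutorService syncPool;

        public ParallelSegmentSyncer(int threads)
        {
            // Pool size would come from a config knob or the available core count.
            this.syncPool = Executors.newFixedThreadPool(threads);
        }

        // Force each dirty mapped segment to disk in parallel (one msync per segment)
        // and wait for all of them, so the durability guarantee of the periodic sync
        // stays the same as in the single-threaded implementation.
        public void syncAll(List<MappedByteBuffer> dirtySegments) throws Exception
        {
            List<Future<?>> pending = new ArrayList<>();
            for (MappedByteBuffer segment : dirtySegments)
            {
                pending.add(syncPool.submit(() -> { segment.force(); }));
            }
            for (Future<?> f : pending)
            {
                f.get(); // coordinating thread blocks until every segment is durable
            }
        }

        public void shutdown()
        {
            syncPool.shutdown();
        }
    }

Whether parallel syncing actually helps depends on the device, which matches the FIO observation in the thread: NVMe benefits from the higher queue depth of several concurrent msync calls, whereas a ramdisk is already fast enough with a single thread.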