[ https://issues.apache.org/jira/browse/KUDU-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223327#comment-16223327 ]
Todd Lipcon commented on KUDU-2204: ----------------------------------- A few results from el6 with ext4 (with no workload running): {code} fsync storage time yes fs 160s yes rocksdb 3.05 +/- 5.34% no fs 0.375 +/- 1.39% no rocksdb 0.292 +/- 5.69% {code} Note that the "fsync with fs storage" option took a ridiculously long time. In fact it even caused a kernel panic the first time I run it (under perf stat) due to some apparent bug. > Revamp storage of tablet server metadata > ---------------------------------------- > > Key: KUDU-2204 > URL: https://issues.apache.org/jira/browse/KUDU-2204 > Project: Kudu > Issue Type: New Feature > Components: consensus, tserver > Affects Versions: 1.6.0 > Reporter: Todd Lipcon > Assignee: Todd Lipcon > > Since the very early days of Kudu we've used the file system as a sort of > key-value store for various bits of metadata -- in particular, each tablet > has a "tablet metadata" file and a "consensus metadata file". We also have > various per-disk metadata, as well as LBM block metadata, but I'll leave > those out of the present discussion. > Although the file system "works" and got us pretty far, there are various > benefits we could gain by switching to something closer to a KV-store: > - write performance: writing a single file per "record", fsyncing it, and > renaming it to replace the old version is pretty slow. File systems > empirically do a poor job of group-commit of multiple of these types of > operations happening at the same time, especially as the disks get busy > - read performance: on startup, we need to read all these metadata files, and > using separate files means it's likely they're on entirely separate areas of > disk. If we have, for example, 100k tablets in the future, we'll pay 100k > random reads, which could take 10+ minutes on a cold hard disk. > - scalability: if we want to get to 100k+ tablets per server, we start > entering the territory where file systems are unhappy with such a high > quantity of files > Solving the performance problem also opens us up a lot of headroom to do > things such as replicate cmeta information for each tablet onto two or three > disks, which has a lot of benefits for handling disk-failure. (you can lose a > disk without getting "amnesia" of a raft vote, for example). -- This message was sent by Atlassian JIRA (v6.4.14#64029)