Todd Lipcon created KUDU-2204:

             Summary: Revamp storage of tablet server metadata
                 Key: KUDU-2204
             Project: Kudu
          Issue Type: New Feature
          Components: consensus, tserver
    Affects Versions: 1.6.0
            Reporter: Todd Lipcon
            Assignee: Todd Lipcon

Since the very early days of Kudu we've used the file system as a sort of 
key-value store for various bits of metadata -- in particular, each tablet has 
a "tablet metadata" file and a "consensus metadata file". We also have various 
per-disk metadata, as well as LBM block metadata, but I'll leave those out of 
the present discussion.

Although the file system "works" and got us pretty far, there are various 
benefits we could gain by switching to something closer to a KV-store:

- write performance: writing a single file per "record", fsyncing it, and 
renaming it to replace the old version is pretty slow. File systems empirically 
do a poor job of group-commit of multiple of these types of operations 
happening at the same time, especially as the disks get busy
- read performance: on startup, we need to read all these metadata files, and 
using separate files means it's likely they're on entirely separate areas of 
disk. If we have, for example, 100k tablets in the future, we'll pay 100k 
random reads, which could take 10+ minutes on a cold hard disk.
- scalability: if we want to get to 100k+ tablets per server, we start entering 
the territory where file systems are unhappy with such a high quantity of files

Solving the performance problem also opens us up a lot of headroom to do things 
such as replicate cmeta information for each tablet onto two or three disks, 
which has a lot of benefits for handling disk-failure. (you can lose a disk 
without getting "amnesia" of a raft vote, for example).

This message was sent by Atlassian JIRA

Reply via email to