[
https://issues.apache.org/jira/browse/KUDU-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-2204:
------------------------------
Labels: roadmap-candidate (was: )
> Revamp storage of tablet server metadata
> ----------------------------------------
>
> Key: KUDU-2204
> URL: https://issues.apache.org/jira/browse/KUDU-2204
> Project: Kudu
> Issue Type: New Feature
> Components: consensus, tserver
> Affects Versions: 1.6.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Major
> Labels: roadmap-candidate
>
> Since the very early days of Kudu we've used the file system as a sort of
> key-value store for various bits of metadata -- in particular, each tablet
> has a "tablet metadata" file and a "consensus metadata file". We also have
> various per-disk metadata, as well as LBM block metadata, but I'll leave
> those out of the present discussion.
> Although the file system "works" and got us pretty far, there are various
> benefits we could gain by switching to something closer to a KV-store:
> - write performance: writing a single file per "record", fsyncing it, and
> renaming it to replace the old version is pretty slow. File systems
> empirically do a poor job of group-commit of multiple of these types of
> operations happening at the same time, especially as the disks get busy
> - read performance: on startup, we need to read all these metadata files, and
> using separate files means it's likely they're on entirely separate areas of
> disk. If we have, for example, 100k tablets in the future, we'll pay 100k
> random reads, which could take 10+ minutes on a cold hard disk.
> - scalability: if we want to get to 100k+ tablets per server, we start
> entering the territory where file systems are unhappy with such a high
> quantity of files
> Solving the performance problem also opens us up a lot of headroom to do
> things such as replicate cmeta information for each tablet onto two or three
> disks, which has a lot of benefits for handling disk-failure. (you can lose a
> disk without getting "amnesia" of a raft vote, for example).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)