[ https://issues.apache.org/jira/browse/KUDU-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223327#comment-16223327 ]

Todd Lipcon commented on KUDU-2204:
-----------------------------------

A few results from el6 with ext4 (with no workload running):

{code}
fsync    storage         time (s)
yes      fs              160
yes      rocksdb         3.05 +/- 5.34%
no       fs              0.375 +/- 1.39%
no       rocksdb         0.292 +/- 5.69%
{code}

Note that the "fsync with fs storage" option took a ridiculously long time. In 
fact, it even caused a kernel panic the first time I ran it (under perf stat), 
due to some apparent bug.
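
For reference, a minimal sketch of the two write paths being compared (this 
isn't the actual benchmark code; the function names and the "tmeta/<id>" key 
layout are just illustrative assumptions): the "fs" path writes a temp file, 
optionally fsyncs it, and renames it over the old copy, while the "rocksdb" 
path is a single Put whose WriteOptions::sync flag corresponds to the fsync 
column above.

{code}
#include <fcntl.h>
#include <unistd.h>

#include <cstdio>
#include <string>

#include <rocksdb/db.h>

// "fs" storage: write a temp file, optionally fsync it, then atomically rename
// it over the previous copy. (A fully durable version would also fsync the
// containing directory after the rename.)
bool WriteMetadataFile(const std::string& path, const std::string& data,
                       bool do_fsync) {
  const std::string tmp = path + ".tmp";
  int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  bool ok = write(fd, data.data(), data.size()) ==
            static_cast<ssize_t>(data.size());
  if (ok && do_fsync) ok = fsync(fd) == 0;  // the expensive step in the table above
  close(fd);
  return ok && rename(tmp.c_str(), path.c_str()) == 0;
}

// "rocksdb" storage: a single Put; WriteOptions::sync controls whether the WAL
// write is fsynced, mirroring the fsync=yes/no rows above. The "tmeta/<id>"
// key layout is a hypothetical example.
rocksdb::Status WriteMetadataKV(rocksdb::DB* db, const std::string& tablet_id,
                                const std::string& data, bool do_fsync) {
  rocksdb::WriteOptions opts;
  opts.sync = do_fsync;
  return db->Put(opts, "tmeta/" + tablet_id, data);
}
{code}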


> Revamp storage of tablet server metadata
> ----------------------------------------
>
>                 Key: KUDU-2204
>                 URL: https://issues.apache.org/jira/browse/KUDU-2204
>             Project: Kudu
>          Issue Type: New Feature
>          Components: consensus, tserver
>    Affects Versions: 1.6.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Since the very early days of Kudu we've used the file system as a sort of 
> key-value store for various bits of metadata -- in particular, each tablet 
> has a "tablet metadata" file and a "consensus metadata" file. We also have 
> various per-disk metadata, as well as LBM block metadata, but I'll leave 
> those out of the present discussion.
>
> Although the file system "works" and has gotten us pretty far, there are 
> various benefits we could gain by switching to something closer to a KV-store:
> - write performance: writing a single file per "record", fsyncing it, and 
> renaming it to replace the old version is pretty slow. File systems 
> empirically do a poor job of group-committing many such operations issued 
> concurrently, especially as the disks get busy.
> - read performance: on startup, we need to read all of these metadata files, 
> and using separate files means they're likely scattered across entirely 
> separate areas of the disk. If we have, for example, 100k tablets in the 
> future, we'll pay 100k random reads, which at roughly 10ms per seek comes to 
> well over 10 minutes on a cold hard disk (see the startup-scan sketch at the 
> end of this description).
> - scalability: if we want to get to 100k+ tablets per server, we start 
> entering territory where file systems are unhappy dealing with that many 
> files.
>
> Solving the performance problem also gives us a lot of headroom to do things 
> such as replicating cmeta information for each tablet onto two or three 
> disks, which has a lot of benefits for handling disk failure (you can lose a 
> disk without getting "amnesia" about a Raft vote, for example).
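>
> As a rough illustration of that startup read pattern, here is a minimal 
> sketch (not Kudu code; the "tmeta/<tablet_id>" key layout is a hypothetical 
> example) of how a KV-store such as RocksDB could recover every tablet's 
> metadata with one sequential prefix scan, instead of one random read per 
> file:
> {code}
> #include <rocksdb/db.h>
>
> #include <map>
> #include <memory>
> #include <string>
>
> // Sketch only: load all tablet metadata records in a single sequential scan.
> std::map<std::string, std::string> LoadAllTabletMetadata(rocksdb::DB* db) {
>   std::map<std::string, std::string> metas;
>   std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
>   // Hypothetical key layout: "tmeta/<tablet_id>" -> serialized metadata.
>   for (it->Seek("tmeta/"); it->Valid() && it->key().starts_with("tmeta/"); it->Next()) {
>     metas[it->key().ToString()] = it->value().ToString();
>   }
>   return metas;
> }
> {code}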


