[ https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961694#comment-16961694 ]

YangSong commented on KUDU-2975:
--------------------------------

I've been looking at this recently, and I quite agree with Andrew Wong: it's 
worth thinking about this from a Raft perspective. One solution is to keep the 
metadata in a single, separate directory, but spread the WALs across data 
directories, as Andrew Wong said. Here are some of my thoughts.
 * The append thread may no longer be a single thread. The number of threads 
may depend on SSD I/O performance; this assumes there is one append thread per 
disk.
 * When a disk fails, we stop its append thread and handle it like a "data 
directory" disk failure. The case of a WAL missing on restart is already 
covered in RunBootstrap: the metadata shows blocks, but there is no WAL to 
recover from. The current behavior is to skip the problematic tablet and 
report it to the master, which then recovers the tablet through a Raft config 
change. Here we may also need to handle a failed disk rejoining the system, 
and we may need to write the path of each tablet's WAL into the tablet's 
metadata.

We know SSDs are more prone to failure than SATA disks. If the metadata disk 
fails, every tablet on the tserver has to be recovered. As another thought 
experiment, let's assume we spread both the WALs and the metadata across 
directories. To do that we have to be able to "remember" all the tablets that 
were deleted: if a disk fails, the normally running tablets can be recovered 
by Raft, but the tombstoned tablet replicas cannot. So we may need to provide 
a special path to store the tombstoned replicas' information, though that 
still doesn't prevent a broken disk from affecting all the tablets on the 
tserver. Another possibility is to store tombstoned tablets' metadata on the 
master and have the tserver request it at startup, but that would undermine 
the independence between master and tserver. Either way, the tombstoned 
tablet metadata is a single copy, which in itself has an impact on the 
cluster's high availability.
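The "special path" idea could be as simple as an append-only registry of tombstoned tablet IDs, reloaded at startup. This is a toy sketch under that assumption (not Kudu's actual superblock or metadata format), and, as noted above, the registry's own disk remains a single copy.

```cpp
// Toy sketch of a dedicated tombstone registry file (hypothetical format):
// tombstoned tablet IDs are appended one per line and reloaded on startup.
#include <fstream>
#include <set>
#include <string>
#include <utility>

class TombstoneRegistry {
 public:
  explicit TombstoneRegistry(std::string path) : path_(std::move(path)) {}

  // Record a deleted tablet; a real implementation would fsync.
  void Add(const std::string& tablet_id) {
    std::ofstream out(path_, std::ios::app);
    out << tablet_id << '\n';
  }

  // On startup, reload the set of tablets known to be tombstoned.
  std::set<std::string> Load() const {
    std::set<std::string> ids;
    std::ifstream in(path_);
    std::string line;
    while (std::getline(in, line)) {
      if (!line.empty()) ids.insert(line);
    }
    return ids;
  }

 private:
  std::string path_;
};
```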

Another potential problem is that multi-threaded appends increase the write 
rate; if the write speed exceeds the flush/compaction speed, we might need 
something like RocksDB's write stall/stop. Currently a tablet has only one 
MemRowSet and takes row locks when writing, so maybe that's not a problem.
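For reference, the RocksDB-style stall logic mentioned above boils down to two thresholds. This is a minimal sketch with hypothetical limits, not anything Kudu implements today: below a soft limit writes proceed, between the soft and hard limits they are delayed, and above the hard limit they stop until flush/compaction catches up.

```cpp
// Minimal sketch of RocksDB-style write stall thresholds (hypothetical).
#include <cstdint>

enum class WriteDecision { kProceed, kDelay, kStop };

// unflushed_bytes: data sitting in MemRowSets awaiting flush/compaction.
WriteDecision CheckWriteStall(int64_t unflushed_bytes,
                              int64_t soft_limit,
                              int64_t hard_limit) {
  if (unflushed_bytes >= hard_limit) return WriteDecision::kStop;
  if (unflushed_bytes >= soft_limit) return WriteDecision::kDelay;
  return WriteDecision::kProceed;
}
```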

> Spread WAL across multiple data directories
> -------------------------------------------
>
>                 Key: KUDU-2975
>                 URL: https://issues.apache.org/jira/browse/KUDU-2975
>             Project: Kudu
>          Issue Type: New Feature
>          Components: fs, tablet, tserver
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new Kudu cluster in which every node has 12 SSDs. 
> Then we created a big table and loaded data into it through Flink. We 
> noticed that the utilization of the one SSD used to store the WAL was 100% 
> while the others were idle. So we suggest spreading the WAL across multiple 
> data directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
