[
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961694#comment-16961694
]
YangSong commented on KUDU-2975:
--------------------------------
I've been looking at this recently, and I quite agree with Andrew Wong: it's
worth thinking about this from a Raft perspective. One solution is to keep the
metadata in a single, separate directory while spreading the WALs across
directories, as Andrew Wong suggested. Here are some of my thoughts.
* The append thread may no longer be a single thread. The number of threads may
depend on SSD I/O performance; this assumes there is one append thread per
disk.
* When a disk fails, we stop its append thread and handle it like a "data
directory" disk failure. The case where the WAL is missing on restart is
already covered in RunBootstrap: the metadata contains blocks, but there is no
WAL to recover. The current behavior is to skip the problematic tablet and
report it to the master, which then recovers the tablet through a Raft config
change. We may also need to handle a failed disk rejoining the system, which
may require writing the WAL's path into the tablet's metadata.
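The one-append-thread-per-disk idea above could be sketched roughly as follows. This is a hypothetical illustration, not Kudu's actual Log implementation: each tablet's WAL is pinned to one disk (here by a stable hash; in practice the mapping would be recorded in the tablet's metadata), so all appends for that tablet land on the same per-disk queue, preserving per-log ordering while letting different disks proceed in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch: one append queue per WAL disk, each drained by a
// dedicated append thread (the threads themselves are elided here).
class WalAppendRouter {
 public:
  explicit WalAppendRouter(size_t num_wal_disks) : queues_(num_wal_disks) {}

  // Stable mapping from tablet id to WAL disk. A real implementation would
  // pick the disk at tablet creation and persist it in the tablet metadata
  // rather than re-hashing on every lookup.
  size_t DiskForTablet(const std::string& tablet_id) const {
    return std::hash<std::string>{}(tablet_id) % queues_.size();
  }

  // Every append for a given tablet lands on the same queue, so the
  // per-log ordering guaranteed by the old single append thread survives.
  void Append(const std::string& tablet_id, const std::string& entry) {
    queues_[DiskForTablet(tablet_id)].push_back(entry);
  }

  size_t QueueDepth(size_t disk) const { return queues_[disk].size(); }

 private:
  std::vector<std::vector<std::string>> queues_;  // one per WAL disk
};
```

Stopping the append thread for a failed disk then maps naturally to draining and retiring that disk's queue.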
We know SSDs are more prone to failure than SATA disks. If the metadata disk
fails, every tablet on the tserver must be recovered. As another thought
experiment, suppose we spread the WALs and metadata across directories. To do
that we have to be able to "remember" all the tablets that were deleted: if a
disk fails, the normally running tablets can be recovered by Raft, but the
tombstoned tablet replicas cannot. So we may need to provide a special path to
store the tombstoned replicas' information. Even then, there is no way to
prevent a broken disk from affecting all the tablets on the tserver. Another
option is to store tombstoned tablets' metadata on the master and have the
tserver request it at startup, but that would undermine the independence
between master and tserver. In any case, the tombstoned-tablet record is a
single copy, which in itself affects the cluster's high availability.
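The "special path" idea could look something like the sketch below. Everything here is hypothetical (the class name, the file format, the path): tombstone records are appended to a small log kept on a dedicated directory, independent of the data disks, so a failed data disk cannot erase the memory of which replicas were deleted.

```cpp
#include <cassert>
#include <fstream>
#include <set>
#include <string>

// Hypothetical sketch of a tombstone registry persisted on a dedicated
// path (e.g. under the metadata directory). Durability details such as
// fsync and record checksums are elided.
class TombstoneRegistry {
 public:
  explicit TombstoneRegistry(const std::string& path) : path_(path) {
    // Replay previously recorded tombstones on startup.
    std::ifstream in(path_);
    std::string id;
    while (in >> id) tombstoned_.insert(id);
  }

  void RecordTombstone(const std::string& tablet_id) {
    tombstoned_.insert(tablet_id);
    std::ofstream out(path_, std::ios::app);
    out << tablet_id << "\n";  // append-only record, one id per line
  }

  bool IsTombstoned(const std::string& tablet_id) const {
    return tombstoned_.count(tablet_id) > 0;
  }

 private:
  std::string path_;
  std::set<std::string> tombstoned_;
};
```

As noted above, this still leaves the registry itself as a single copy unless it is replicated elsewhere.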
Another potential problem is that multi-threaded appends increase the write
rate; if writes outpace flush/compaction, we might need something like
RocksDB's write stall/stop. Currently each tablet has only one MemRowSet and
takes row locks when writing, so perhaps this is not a problem.
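The stall/stop behavior borrowed from RocksDB could be sketched as a simple two-threshold controller. The thresholds and names below are made-up illustrative values, not anything from Kudu or RocksDB: below a soft limit writes proceed normally, above it they are slowed, and past a hard limit they are stopped until flush/compaction catches up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a RocksDB-style write controller keyed off the
// amount of unflushed data accumulated behind the faster WAL path.
enum class WriteState { kNormal, kSlowdown, kStop };

class WriteController {
 public:
  WriteController(size_t soft_limit_bytes, size_t hard_limit_bytes)
      : soft_limit_(soft_limit_bytes), hard_limit_(hard_limit_bytes) {}

  // Consulted on each write; unflushed shrinks as flushes retire data.
  WriteState OnUnflushedBytes(size_t unflushed) const {
    if (unflushed >= hard_limit_) return WriteState::kStop;
    if (unflushed >= soft_limit_) return WriteState::kSlowdown;
    return WriteState::kNormal;
  }

 private:
  size_t soft_limit_;
  size_t hard_limit_;
};
```

Whether this is needed at all depends, as noted, on whether the single MemRowSet and row locking already bound the effective write rate.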
> Spread WAL across multiple data directories
> -------------------------------------------
>
> Key: KUDU-2975
> URL: https://issues.apache.org/jira/browse/KUDU-2975
> Project: Kudu
> Issue Type: New Feature
> Components: fs, tablet, tserver
> Reporter: LiFu He
> Priority: Major
> Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new Kudu cluster where every node has 12 SSDs. Then
> we created a big table and loaded data into it through Flink. We noticed that
> the utilization of the one SSD used to store the WAL was 100% while the
> others were idle. So we suggest spreading the WAL across multiple data
> directories.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)