[
https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963575#comment-16963575
]
Andrew Wong commented on KUDU-2975:
-----------------------------------
I think the easiest path forward right now is to decouple the WAL and metadata,
keeping the metadata in a single directory specified by the
{{--fs_metadata_dir}}
flag, and adding a new {{--fs_wal_dirs}} (plural) flag. We should keep around
{{--fs_wal_dir}} for backwards compatibility, but verify that only one is set
using a flag validator.
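The mutual-exclusion check could look roughly like this sketch (Python used purely as pseudocode; the real thing would be a gflags group validator in C++, and the function name here is hypothetical):

```python
def validate_wal_flags(fs_wal_dir: str, fs_wal_dirs: str):
    """Hypothetical sketch of a flag validator: the legacy --fs_wal_dir
    and the new --fs_wal_dirs flags must not both be set."""
    if fs_wal_dir and fs_wal_dirs:
        return (False, "--fs_wal_dir and --fs_wal_dirs are mutually exclusive")
    return (True, "")
```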
Here are my thoughts on a potential initial implementation:
* Each tablet would continue to have a single WAL directory associated with it,
and these would be assigned by something like the DataDirManager (perhaps
a new WalDirManager class?) that learns which tablet replicas have their WALs
stored in each directory and passes that information through to
FsManager::GetTabletWalDir() and related methods.
* When a server first starts up, the FsManager should look for an {{instance}}
file and {{/wals}} subdirectory in each of the specified {{--fs_wal_dirs}}
directories. For now, let's assume all of them exist, and let's return an
error if any don't exist (failed/missing disk tolerance can be built later;
let's keep the initial implementation simple). This should be pretty easy to
verify in tests, and it simplifies reasoning about "missing" WALs
initially: a missing WAL is still a fatal error.
* With that as our initial assumption (that both WALs _and_ metadata must
exist), if a tablet/consensus metadata exists for tablet {{A}} but there's no
WAL directory for {{A}}, Kudu should crash.
* Any kind of "funny business" should not be tolerated (e.g. multiple WAL
directories for a single tablet should be a fatal error as well).
* We should also update the {{fs check}} tool to accept {{--fs_wal_dirs}} and
check for {{/wals}} in each, and the {{fs update_dirs}} tool could be updated
to install a {{/wals}} subdirectory where appropriate.
* We might want to assign UUIDs to each WAL directory, similar to what we do
for the data directories, and persist them in the tablet metadata. It seems
more robust, and I explain more below.
Once that groundwork is done, we can think about implementing (and testing) the
"trickier" parts of this, when the WAL is failed/missing:
* What happens if a WAL disk fails, replica {{A}} is re-replicated, and Kudu
places the new copy of {{A}} back on the same tablet server? While running,
this might be fine, but if we restart the tserver and the bad disk has become
readable again, we might now have _two_ WAL directories for {{A}}!
How should we handle this?
** Maybe fail {{A}}? This doesn't seem very robust.
** Maybe we need to begin assigning UUIDs to WAL directories like we do for
data directories, and persist the WAL directory UUID into the tablet
metadata. We probably don't need to have all the consistency checks that we
have for the block manager because each WAL directory is isolated. In
general, adding UUIDs seems like it might be a good idea anyway; this
could even be put in the initial implementation.
* If, after we've passed the initial FsManager checks and start bootstrapping
{{A}} and find that there is no WAL, what should we do?
** Failing {{A}} seems reasonable. I believe we should be safe, even from a
Raft point of view: other replicas will see this replica as
FAILED_UNRECOVERABLE and attempt to evict it.
* During runtime, if we begin writing to the WAL and hit a disk failure code
(EIO, ENODEV, etc., see util/status.h), we should fail every tablet replica
that has its WALs in the same directory.
** This would probably be similar to what we do for data directory failures
and CFile checksum failures. One difference here is that it's on the write
hot path, and so I would be interested to see how a failure may affect
insert performance.
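The UUID-based disambiguation in the restart scenario above could work roughly like this sketch (names are hypothetical; the point is only that the tablet metadata records which WAL directory UUID is authoritative):

```python
def resolve_wal_dir(tablet_meta_uuid, candidate_dirs):
    """Pick the WAL directory whose UUID matches the one persisted in the
    tablet metadata; any other copy (e.g. on a disk that came back after a
    failure) is stale. Hypothetical sketch.

    candidate_dirs: dict mapping directory path -> that directory's UUID.
    """
    matches = [d for d, u in candidate_dirs.items() if u == tablet_meta_uuid]
    if len(matches) == 1:
        return matches[0]  # the authoritative copy
    if not matches:
        return None        # no WAL found: fail the replica
    raise RuntimeError("multiple directories claim the same UUID")
```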
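And the runtime failure handling in the last bullet might be sketched as follows (again hypothetical; Kudu's actual disk-failure plumbing for data directories would be the model to follow):

```python
def fail_replicas_on_wal_error(failed_dir, tablet_to_dir, fail_replica):
    """On a disk-failure error code (EIO, ENODEV, ...) from a WAL write,
    fail every tablet replica whose WAL lives in the failed directory,
    mirroring what is done for data directory failures. Sketch only.

    tablet_to_dir: dict mapping tablet id -> WAL directory path.
    fail_replica:  callback invoked once per affected tablet id.
    """
    affected = [t for t, d in tablet_to_dir.items() if d == failed_dir]
    for t in affected:
        fail_replica(t)
    return affected
```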
I think this implementation would be relatively simple (compared to storing
extra tombstone metadata, or heartbeating tombstone statuses to the leader). It's
also somewhat limited (metadata is still a single point of failure). But it
gets us using multiple disks for WALs, so maybe that's good enough for now. I'm
also open to hearing more about alternate implementations and about what edge
cases we might see.
BTW, another implementation I've thought of is expanding the
DataDirManager's responsibility to include WALs. That would allow this single
"DirectoryManager" to ensure that each tablet's WAL directory is a member of
the tablet replica's directory group, so the failure tracking happens in a
single entity. I like this approach less now because it makes the
responsibilities of the DirectoryManager very large, and it tightly
couples the {{--fs_data_dirs}} flag with the WALs.
cc [~adar] and [~mpercy] since they were involved in scoping out and
reviewing a lot of the data directory failure handling work, which might be
relevant here.
> Spread WAL across multiple data directories
> -------------------------------------------
>
> Key: KUDU-2975
> URL: https://issues.apache.org/jira/browse/KUDU-2975
> Project: Kudu
> Issue Type: New Feature
> Components: fs, tablet, tserver
> Reporter: LiFu He
> Priority: Major
> Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new Kudu cluster in which every node has 12 SSDs. We
> then created a big table and loaded data into it through Flink. We noticed
> that the util of the one SSD used to store the WAL was 100% while the others
> were idle. So we suggest spreading the WAL across multiple data directories.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)