[ https://issues.apache.org/jira/browse/KUDU-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963575 ]

Andrew Wong commented on KUDU-2975:
-----------------------------------

I think the easiest path forward right now is to decouple the WAL and metadata,
 keeping the metadata in a single directory specified by the 
{{--fs_metadata_dir}}
 flag, and adding a new {{--fs_wal_dirs}} (plural) flag. We should keep around
 {{--fs_wal_dir}} for backwards compatibility, but verify that only one is set
 using a flag validator.
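
To make that concrete, the mutual-exclusion check could look something like the following (illustrative Python only; the real thing would be a gflags flag validator in Kudu's C++ code, and the function name here is made up):

```python
# Sketch of the proposed flag compatibility check: at most one of
# --fs_wal_dir (deprecated) and --fs_wal_dirs may be set.
def validate_wal_flags(fs_wal_dir, fs_wal_dirs):
    """Return (ok, message), mimicking a gflags-style validator."""
    if fs_wal_dir and fs_wal_dirs:
        return (False, "only one of --fs_wal_dir (deprecated) and "
                       "--fs_wal_dirs may be set")
    return (True, "")
```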

Here are my thoughts on a potential initial implementation:
 * Each tablet would continue to have a single WAL directory associated with it,
 and these could be assigned by something like the DataDirManager (perhaps
 a new WalDirManager class?) that learns what tablet replicas have their WALs
 stored in each directory, and passes that information through to
 FsManager::GetTabletWalDir() and related methods.
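
For illustration, a toy sketch of what such a WalDirManager might look like (Python; the class name and methods mirror the hypothetical names above plus the existing DataDirManager, and are not real Kudu APIs):

```python
import os

class WalDirManager:
    """Toy sketch of the proposed WalDirManager (all names are assumptions
    mirroring the existing DataDirManager; this is not a real Kudu API)."""

    def __init__(self, wal_roots):
        self.wal_roots = list(wal_roots)
        self.tablet_to_root = {}  # tablet_id -> WAL root holding its WALs

    def assign(self, tablet_id):
        """Assign the WAL root that currently holds the fewest tablets."""
        if tablet_id in self.tablet_to_root:
            return self.tablet_to_root[tablet_id]
        counts = {root: 0 for root in self.wal_roots}
        for root in self.tablet_to_root.values():
            counts[root] += 1
        root = min(self.wal_roots, key=lambda r: counts[r])
        self.tablet_to_root[tablet_id] = root
        return root

    def get_tablet_wal_dir(self, tablet_id):
        """Analogue of the FsManager::GetTabletWalDir() lookup."""
        return os.path.join(self.tablet_to_root[tablet_id], "wals", tablet_id)
```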

 * When a server first starts up, the FsManager should look for an {{instance}}
 file and {{/wals}} subdirectory in each of the specified {{--fs_wal_dirs}}
 directories. For now, let's assume all of them exist, and let's return an
 error if any don't exist (failed/missing disk tolerance can be built later;
 let's keep the initial implementation simple). This should be pretty easy to
 verify in tests, and it keeps the initial reasoning about "missing" WALs
 simple: a missing WAL is still a fatal error.
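
A sketch of that startup check (illustrative Python; the directory layout follows the {{instance}}-file-plus-{{/wals}} convention described above, and the function name is made up):

```python
import os

def check_wal_roots(wal_roots, instance_filename="instance"):
    """Return an error string if any WAL root is missing its instance file
    or /wals subdirectory, else None. In this initial scheme a missing
    directory is fatal -- no failed-disk tolerance yet."""
    for root in wal_roots:
        if not os.path.isfile(os.path.join(root, instance_filename)):
            return "missing instance file in WAL dir %s" % root
        if not os.path.isdir(os.path.join(root, "wals")):
            return "missing /wals subdirectory in WAL dir %s" % root
    return None
```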

 * With that as our initial assumption (that both WALs _and_ metadata must
 exist), if tablet or consensus metadata exists for tablet {{A}} but there's no
 WAL directory for {{A}}, Kudu should crash.

 * Any kind of "funny business" should not be tolerated (e.g. multiple WAL
 directories for a single tablet should be a fatal error as well).
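
Detecting that particular kind of funny business could look like the following (illustrative Python sketch, assuming the {{<root>/wals/<tablet_id>}} layout from above):

```python
import os

def find_duplicate_wal_dirs(wal_roots):
    """Scan <root>/wals/<tablet_id> across all WAL roots and report any
    tablet that appears under more than one root -- a condition the server
    should treat as fatal."""
    seen = {}  # tablet_id -> list of roots containing its WAL dir
    for root in wal_roots:
        wals = os.path.join(root, "wals")
        if not os.path.isdir(wals):
            continue
        for tablet_id in os.listdir(wals):
            seen.setdefault(tablet_id, []).append(root)
    return {t: roots for t, roots in seen.items() if len(roots) > 1}
```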

 * We should also update the {{fs check}} tool to accept {{--fs_wal_dirs}} and
 check for {{/wals}} in each, and the {{fs update_dirs}} tool could be updated
 to install a {{/wals}} subdirectory where appropriate.

 * We might want to assign UUIDs to each WAL directory, similar to what we do
 for the data directories and persist that in the tablet metadata. It seems more
 robust, and I explain more below.

Once that groundwork is done, we can think about implementing (and testing) the
 "trickier" parts of this, when the WAL is failed/missing:
 * What happens if a WAL disk fails, we re-replicate tablet {{A}}, and Kudu
 places the new replica copy on the same tablet server? While running, this
 might be fine, but if we restart the tserver and the bad disk is readable
 again, we might now have _two_ WAL directories for {{A}}! How should we
 handle this?
 ** Maybe fail {{A}}? This doesn't seem very robust.
 ** Maybe we need to begin assigning UUIDs to WAL directories like we do for
 data directories, and persist the WAL directory UUID into the tablet
 metadata. We probably don't need to have all the consistency checks that we
 have for the block manager because each WAL directory is isolated. In
 general, adding UUIDs seems like it might be a good idea anyway; this
 could even be put in the initial implementation.
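
For example, resolving duplicates by matching the UUID persisted in the tablet metadata might look like this (illustrative Python; all names are made up):

```python
def resolve_wal_dir(tablet_meta_uuid, candidate_dirs):
    """candidate_dirs: list of (dir_path, dir_uuid) pairs that all contain
    WALs for the same tablet. Keep the one whose UUID matches the UUID
    persisted in the tablet metadata; everything else is a stale leftover
    (e.g. from a disk that came back after the tablet was re-replicated)."""
    matches = [d for d, u in candidate_dirs if u == tablet_meta_uuid]
    if len(matches) != 1:
        raise RuntimeError("no unambiguous WAL dir for tablet")
    stale = [d for d, u in candidate_dirs if u != tablet_meta_uuid]
    return matches[0], stale
```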

 * If, after passing the initial FsManager checks, we start bootstrapping
 {{A}} and find that there is no WAL, what should we do?
 ** Failing {{A}} seems reasonable. I believe we should be safe, even from a
 Raft point of view – other replicas will see this replica as
 FAILED_UNRECOVERABLE and attempt to evict it.

 * During runtime, if we begin writing to the WAL and hit a disk failure code
 (EIO, ENODEV, etc., see util/status.h), we should fail every tablet replica
 that has its WALs in the same directory.
 ** This would probably be similar to what we do for data directory failures
 and CFile checksum failures. One difference here is that it's on the write
 hot path, and so I would be interested to see how a failure may affect
 insert performance.
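
A rough sketch of that failure propagation (illustrative Python; the real disk-failure errno set lives in util/status.h, so the set below contains only the examples from this comment, and all function names are made up):

```python
import errno

# Errno values treated as disk failures -- just the examples mentioned
# above; the authoritative set is in Kudu's util/status.h.
DISK_FAILURE_ERRNOS = {errno.EIO, errno.ENODEV}

def handle_wal_write_error(err_no, failed_dir, tablet_to_wal_dir, fail_tablet):
    """If the error looks like a disk failure, fail every tablet replica
    whose WALs live in the failed directory (mirroring data-dir failure
    handling). Returns the list of failed tablet ids."""
    if err_no not in DISK_FAILURE_ERRNOS:
        return []
    failed = [t for t, d in tablet_to_wal_dir.items() if d == failed_dir]
    for t in failed:
        fail_tablet(t)
    return failed
```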

I think this implementation would be relatively simple (compared to storing extra
 tombstone metadata, or heartbeating tombstone statuses to the leader). It's
 also somewhat limited (metadata is still a single point of failure). But it
 gets us using multiple disks for WALs, so maybe that's good enough for now. I'm
 also open to hearing more about alternate implementations and about what edge
 cases we might see.

BTW another implementation that I've thought of is expanding the
 DataDirManager's responsibility to include WALs. That would allow this single
 "DirectoryManager" to ensure that each tablet's WAL directory is a member of the
 tablet replica's directory group, so the "failure" tracking happens in a single
 entity. I don't like this approach as much now because it makes the
 responsibilities of the DirectoryManager very, very large, and it tightly
 couples the {{--fs_data_dirs}} flag with the WALs.

cc [~adar] and [~mpercy] since they were involved in scoping out and
 reviewing a lot of the data directory failure handling work, which might be
 relevant here.

> Spread WAL across multiple data directories
> -------------------------------------------
>
>                 Key: KUDU-2975
>                 URL: https://issues.apache.org/jira/browse/KUDU-2975
>             Project: Kudu
>          Issue Type: New Feature
>          Components: fs, tablet, tserver
>            Reporter: LiFu He
>            Priority: Major
>         Attachments: network.png, tserver-WARNING.png, util.png
>
>
> Recently, we deployed a new kudu cluster and every node has 12 SSD. Then, we 
> created a big table and loaded data to it through flink.  We noticed that the 
> util of one SSD which is used to store WAL is 100% but others are free. So, 
> we suggest spreading the WAL across multiple data directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
