Hi Colin,

Thank you for your kind and thoughtful reply. Thank you also for clarifying why it is important to distinguish between disk problems and first boot for a log directory.

I completely agree that losing all metadata is a very serious issue, and we should strive to make that as unlikely as possible. Currently, the storage format step simply ensures each log directory exists and creates a meta.properties file with clusterId and nodeId in each configured log directory. The nodeId is already a configuration property, and clusterId is being proposed in this KIP as a new one, so the bootstrapping information generated by the format step can optionally be made redundant.
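For reference, this is roughly what that external step amounts to today. The commands are the existing kafka-storage.sh ones; the IDs below are made-up examples:

    $ bin/kafka-storage.sh random-uuid
    J7s9e8PPTKOO47PxzI39VA
    $ bin/kafka-storage.sh format -t J7s9e8PPTKOO47PxzI39VA -c config/kraft/server.properties

    # <log.dir>/meta.properties, one copy per configured log directory
    version=1
    cluster.id=J7s9e8PPTKOO47PxzI39VA
    node.id=1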
So if I understand correctly, in the scenario you describe where disks "erroneously show up as blank", we are relying on the existence of this file, when the KafkaRaftServer starts, to prevent disaster and halt the system until there is manual intervention.

Currently, all the log directories must be formatted - not just the metadata directory - that is, all log directories must contain `meta.properties`. This is validated in BrokerMetadataCheckpoint.getBrokerMetadataAndOfflineDirs. The validation that this file exists *in every log directory* is only done when the "controller" role is in effect, and that includes "broker, controller". This means we currently require the external storage format step to run whenever a non-metadata disk is replaced, which just seems unnecessary.

Many of the ways disks fail do not enable this scenario where data is lost. The disk might be unmounted, become read-only, or otherwise generate IO failures. In any of these cases, an automatic step to format the log directory would also fail, preventing an amnesiac metadata quorum. To risk data loss in the scenario you describe, the disk needs to be available and usable, but also blank.
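To make that concrete, the automatic step I have in mind would look something like the sketch below. This is a hypothetical illustration, not the code in the linked PR; maybeFormat and its details are made up for the example:

    import java.nio.file.{Files, Path, Paths}
    import java.util.Properties

    // Hypothetical sketch: format a log directory only when it is empty but
    // writable. An unmounted, read-only, or otherwise failing disk throws an
    // IOException out of one of these calls, so the automatic step fails too
    // and the directory is treated as offline rather than silently formatted.
    def maybeFormat(logDir: String, clusterId: String, nodeId: Int): Unit = {
      val dir: Path = Paths.get(logDir)
      Files.createDirectories(dir) // IO error here if the mount is missing or read-only
      val metaFile = dir.resolve("meta.properties")
      if (!Files.exists(metaFile)) {
        val entries = Files.list(dir)
        val hasData = try entries.findAny().isPresent finally entries.close()
        if (hasData)
          throw new IllegalStateException(
            s"$logDir contains data but no meta.properties, refusing to format")
        val props = new Properties()
        props.setProperty("version", "1")
        props.setProperty("cluster.id", clusterId)
        props.setProperty("node.id", nodeId.toString)
        val out = Files.newOutputStream(metaFile) // IO error here on a read-only disk
        try props.store(out, "automatically formatted") finally out.close()
      }
    }

The only case in which this would actually format anything is a directory that is present, writable, and completely empty - exactly the scenario above where the disk is available and usable but blank.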
I can think of use-cases where this isn't a concern, such as a) a platform where disks are slow to be repaired and replaced, or b) a controller group large enough to make simultaneous disk failure across a quorum highly unlikely. In such cases, a non-default option to disable this meta.properties pre-existence guard can have a net positive value.

I am aware of the similar initialization steps required by other systems. However, I have some difficulty envisioning always requiring manual intervention upon disk failure as a desirable solution for Kafka in general. Not having an automated way to deal with unformatted log directories means that an operator needs to intervene and run this command before the instance is operational again. Unless it's actually protecting the user, Kafka shouldn't be any more difficult to use than necessary.

Please let me know your thoughts on this.

Best,

--
Igor


> On 2 Dec 2021, at 22:52, Colin McCabe <cmcc...@apache.org> wrote:
>
> Hi Igor,
>
> It is common for databases, filesystems, and other similar programs to require a formatting step before they are used. For example, postgres requires you to run initdb. Linux requires you to run mkfs before using a filesystem. Windows requires you to run "format c:/", or something equivalent. Ceph requires you to run the ceph-deploy tool or a similar tool. It's really not a high operational burden because it only has to be done once when the system is initialized.
>
> With a clearly defined initialization step, you can clearly distinguish disk problems from simply the first startup of a cluster. This is actually quite important to the correctness of the system. For example, if I start up two out of three Raft nodes and their disks erroneously show up as blank, I could elect a leader with an empty log. In that case, I've silently lost all the metadata in the system.
>
> In general, there is a bootstrapping problem where brokers may not be able to connect to the controller quorum without first having some local metadata. For example, if you are managing users using SCRAM, the SCRAM principal for the broker needs to exist before the connection can be made. We call this "bootstrapping" because it requires you to "lift yourself up by your own bootstraps." You need the metadata to fetch the metadata. The explicit initialization step breaks the cycle and allows the cluster to be successfully created.
>
> I agree that in testing, it is nice not to have to run a separate command. To facilitate this, we could have a bash script that allows developers to start up a single node cluster without running kafka-storage.sh. That might be helpful. I suppose a docker image is another way to do it, which might also help people test.
>
> best,
> Colin
>
>
> On Mon, Nov 29, 2021, at 12:20, Igor Soarez wrote:
>> Hi all,
>>
>> Bumping this thread as it's been a while.
>>
>> Looking forward to any kind of feedback, please take a look.
>>
>> I created a short PR with a possible implementation - https://github.com/apache/kafka/pull/11549
>>
>> --
>> Igor
>>
>>
>>
>>> On 18 Oct 2021, at 15:11, Igor Soarez <soa...@apple.com.INVALID> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to propose that we simplify the operation of KRaft servers a bit by removing the requirement to run kafka-storage.sh for new storage directories.
>>>
>>> Please take a look at the KIP and provide your feedback:
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-785%3A+Automatic+storage+formatting
>>>
>>> --
>>> Igor
>>>