Hi Colin,

Thank you for your kind and thoughtful reply.

Thank you also for clarifying why it is important to distinguish between disk 
problems and first boot for a log directory. I completely agree that losing 
all metadata is a very serious issue, and we should strive to make that as 
unlikely to happen as possible.

Currently, the storage format step simply ensures that each configured log 
directory exists and creates a meta.properties file with clusterId and nodeId 
in each of them. The nodeId is already a configuration property, and clusterId 
is being proposed in this KIP as a new one. The bootstrapping
information generated by the format step can optionally be made redundant. So 
if I understand correctly, in the scenario you describe, where disks 
"erroneously show up as blank", when the KafkaRaftServer starts, we are relying 
on the existence of this file to prevent disaster and halt the system until 
there is manual intervention.
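
For reference, the external step today amounts to roughly the following (the 
cluster id is just illustrative):

    $ bin/kafka-storage.sh random-uuid
    J8qXhLZUQWeEZh3mPBMKsg
    $ bin/kafka-storage.sh format -t J8qXhLZUQWeEZh3mPBMKsg \
        -c config/kraft/server.properties

    # resulting meta.properties in each configured log directory:
    version=1
    cluster.id=J8qXhLZUQWeEZh3mPBMKsg
    node.id=1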

Currently, all the log directories must be formatted - not just the metadata 
directory - that is, all log directories must contain `meta.properties`. 
This is validated in BrokerMetadataCheckpoint.getBrokerMetadataAndOfflineDirs. 
The validation that this file exists *in every log directory* is only done when 
the “controller” role is in effect, and that includes “broker, controller”. 
This means we currently require the external storage format step to run 
whenever a non-metadata disk is replaced, which just seems unnecessary.
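
To make that concrete, the check amounts to something like the following 
simplified Scala sketch (not the actual implementation, just the shape of it):

    import java.io.File

    // Sketch: with the controller role, every configured log directory
    // must already contain meta.properties, otherwise startup halts
    // until someone runs the storage format step.
    def ensureFormatted(logDirs: Seq[String]): Unit = {
      logDirs.foreach { dir =>
        val metaFile = new File(dir, "meta.properties")
        if (!metaFile.exists())
          throw new RuntimeException(
            s"Log directory $dir is not formatted: missing meta.properties")
      }
    }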

Many of the ways disks fail do not enable this scenario where data is lost. The 
disk might be unmounted, become read-only, or otherwise generate IO failures. 
In any of these cases, an automatic step to format the log directory would also 
fail and prevent an amnesiac metadata quorum. To risk data loss in the scenario 
you describe, we need the disk to be available and usable but also blank. I 
can think of use-cases here where this isn't a concern, such as a) a platform 
where disks are slow to be repaired and replaced, or b) a controller group 
large enough to make simultaneous disk failure across a quorum highly 
unlikely. In such cases, a non-default option to disable this meta.properties 
pre-existence guard can have a net positive value.
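
To illustrate (the property name below is hypothetical, just for the sake of 
example - I'm not claiming this is the final shape):

    # server.properties - hypothetical opt-out, disabled by default,
    # so the meta.properties pre-existence guard stays in place unless
    # the operator explicitly opts in to automatic formatting
    storage.format.automatic=true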

I am aware of similar initialization steps in other systems. However, I'm 
having some difficulty seeing a blanket requirement for manual intervention 
upon disk failure as a desirable solution in Kafka. Without an automated way 
to deal with unformatted log directories, an operator needs to intervene and 
run the format command before the instance is operational again. Unless it's 
actually protecting the user, Kafka shouldn't be any more difficult to use 
than necessary.

Please, let me know your thoughts on this.

Best,

--
Igor


> On 2 Dec 2021, at 22:52, Colin McCabe <cmcc...@apache.org> wrote:
> 
> Hi Igor,
> 
> It is common for databases, filesystems, and other similar programs to 
> require a formatting step before they are used. For example, postgres 
> requires you to run initdb. Linux requires you to run mkfs before using a 
> filesystem. Windows requires you to run "format c:/", or something 
> equivalent. Ceph requires you to run the ceph-deploy tool or a similar tool. 
> It's really not a high operational burden because it only has to be done once 
> when the system is initialized.
> 
> With a clearly defined initialization step, you can clearly distinguish disk 
> problems from simply the first startup of a cluster. This is actually quite 
> important to the correctness of the system. For example, if I start up two 
> out of three Raft nodes and their disks erroneously show up as blank, I could 
> elect a leader with an empty log. In that case, I've silently lost all the 
> metadata in the system.
> 
> In general, there is a bootstrapping problem where brokers may not be able to 
> connect to the controller quorum without first having some local metadata. 
> For example, if you are managing users using SCRAM, the SCRAM principal for 
> the broker needs to exist before the connection can be made. We call this 
> "bootstrapping" because it requires you to "lift yourself up by your own 
> bootstraps." You need the metadata to fetch the metadata. The explicit 
> initialization step breaks the cycle and allows the cluster to be 
> successfully created.
> 
> I agree that in testing, it is nice not to have to run a separate command. To 
> facilitate this, we could have a bash script that allows developers to start 
> up a single node cluster without running kafka-storage.sh. That might be 
> helpful. I suppose a docker image is another way to do it, which might also 
> help people test.
> 
> best,
> Colin
> 
> 
> On Mon, Nov 29, 2021, at 12:20, Igor Soarez wrote:
>> Hi all,
>> 
>> Bumping this thread as it’s been a while.
>> 
>> Looking forward to any kind of feedback, please take a look.
>> 
>> I created a short PR with a possible implementation - 
>> https://github.com/apache/kafka/pull/11549
>> 
>> --
>> Igor
>> 
>> 
>> 
>>> On 18 Oct 2021, at 15:11, Igor Soarez <soa...@apple.com.INVALID> wrote:
>>> 
>>> Hi all,
>>> 
>>> I'd like to propose that we simplify the operation of KRaft servers a bit 
>>> by removing the requirement to run kafka-storage.sh for new storage 
>>> directories.
>>> 
>>> Please take a look at the KIP and provide your feedback:
>>> 
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-785%3A+Automatic+storage+formatting
>>> 
>>> --
>>> Igor
>>> 
