Re: [DISCUSS] Revert the PR 23395, which broke the behavior of schemas

Rajan Dhabalia Wed, 09 Oct 2024 13:17:29 -0700

Let me answer each question, though most of them were already covered in the
previous reply.


>> Firstly, could you submit an issue that you encountered a ledger lost
issue, and we should solve the issue

Multiple issues have already been filed about missing schema ledgers. This
can happen in the system for multiple reasons, but the question is how to
handle it. For your reference, below are a few of the open issues caused by
missing schema ledgers; duplicates keep getting created because of it:
https://github.com/apache/pulsar/issues/20414
https://github.com/apache/pulsar/issues/14533
https://github.com/apache/pulsar/issues/15267
https://github.com/apache/pulsar/issues/5792


>> Secondly, the ledger containing user messages will also be lost if the
schema ledger is lost.

Yes, of course, and that is why we have made brokers resilient enough to
handle that situation. The same handling should be available for a broken
schema ledger as well.

>> Thirdly, after the PR, users will encounter an issue that is worse than
before. They can not consume the original messages before the schema ledger
is broken. After the PR they can not consume the messages continuously
published.

I don't think that is true. Consumers that previously connected with the
correct schema will fail to connect to the broker while the topic is
unavailable, but once the broker recovers the topic through its resiliency
handling, the topic will be available again and those consumers should be
able to reconnect and consume the data.


Let's take a step back and understand the issue fundamentally, without
jumping in with preconceived notions:

- While handling schema retrieval, the broker classifies failures as
recoverable or non-recoverable exceptions.
- If the error is non-recoverable, the broker, being resilient, considers
the schema already lost and moves forward by creating a new schema ledger
when a new producer/consumer connects.
- Now suppose for a moment that the broker does not recover, and instead the
system admin manually deletes the schema metadata node so that the broker
can recreate the schema. That is the same thing the broker did during
auto-recovery, so with manual cleanup of the schema metadata node we will
face the same issues we are discussing in this email thread. Manual cleanup
makes no difference to the outcome, but it does make a difference in
operational effort. We are running millions of topics on our Pulsar cluster,
and we have seen this issue happen for thousands of topics for various
reasons; a system admin cannot manually fix thousands of topics. In that
case, we depend on the broker's resiliency to recover and make the topics
available.
- So, let's be honest here: depending on manual cleanup will not help, and
we will end up with the same issues we are discussing for the case where
the broker handles recovery.
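
To make the point about manual cleanup concrete, here is a toy Java model
(not Pulsar code) of a per-topic schema store that assigns sequential ids
starting at 1, as in the example quoted further down in this thread. The
names `ToySchemaStore`, `register`, and `reset` are hypothetical, and the
model assumes that both an admin deleting the schema metadata node and
broker auto-recovery leave the topic with an empty schema store.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; names are hypothetical and this is not how Pulsar stores schemas.
final class ToySchemaStore {
    private final List<String> schemas = new ArrayList<>();

    // Returns the existing id for a known schema, or assigns the next id (1-based).
    int register(String schema) {
        int index = schemas.indexOf(schema);
        if (index >= 0) {
            return index + 1;
        }
        schemas.add(schema);
        return schemas.size();
    }

    // Assumption: manual metadata cleanup and broker auto-recovery both end here.
    void reset() {
        schemas.clear();
    }
}
```

In this model, registering `Int32` and then `String` yields ids 1 and 2.
After `reset()`, registering `String` yields id 1 again, regardless of who
triggered the reset.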


>> Makes sense, just wondering if the behavior should only be enabled when
config.isSchemaLedgerForceRecovery() == true (PIP-327)?

Yes, we can also do this. At least large-scale systems with millions of
topics can then use this feature to build resilience into the system. I am
sure this will bite many users, and as usual we will see this PR raised
again by someone in the future. Until then, I have created a PR that
resolves this disagreement using your suggestion:
PR: https://github.com/apache/pulsar/pull/23428
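
A minimal sketch of that gating, assuming the broker can tell recoverable
from non-recoverable schema read failures as described above. Only the flag
name comes from this thread (`isSchemaLedgerForceRecovery()`); the names
`SchemaRecoveryPolicy`, `Action`, and `decide` are hypothetical, chosen for
illustration, and do not reflect the code in the PR.

```java
// Hypothetical illustration; these are not the broker's actual classes.
final class SchemaRecoveryPolicy {
    enum Action { PROPAGATE_ERROR, RECREATE_SCHEMA_LEDGER, KEEP_TOPIC_UNAVAILABLE }

    // nonRecoverable: the schema read failed in a way that cannot succeed later.
    // schemaLedgerForceRecovery: stands in for config.isSchemaLedgerForceRecovery().
    static Action decide(boolean nonRecoverable, boolean schemaLedgerForceRecovery) {
        if (!nonRecoverable) {
            // Assumption: recoverable failures are surfaced to the caller unchanged.
            return Action.PROPAGATE_ERROR;
        }
        return schemaLedgerForceRecovery
                ? Action.RECREATE_SCHEMA_LEDGER
                : Action.KEEP_TOPIC_UNAVAILABLE;
    }
}
```

With the flag off, a non-recoverable failure keeps the topic unavailable
until someone intervenes; with it on, the broker recreates the schema ledger.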

Thanks,
Rajan






On Wed, Oct 9, 2024 at 4:02 AM Lari Hotari <lhot...@apache.org> wrote:

> Makes sense, just wondering if the behavior should only be enabled when
> config.isSchemaLedgerForceRecovery() == true (PIP-327)?
>
> -Lari
>
> On 2024/10/09 04:51:23 Rajan Dhabalia wrote:
> > >> When the schemas of a topic are lost, all of the messages in the topic
> > can not be consumed successfully, and producers can not publish messages
> > anymore.
> >
> > Well, there are already many issues filed about lost schema ledgers, so
> > it is a very well-known issue that the topic becomes unavailable and
> > producers are no longer able to publish messages just because of server
> > issues. That is extremely critical for mission-critical use cases and
> > not acceptable. Because of that, brokers must be fault tolerant and
> > recover when ledgers are missing. The broker already handles such issues
> > for managed-ledger and managed-cursor ledgers, and it should handle them
> > for the schema ledger as well, so that a missing schema ledger does not
> > make the topic unavailable. It is also very important not to rely only on
> > operational effort to fix such unavailability issues manually; the broker
> > should have a mechanism to recover by itself.
> > Therefore, this PR is important to make sure tenants don't face topic
> > unavailability due to the well-known issue of missing schema ledgers.
> >
> > Thanks,
> > Rajan
> >
> > On Tue, Oct 8, 2024 at 8:29 PM Yubiao Feng
> > <yubiao.f...@streamnative.io.invalid> wrote:
> >
> > > Hi all
> > >
> > > Background: When the schemas of a topic are lost, all of the messages in
> > > the topic can not be consumed successfully, and producers can not publish
> > > messages anymore. This mechanism alerts users to try to recover their
> > > schemas or recreate their topics.
> > >
> > > https://github.com/apache/pulsar/pull/23395 added a patch: producers will
> > > rebuild schemas if the original schemas were lost, which will mix the old
> > > schema and new schema as the same schema ID. For example:
> > > - send M1 with schema `Int32`, get schema id: `1`
> > > - send M2 with schema `String`, get schema id: `2`
> > > - schemas are lost
> > > - send M3 with schema `String`, get schema id `1`
> > >
> > > The messages `M1` and `M3` use different schemas, but have the same schema
> > > id, but users assume all things are fine, which is dangerous. So I want to
> > > revert this PR.
> > >
> > > Thanks
> > > Yubiao Feng
> > >
> >
>
