Re: [DISCUSS] Revert the PR 23395, which broke the behavior of schemas

Yubiao Feng Thu, 10 Oct 2024 18:01:40 -0700

Hi Rajan, Lari

> Until then I have created a PR to resolve this disagreement with your
> suggestion:
> PR: https://github.com/apache/pulsar/pull/23428


Makes sense. I think we can close the discussion.

Thanks
Yubiao Feng

On Thu, Oct 10, 2024 at 4:17 AM Rajan Dhabalia <dhabalia...@gmail.com>
wrote:

> Let me answer each question though most of them were already covered in the
> previous reply.
>
> >> Firstly, could you submit an issue describing the lost-ledger problem
> >> you encountered, so that we can solve it?
>
> There are already multiple issues about missing schema ledgers. This can
> happen for several reasons, but the question is how to handle it. For your
> reference, below are a few of the open issues caused by missing schema
> ledgers; duplicates keep getting created because of it:
> https://github.com/apache/pulsar/issues/20414
> https://github.com/apache/pulsar/issues/14533
> https://github.com/apache/pulsar/issues/15267
> https://github.com/apache/pulsar/issues/5792
>
>
> >> Secondly, the ledger containing user messages will also be lost if the
> >> schema ledger is lost.
>
> Yes, of course, and that's why we have made brokers resilient enough to
> handle this situation. The same handling should be available for a broken
> schema ledger as well.
>
> >> Thirdly, after the PR, users will encounter an issue that is worse than
> >> before. They can not consume the original messages before the schema
> >> ledger is broken. After the PR they can not consume the messages
> >> continuously published.
>
> I don't think that is true. Previously connected consumers with the correct
> schema will fail to connect to the broker while the topic is unavailable,
> but once the broker's resilient handling recovers it, the topic will be
> available again and consumers should be able to reconnect and consume the
> data.
>
>
> Let's take a step back and understand the issue fundamentally, without
> jumping in with preconceived notions:
>
> - While handling schema retrieval, the broker classifies failures as
> recoverable or non-recoverable exceptions.
> - If the error is non-recoverable, the broker, being resilient, considers
> the schema already lost and moves forward by creating a new schema ledger
> once a new producer/consumer connects.
> - Now, suppose for a second that brokers do not recover and the system
> admin manually deletes the schema metadata node so that the broker can
> recreate the schema. That is the same thing the broker does during
> auto-recovery, so manual cleanup of the schema metadata node leads to the
> same issues we are discussing in this email thread. Manual cleanup makes
> no difference to the outcome, but it does make a difference in operational
> effort. We run millions of topics on the Pulsar cluster and we saw this
> issue happen for thousands of topics for various reasons; a system admin
> can not manually fix thousands of topics. In that case, we depend on the
> broker's resiliency to recover and make the topics available.
> - So, let's be honest here: depending on manual cleanup will not help, and
> we will end up with the same issues that we are discussing when the
> broker handles recovery.
>
>
> >> Makes sense, just wondering if the behavior should only be enabled when
> >> config.isSchemaLedgerForceRecovery() == true (PIP-327)?
>
> Yes, we can also do this; at least large-scale systems with millions of
> topics can use this feature to build resilience into the system. I am sure
> this will bite many users and, as usual, we will see this PR proposed again
> by someone in the future. Until then, I have created a PR that resolves
> this disagreement using your suggestion:
> PR: https://github.com/apache/pulsar/pull/23428
>
> Thanks,
> Rajan
>
> On Wed, Oct 9, 2024 at 4:02 AM Lari Hotari <lhot...@apache.org> wrote:
>
> > Makes sense, just wondering if the behavior should only be enabled when
> > config.isSchemaLedgerForceRecovery() == true (PIP-327)?
> >
> > -Lari
> >
> > On 2024/10/09 04:51:23 Rajan Dhabalia wrote:
> > > >> When the schemas of a topic are lost, all of the messages in the
> > > >> topic can not be consumed successfully, and producers can not publish
> > > >> messages anymore.
> > >
> > > Well, many issues have already been created about lost schema ledgers,
> > > so it is a well-known problem that the topic becomes unavailable and
> > > producers can no longer publish messages purely because of server-side
> > > issues. That is extremely critical for mission-critical use cases and
> > > not acceptable. Because of that, brokers must be fault tolerant and
> > > recover when ledgers are missing. The broker already handles such
> > > issues for managed-ledger and managed-cursor ledgers, and it should
> > > handle them for the schema ledger as well, so that the broker does not
> > > cause topic unavailability. It is also very important not to rely
> > > solely on operational effort to fix such unavailability manually; the
> > > broker should have a mechanism to recover by itself.
> > > Therefore, this PR is important to make sure tenants don't face topic
> > > unavailability due to the well-known issue of missing schema ledgers.
> > >
> > > Thanks,
> > > Rajan
> > >
> > > On Tue, Oct 8, 2024 at 8:29 PM Yubiao Feng
> > > <yubiao.f...@streamnative.io.invalid> wrote:
> > >
> > > > Hi all
> > > >
> > > > Background: When the schemas of a topic are lost, all of the messages
> > > > in the topic can not be consumed successfully, and producers can not
> > > > publish messages anymore. This mechanism alerts users to try to
> > > > recover their schemas or recreate their topics.
> > > >
> > > > https://github.com/apache/pulsar/pull/23395 added a patch: producers
> > > > will rebuild schemas if the original schemas were lost, which maps
> > > > the old schema and the new schema to the same schema ID. For
> > > > example:
> > > > - send M1 with schema `Int32`, get schema id: `1`
> > > > - send M2 with schema `String`, get schema id: `2`
> > > > - schemas are lost
> > > > - send M3 with schema `String`, get schema id `1`
> > > >
> > > > The messages `M1` and `M3` use different schemas but have the same
> > > > schema ID, yet users assume everything is fine, which is dangerous.
> > > > So I want to revert this PR.
> > > >
> > > > Thanks
> > > > Yubiao Feng
> > > >
> > >
> >
>
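
As a rough illustration of the scenario discussed in this thread, the toy
model below shows how recreating a lost schema ledger can hand out an old
schema ID again (`M1` and `M3` both get ID `1`), and how a flag can gate
that recovery. This is only a sketch and is not Pulsar's schema registry
code: `ToySchemaRegistry`, its methods, and the `forceRecovery` field are
hypothetical names, and the field merely stands in for
`config.isSchemaLedgerForceRecovery()`. IDs start at 1 only to match the
example in the original message.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT Pulsar's schema registry. All names are hypothetical.
class ToySchemaRegistry {
    // Stands in for config.isSchemaLedgerForceRecovery() (assumption).
    private final boolean forceRecovery;
    private List<String> schemas = new ArrayList<>();
    private boolean ledgerLost = false;

    ToySchemaRegistry(boolean forceRecovery) {
        this.forceRecovery = forceRecovery;
    }

    // Simulates a non-recoverable loss of the schema ledger.
    void loseLedger() {
        ledgerLost = true;
    }

    // Returns the ID of the schema, registering it first if it is unknown.
    int register(String schema) {
        if (ledgerLost) {
            if (!forceRecovery) {
                // Strict behavior: the topic stays unavailable until fixed.
                throw new IllegalStateException("schema ledger lost");
            }
            // Forced recovery: start a fresh ledger; old IDs are forgotten.
            schemas = new ArrayList<>();
            ledgerLost = false;
        }
        int index = schemas.indexOf(schema);
        if (index < 0) {
            schemas.add(schema);
            index = schemas.size() - 1;
        }
        return index + 1; // IDs start at 1, as in the thread's example
    }

    public static void main(String[] args) {
        ToySchemaRegistry registry = new ToySchemaRegistry(true);
        int m1 = registry.register("Int32");
        int m2 = registry.register("String");
        registry.loseLedger();
        int m3 = registry.register("String");
        // Prints: M1=1 M2=2 M3=1
        System.out.println("M1=" + m1 + " M2=" + m2 + " M3=" + m3);
    }
}
```

With `new ToySchemaRegistry(false)`, the third `register` call throws
instead of silently reusing ID `1`, which corresponds to the behavior
before PR 23395 that the compromise in PR 23428 keeps as the default
unless the flag is set.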
