Hi Rajan, Lari

> Until then I have created a PR to resolve this disagreement with your
> suggestion:
> PR: https://github.com/apache/pulsar/pull/23428

Makes sense. I think we can close the discussion.

Thanks
Yubiao Feng

On Thu, Oct 10, 2024 at 4:17 AM Rajan Dhabalia <dhabalia...@gmail.com> wrote:

> Let me answer each question, though most of them were already covered in
> the previous reply.
>
> >> Firstly, could you submit an issue that you encountered a ledger lost
> >> issue, and we should solve the issue
>
> There are multiple issues already created about the missing schema ledgers.
> That can happen in the system for multiple reasons, but the question is how
> to handle it. For your reference, below are a few of the open issues caused
> by missing schema ledgers; duplicates keep getting created because of it:
> https://github.com/apache/pulsar/issues/20414
> https://github.com/apache/pulsar/issues/14533
> https://github.com/apache/pulsar/issues/15267
> https://github.com/apache/pulsar/issues/5792
>
> >> Secondly, the ledger containing user messages will also be lost if the
> >> schema ledger is lost.
>
> Yes, of course, and that's why we have made brokers resilient enough to
> handle this situation. The same handling should be available for a broken
> schema ledger as well.
>
> >> Thirdly, after the PR, users will encounter an issue that is worse than
> >> before. They can not consume the original messages before the schema
> >> ledger is broken. After the PR they can not consume the messages
> >> continuously published.
>
> I don't think that is true. Previously connected consumers with the correct
> schema will fail to connect to the broker due to the topic's unavailability,
> but once the broker handles it with its resilient nature, the topic will be
> available again and consumers should be able to reconnect and consume the
> data.
>
> Let's take a step back and understand the issue fundamentally without
> jumping in with predefined notions:
>
> - While handling schema retrieval, the broker handles failure with
>   recoverable/non-recoverable exceptions.
> - If the error is non-recoverable, then due to the broker's resiliency, the
>   broker considers the schema already lost and should move forward to
>   create a new schema ledger with a new producer/consumer connected.
> - Now, suppose for a second that brokers do not recover and instead let the
>   system admin manually delete the schema metadata node so that the broker
>   can recreate the schema. That is the same thing the broker did during
>   auto-recovery. We will face the same issues we are discussing in this
>   email thread if we clean up the schema metadata node manually. Manual
>   cleanup doesn't make any difference to the outcome, but it does make a
>   difference in operational effort. We are running millions of topics on
>   the Pulsar cluster, and we saw this issue happening for thousands of
>   topics for various reasons; a system admin can not manually fix all
>   those thousands of topics. In that case, we depend on the broker's
>   resiliency to recover and make the topic available.
> - So, let's be honest here: depending on manual cleanup will not help, and
>   we will end up with the same issues that we are talking about if the
>   broker handles recovery.
>
> >> Makes sense, just wondering if the behavior should only be enabled when
> >> config.isSchemaLedgerForceRecovery() == true (PIP-327)?
>
> Yes, we can also do this; at least large-scale systems with millions of
> topics can use this feature to build resilience into the system. I am sure
> this will bite many users and, as usual, we will see this PR again from
> someone in the future. Until then I have created a PR to resolve this
> disagreement with your suggestion:
> PR: https://github.com/apache/pulsar/pull/23428
>
> Thanks,
> Rajan
>
> On Wed, Oct 9, 2024 at 4:02 AM Lari Hotari <lhot...@apache.org> wrote:
>
> > Makes sense, just wondering if the behavior should only be enabled when
> > config.isSchemaLedgerForceRecovery() == true (PIP-327)?
> >
> > -Lari
> >
> > On 2024/10/09 04:51:23 Rajan Dhabalia wrote:
> > > >> When the schemas of a topic are lost, all of the messages in the
> > > >> topic can not be consumed successfully, and producers can not
> > > >> publish messages anymore.
> > >
> > > Well, there are already many issues created related to the lost schema
> > > ledgers, so it is a very well known issue that the topic becomes
> > > unavailable and producers are not able to publish messages anymore just
> > > because of server issues, which is extremely critical for
> > > mission-critical use cases and not acceptable. Because of that, brokers
> > > must be fault tolerant to recover in case of missing ledgers. The broker
> > > already handles such issues for managed-ledger or managed-cursor
> > > ledgers, but it should also handle the schema ledger case and make sure
> > > that the broker won't impact the topic's availability. It is also very
> > > important not to rely only on operational effort to fix such
> > > unavailability issues manually; the broker should have a mechanism to
> > > recover by itself.
> > > Therefore, this PR is important to make sure tenants don't face topic
> > > unavailability due to well known issues of missing schema ledgers.
> > >
> > > Thanks,
> > > Rajan
> > >
> > > On Tue, Oct 8, 2024 at 8:29 PM Yubiao Feng
> > > <yubiao.f...@streamnative.io.invalid> wrote:
> > >
> > > > Hi all
> > > >
> > > > Background: When the schemas of a topic are lost, all of the messages
> > > > in the topic can not be consumed successfully, and producers can not
> > > > publish messages anymore. This mechanism alerts users to try to
> > > > recover their schemas or recreate their topics.
> > > >
> > > > https://github.com/apache/pulsar/pull/23395 added a patch: producers
> > > > will rebuild schemas if the original schemas were lost, which will mix
> > > > the old schema and new schema as the same schema ID.
> > > > For example:
> > > > - send M1 with schema `Int32`, get schema id: `1`
> > > > - send M2 with schema `String`, get schema id: `2`
> > > > - schemas are lost
> > > > - send M3 with schema `String`, get schema id `1`
> > > >
> > > > The messages `M1` and `M3` use different schemas but have the same
> > > > schema id, yet users assume all things are fine, which is dangerous.
> > > > So I want to revert this PR.
> > > >
> > > > Thanks
> > > > Yubiao Feng
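
For readers following the thread, the M1/M2/M3 scenario from the original message can be replayed with a small toy model. This is only a sketch under stated assumptions: `ToySchemaRegistry`, `register`, `loseSchemas`, and `replayScenario` are hypothetical names invented for illustration, the registry simply hands out sequential IDs starting at 1 to match the example, and none of it is Pulsar's actual schema storage code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the schema ID reuse described in the thread; not Pulsar code. */
public class SchemaIdCollisionSketch {

    /** Hypothetical in-memory registry that hands out sequential schema IDs. */
    static final class ToySchemaRegistry {
        private final List<String> schemas = new ArrayList<>();

        /** Returns the existing ID for a schema, or stores it and returns a new ID. */
        long register(String schema) {
            int idx = schemas.indexOf(schema);
            if (idx < 0) {
                schemas.add(schema);
                idx = schemas.size() - 1;
            }
            return idx + 1; // IDs start at 1 only to match the example in the thread
        }

        /** Simulates losing the schema ledger: every stored schema disappears. */
        void loseSchemas() {
            schemas.clear();
        }
    }

    /** Replays the M1/M2/M3 steps and returns message name -> "schema@id". */
    public static Map<String, String> replayScenario() {
        ToySchemaRegistry registry = new ToySchemaRegistry();
        Map<String, String> sent = new LinkedHashMap<>();
        sent.put("M1", "Int32@" + registry.register("Int32"));
        sent.put("M2", "String@" + registry.register("String"));
        registry.loseSchemas(); // schemas are lost, then rebuilt by the next producer
        sent.put("M3", "String@" + registry.register("String"));
        return sent;
    }

    public static void main(String[] args) {
        replayScenario().forEach((msg, schema) -> System.out.println(msg + " -> " + schema));
    }
}
```

In this toy model `M1` and `M3` both end up tagged with ID `1` even though they were written with `Int32` and `String` respectively, which is the mix-up the original message warns about. Whether a real broker reuses IDs in exactly this way depends on its schema storage, which this sketch does not model.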