Hi Pulsar Community,

Below are the meeting notes from today's community meeting.

Disclaimer: I am the primary author of these notes. I took the notes while participating in the meeting discussions. It is possible that I missed or misunderstood information. If something is misattributed or misrepresented, please send a correction to this list and consider updating the Google doc.

Source Google doc: https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE

Thanks,
Michael

2022/04/14, (8:30 AM PST)

- Attendees:
  - Matteo Merli
  - Enrico Olivelli
  - Andrey Yegorov
  - Michael Marshall
  - Dave Fisher
  - Lari Hotari
  - Massimiliano Mirelli
  - Chris Bartholomew
  - Hang Chen
  - Aaron Williams
  - Nicolò Boschi
  - Leolinchen
  - Penghui Li
- Discussions
  - Enrico: 2.10 release process. It took a while. Do we want to talk about this? For 2.11, we should try to apply the new process.
    Matteo: 3 months from now we can release 2.11; we’ll create the branch in 2 months. Matteo plans to set a date (by discussion on the mailing list) and wants more scrutiny on the mailing list.
    Dave: we should slow down cherry-picking to 2.8 and 2.9 as well.
    Enrico: we are finding many fixes, though. For example, 2.8 has many users and many bug fixes. The cherry-picked commits are all bug fixes.
    Michael: we should add some documentation about this to help new committers.
    Matteo: this documentation would help inform contributors too.
    Dave: where should we put this? Website?
    Matteo: we could also put it in the PR template.
  - Michael: is 2.7.5 the last 2.7 release?
    Matteo: we could keep it open for security bug fixes, like Log4Shell-type fixes.
    Lari: 2.7.5 rc 1 has test failures, so we’ll need an rc 2. The tests that are failing on 2.7.5 are passing on 2.7.4.
    Matteo: thinking through LTS and the cost to users of doing upgrades. There is a tension between shipping new features and how frequently users have to upgrade. One issue: upgrade/downgrade compatibility is only guaranteed for one minor version. An LTS could help to support those users without adding features.
    We could offer guarantees from one LTS to the next LTS. We’d define support so users could stick with a version without worrying about getting left behind. What if 3.0, 4.0, and so on were LTS releases, and the 3.x releases were just for features? The guarantee then is that you can go from 3.x to 4.0.
    Dave: what about current users on the 2.x versions?
    Matteo: we can discuss how to deal with existing versions, but we also need to figure out our preferred long-term solution for how to work in the future.
    Dave: I like the idea of guaranteeing upgrade paths.
    Matteo: we could try to set a timeline for major releases, not just for minor releases, e.g. a major release every 2 years. (Discusses reasons for major releases and the nuances of how we could use them.)
    Dave: are the BookKeeper upgrade and transactions the major upgrade?
    Matteo: I didn’t have any feature in mind. I want to give people an upgrade path and create clarity.
    Michael: clarifies that you could upgrade from 3.0 to 4.0, then downgrade, and it’d work.
    Matteo: yes. Feature defaults won’t be able to change because of this.
    Dave: this relates well to creating a road map and telling people what is coming.
    Enrico: creating a road map is very hard in open source. We commit things that people contribute. In the ASF projects that I work on, contributions are hard to predict.
    Matteo: I agree it is hard to know. These major releases would be loosely timed. For example, auto partitioning is a major feature, but it is a bunch of work. Unpredictability is bad for the users.
    Michael: and you don’t want to create a hard upgrade path. Is it possible to use geo-replication (or something like it) to migrate clusters to simplify upgrades?
    Matteo: there was a work-in-progress proposal for blue-green deployments: spin up a new cluster and slowly migrate producers and consumers to it. The coordination would use topic termination to switch to the new cluster. Not sure that it is a general solution.
    Michael: how would breaking changes work for the major version upgrade?
    Matteo: we would do a compatibility layer. Also, the Pulsar protocol hasn’t broken, and we version the API in such a way that the broker and client can determine whether the peer supports a given feature.
- PRs
  - Lari: merged PR (https://github.com/apache/pulsar/pull/15067) to fix ManagedCursorImpl’s mark delete update logic, but asked for Matteo’s review. Lari plans to add more tests in the coming weeks to catch regressions associated with the change.
  - Andrey: https://github.com/apache/pulsar/pull/15142 WIP pulsar + bk 4.15-ish. Requests review of preliminary work and mentions that there is a test failure he’s still investigating. He switched CI to use BookKeeper 4.16-SNAPSHOT to identify needed changes and worked on the tests that broke. Some test classes were copied from BookKeeper, so he replaced those with freshly copied versions. The work is iterative, and there are still tests failing. Discussion with Matteo about tradeoffs for test base classes and ways to improve testing classes.
    Matteo: don’t worry about synchronizing tests between Pulsar and BookKeeper. The test utilities in BookKeeper are different. Pulsar’s tests assume that BookKeeper works and are meant to test Pulsar’s usage of BookKeeper.
    Matteo: how far do you think you are from completion?
    Andrey: hard to say. Tests are passing locally but failing on remote CI.
  - Hang: https://github.com/apache/pulsar/issues/15111 Bookie lost data when skip write journal. Hang Chen says he has seen this many times in production.
    Enrico: if you don’t write to the journal, this is a possible behavior. The next BookKeeper release will include a code change.
    Andrey: if you want to run without the journal, increase the write quorum.
    Matteo: use different racks to increase durability and decrease the chance of catastrophic failure.
    Enrico: there are some problems in the bk protocol: even if you have multiple replicas, you are going to lose data.
    4.15 includes a change to the protocol for how the bookie responds. This improves a fix for a specific edge case. The only fix is to upgrade.
    Andrey: reminder that 4.15 is in the process of being released.
    Matteo: is there any failure that happened during this time?
    Hang Chen: no failure during this time.
    Enrico: during recovery, the recovery tries to find missing entries in the ledger. (Went on to discuss technical details of the improvement for 4.15.)
    Matteo: the error appears strange, and the missing entries don’t seem to make sense. Mentions that rebuilding the index could be helpful.
    (Missed some technical details about BookKeeper; see the issue for more context and discussion.)
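
To illustrate the write quorum suggestion in the last item, here is a minimal sketch of how BookKeeper's three replication settings relate to each other. The `QuorumConfig` class and the example values are hypothetical and are not Pulsar or BookKeeper code. The sketch also assumes that every bookie that acknowledges an entry has stored it durably, which is the assumption that skipping the journal weakens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    """Hypothetical helper; field names follow BookKeeper's terminology."""

    ensemble_size: int  # bookies a ledger is spread across
    write_quorum: int   # bookies each entry is sent to
    ack_quorum: int     # acknowledgements needed before a write is confirmed

    def __post_init__(self) -> None:
        # BookKeeper requires ensemble >= write quorum >= ack quorum >= 1.
        if not (self.ensemble_size >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError(
                "need ensemble_size >= write_quorum >= ack_quorum >= 1"
            )

    def copies_attempted(self) -> int:
        # Each entry is sent to write_quorum bookies.
        return self.write_quorum

    def copies_guaranteed(self) -> int:
        # A confirmed write is only guaranteed to be on ack_quorum bookies,
        # and only if those bookies persisted it (assumption, see above).
        return self.ack_quorum


# Illustrative values only; they were not discussed in the meeting.
before = QuorumConfig(ensemble_size=2, write_quorum=2, ack_quorum=2)
after = QuorumConfig(ensemble_size=3, write_quorum=3, ack_quorum=2)

for name, cfg in (("before", before), ("after", after)):
    print(name, "attempted:", cfg.copies_attempted(),
          "guaranteed:", cfg.copies_guaranteed())
```

In Pulsar these values are typically set through the broker's managed ledger defaults or per-namespace persistence policies. As Enrico noted, extra replicas do not remove the protocol issue discussed in the last item, so this complements the upgrade and does not replace it.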