merlimat commented on PR #25929:
URL: https://github.com/apache/pulsar/pull/25929#issuecomment-4625601105

   Thanks for the very thorough review — went through every finding against the 
code; all were legitimate. Addressed in 6acca7f.
   
   **On the design question (can `parallelism` change after bring-up?):** No, 
and your instinct about aborted transactions is exactly the additional 
constraint. `TxnID.mostSigBits` encodes the coordinator, so changing 
parallelism strands existing transaction ids. And because an aborted 
transaction's records are retained as long as its participating messages are, 
the value can only ever be *reduced* once every transaction created under the 
previous value has been fully cleaned up. I've documented this on the config 
and now enforce cross-broker agreement (see #5 below).
   
   **Findings:**
   
   1. **[BUG] follower-snapshot race** — confirmed. `buildAssignmentsSnapshot` 
used the cache-only `getLeaderValueIfPresent()`; switched to the async 
`getLeaderValue()` (loads from the store on a cache miss), so a 
freshly-`Following` broker no longer omits the partition. Added a short re-push 
from `ServerCnx` when a snapshot is incomplete or fails to build, so the client 
converges without relying on a later leadership change. Note: the deepest 
3+-broker scenario isn't reproducible by the 2-broker integration test, so I'm 
relying on the mechanism fix + re-push rather than an e2e proof — flagging that 
honestly.
   
   2. **[BUG] watch dies silently** — confirmed both paths. `attach()` on 
reconnect without the flag, and post-initial `onError()` (you're right that 
`ClientCnx` already removed the session so no `connectionClosed()` fires) now 
both `scheduleReconnect()`.
   
   3. **[BUG] one-shot initial open** — confirmed. The initial open now retries 
retryable failures up to the lookup deadline instead of failing `startAsync()`.
   
   4. **[QUALITY] `new URI(null)` NPE** — confirmed it's an NPE not 
`URISyntaxException`. `HandlerState.setRedirectedClusterURI` now throws a clear 
`URISyntaxException` for a blank selected URL, and `onSnapshot` isolates 
per-partition failures.
   
   5. **[QUALITY] parallelism validation/consistency** — added `minValue=1`; 
the value is persisted at `/txn/tc/parallelism` on first start and every broker 
fails fast on mismatch. New unit test.
   
   6. **[QUALITY] test can't prove the new path** — fixed; the integration test 
now asserts `isUsingMetadataDiscovery()` so a watch-path regression can't pass 
via the silent assign-topic fallback.
   
   7. **[minor]** `handlers()` now reads the field into a local; the #1 re-push 
also bounds the startup snapshot burst.
   
   All affected unit tests + checkstyle pass locally; the integration test 
compiles and is wired into the TRANSACTION suite.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to