zbentley commented on issue #12552:
URL: https://github.com/apache/pulsar/issues/12552#issuecomment-955748696


   The following issues were all observed during similar testing:
   https://github.com/apache/pulsar/issues/12557
   https://github.com/apache/pulsar/issues/12556
   https://github.com/apache/pulsar/issues/12555
   https://github.com/apache/pulsar/issues/12554
   https://github.com/apache/pulsar/issues/12553
   https://github.com/apache/pulsar/issues/12552
   https://github.com/apache/pulsar/issues/12551
   
   The condition that triggers these issues appears to be interacting with 
various Pulsar entities (e.g. creating/deleting things via the management 
API, or attempting to create consumers) *immediately after those entities were 
created* or *immediately after entities with the same name were deleted*.
   
   I think the number of issues observed points to a defect in the management 
API functionality in general. Considering the severity of these issues (in many 
cases it is possible to force a topic/namespace into a permanently corrupted 
state), I hope the general/common root cause can be resolved, rather than 
individual bug-inducing conditions being fixed one at a time.
   
   I suspect that the common root cause is that many management API operations 
are asynchronous when they should be synchronous.
   
   Ideally, the resolution of all of these issues would be the same: a 
management API operation--any operation--should not return successfully until 
all observable side effects of that operation across a Pulsar cluster 
(including brokers, proxies, bookies, and ZK) have completed. All caches of 
metadata (e.g. on all brokers/proxies in the cluster) related to the operation 
should be cleared, and all persistent state (including ledger deletion, bookie 
cleanup, ZooKeeper metadata, etc.) should be updated *during* management API 
operations, and not afterwards.
   
   If that means that management API operations take many seconds or minutes, 
that's still vastly preferable to not knowing when it is safe to interact with 
a cluster again after performing "DDL"-type changes.
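
   To illustrate what callers are left doing in the meantime, here is a 
minimal sketch of a client-side polling helper. It is only an assumption about 
how a test harness might cope, not a fix: `wait_until` and the predicate passed 
to it are hypothetical names, and no real Pulsar API is called here. The caller 
would supply whatever check they trust (for example, an admin lookup of the 
topic they just deleted).

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll predicate() until it returns True or timeout_s elapses.

    Hypothetical helper: the predicate stands in for any caller-supplied
    check that a management operation has visibly taken effect.
    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

   Even with a helper like this, a poll against one endpoint may not reflect 
cached metadata on other brokers or proxies, which is why having the management 
API itself block until the operation is complete would be preferable.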
   

