[ https://issues.apache.org/jira/browse/KAFKA-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562107#comment-14562107 ]
Ewen Cheslack-Postava commented on KAFKA-2168: ---------------------------------------------- Agreed, it's a matter of tradeoffs. [~nehanarkhede], I think the other tradeoff you didn't mention is which one is easier for the user of the API. The non-threadsafe version definitely leaves more work to the user (some of which, like correctly choosing timeouts, we know is error prone since we've had related bugs in the producer). close() should be straightforward if we go for a non-threadsafe approach -- just need to set a flag, wake up the selector, and make sure all the consumer classes obey that flag by exiting immediately. I think if we go that route, KAFKA-2123 (at a minimum adding callbacks to async commits) becomes critical since the expectation is that you handle commits in the same thread as poll(). Any unavailability of the coordinator could block processing indefinitely while waiting for a commit if you need to know that it actually completes. I think it would also require a wakeup() mechanism that doesn't exist in the public API yet. You can't always predict the necessary timeout, e.g. if you want to be able to respond to external events such as a network message. close() and wakeup() should be the only two methods that actually require thread safety guarantees. > New consumer poll() can block other calls like position(), commit(), and > close() indefinitely > --------------------------------------------------------------------------------------------- > > Key: KAFKA-2168 > URL: https://issues.apache.org/jira/browse/KAFKA-2168 > Project: Kafka > Issue Type: Bug > Components: clients, consumer > Reporter: Ewen Cheslack-Postava > Assignee: Jason Gustafson > > The new consumer is currently using very coarse-grained synchronization. For > most methods this isn't a problem since they finish quickly once the lock is > acquired, but poll() might run for a long time (and commonly will since > polling with long timeouts is a normal use case). This means any operations > invoked from another thread may block until the poll() call completes. > Some example use cases where this can be a problem: > * A shutdown hook is registered to trigger shutdown and invokes close(). It > gets invoked from another thread and blocks indefinitely. > * User wants to manage offset commit themselves in a background thread. If > the commit policy is not purely time based, it's not currently possibly to > make sure the call to commit() will be processed promptly. > Two possible solutions to this: > 1. Make sure a lock is not held during the actual select call. Since we have > multiple layers (KafkaConsumer -> NetworkClient -> Selector -> nio Selector) > this is probably hard to make work cleanly since locking is currently only > performed at the KafkaConsumer level and we'd want it unlocked around a > single line of code in Selector. > 2. Wake up the selector before synchronizing for certain operations. This > would require some additional coordination to make sure the caller of > wakeup() is able to acquire the lock promptly (instead of, e.g., the poll() > thread being woken up and then promptly reacquiring the lock with a > subsequent long poll() call). -- This message was sent by Atlassian JIRA (v6.3.4#6332)