Hi Brian,

Thanks for the KIP.
Starting the metadata fetch before we need the result is definitely a great
idea. This way, the metadata fetch can be done in parallel with all the other
work the producer is doing, rather than forcing the producer to periodically
come to a halt while metadata is fetched.

Maybe I missed it, but there seem to be some details missing here. When do we
start the metadata fetch? For example, if topic metadata expires every 5
minutes, perhaps we should wait 4 minutes, then start fetching the new
metadata, which we would expect to arrive by the 5-minute deadline. Or perhaps
we should start the fetch even earlier, around the 2.5-minute mark. In any
case, there should be some discussion of what the actual policy is. Given that
metadata.max.age.ms is configurable, maybe that policy needs to be expressed
as a percentage of the refresh period rather than as an absolute delay.

The KIP correctly points out that the current metadata fetching policy causes
us to "[block] in a function that's advertised as asynchronous." However, the
KIP doesn't seem to spell out whether we will continue to block if metadata
can't be found, or whether that blocking will be abolished. Clearly, starting
the metadata fetch early will reduce blocking in the common case, but will
there still be blocking in the uncommon case where the early fetch doesn't
succeed in time?

> To address (2), the producer currently maintains an expiry threshold for
> every topic, which is used to remove a topic from the working set at a
> future time (currently hard-coded to 5 minutes, this should be modified to
> use metadata.max.age.ms). While this does work to reduce the size of the
> topic working set, the producer will continue fetching metadata for these
> topics in every metadata request for the full expiry duration.
> This logic can be made more intelligent by managing the expiry from when
> the topic was last used, enabling the expiry duration to be reduced to
> improve cases where a large number of topics are touched intermittently.

Can you clarify this part a bit? It seems like we have a metadata expiration
policy for topics now, and we will still have one after this KIP, but they
will be somewhat different. It's not clear to me what the differences are.

In general, if load is a problem, we should probably consider adding some kind
of jitter on the client side. There are definitely cases where people start up
a lot of clients at the same time in parallel, and there is a thundering herd
problem with metadata updates. Adding jitter would spread the load across time
rather than creating a spike every 5 minutes.

best,
Colin

On Fri, Nov 8, 2019, at 08:59, Ismael Juma wrote:
> I think this KIP affects when we block, which is actually user-visible
> behavior. Right?
>
> Ismael
>
> On Fri, Nov 8, 2019, 8:50 AM Brian Byrne <bby...@confluent.io> wrote:
>
> > Hi Guozhang,
> >
> > Regarding metadata expiry, no access times other than the initial
> > lookup [1] are used for determining when to expire producer metadata.
> > Therefore, frequently used topics' metadata will be aged out and
> > subsequently refreshed (in a blocking manner) every five minutes, and
> > infrequently used topics will be retained for a minimum of five minutes
> > and currently refetched on every metadata update during that time
> > period. The sentence is suggesting that we could reduce the expiry time
> > to improve the latter without affecting (rather, slightly improving)
> > the former.
> >
> > Keep in mind that in almost all cases, I wouldn't anticipate much of a
> > difference in producer behavior, and the extra logic can be implemented
> > to have insignificant cost. It's the large/dynamic topic corner cases
> > that we're trying to improve.
> >
> > It'd be convenient if the KIP is no longer necessary.
> > You're right in that there are no public API changes and the behavioral
> > changes are entirely internal. I'd be happy to continue the discussion
> > around the KIP, but unless otherwise objected, it can be retired.
> >
> > [1] Not entirely accurate; it's actually the first time the client
> > calculates whether to retain the topic in its metadata.
> >
> > Thanks,
> > Brian
> >
> > On Thu, Nov 7, 2019 at 4:48 PM Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hello Brian,
> > >
> > > Could you elaborate a bit more on this sentence: "This logic can be
> > > made more intelligent by managing the expiry from when the topic was
> > > last used, enabling the expiry duration to be reduced to improve
> > > cases where a large number of topics are touched intermittently."
> > > Not sure I fully understand the proposal.
> > >
> > > Also, since this KIP does not make any public API changes and the
> > > behavioral changes are not considered a public API contract (i.e. how
> > > we maintain the topic metadata in the producer cache is never
> > > committed to users), I wonder if we still need a KIP for the proposed
> > > change any more?
> > >
> > > Guozhang
> > >
> > > On Thu, Nov 7, 2019 at 12:43 PM Brian Byrne <bby...@confluent.io> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I'd like to propose a vote for a producer change to improve producer
> > > > behavior when dealing with a large number of topics, in part by
> > > > reducing the amount of metadata fetching performed.
> > > >
> > > > The full KIP is provided here:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-526%3A+Reduce+Producer+Metadata+Lookups+for+Large+Number+of+Topics
> > > >
> > > > And the discussion thread:
> > > > https://lists.apache.org/thread.html/b2f8f830ef04587144cf0840c7d4811bbf0a14f3c459723dbc5acf9e@%3Cdev.kafka.apache.org%3E
> > > >
> > > > Thanks,
> > > > Brian
> > >
> > > --
> > > -- Guozhang