[ https://issues.apache.org/jira/browse/KUDU-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992838#comment-16992838 ]
Alexey Serbin commented on KUDU-3016:
-------------------------------------

[~adar] thank you for pointing out KUDU-1125. Yes, I think we should take into account a couple of things there:
# Maximum size of an RPC for masters (let's assume all masters have the same setting for {{rpc_max_message_size}})
# Maximum size of a Raft batch for masters (let's assume all masters have the same setting for {{consensus_max_batch_size_bytes}}). This is already applied by the consensus queue logic when pushing updates to followers, but the setting takes effect only from the second batch onward (i.e. the first batch might be as large as it gets: https://github.com/apache/kudu/blob/22d1f66ed1b9ae70a0118fdb6d645e1899878442/src/kudu/consensus/log_cache.cc#L309-L367).

For the second item, we might want to rethink the default setting of the flag.

> Catalog manager: don't lump together all updates from one tablet report
> -----------------------------------------------------------------------
>
>                 Key: KUDU-3016
>                 URL: https://issues.apache.org/jira/browse/KUDU-3016
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: scalability
>
> With the current structure of the system tablet for rows storing metadata
> information on tablets, the catalog manager can create a very large write
> operation on the system tablet when processing full tablet reports sent from
> tablet servers. At some point (depending on the {{\-\-rpc_max_message_size}}
> setting), a tablet report received from a tablet server comes through, but
> its Raft counterpart for the system tablet update doesn't because it might be
> almost two times larger. If that happens, the Kudu cluster becomes almost
> non-functional because of a self-perpetuating
> accepted-huge-tablet-report-but-cannot-push-Raft-update-to-follower-masters
> pattern.
> The catalog manager should not lump together updates on all tablets received
> from one tablet server:
> https://github.com/apache/kudu/blob/3175c35c7d721aef0c4c6b358cc3b422089c1ba7/src/kudu/master/catalog_manager.cc#L4268-L4274
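
To illustrate the idea of not lumping all updates from one tablet report into a single write, here is a minimal Python sketch of size-bounded batching. It is not Kudu code: the {{TabletUpdate}} type, the {{split_into_batches}} function, and the byte sizes are hypothetical and chosen only for illustration. In a real deployment the limit would presumably be derived from settings such as {{rpc_max_message_size}} and {{consensus_max_batch_size_bytes}}, but how exactly is an assumption here.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical representation of one tablet's metadata update:
# (tablet_id, serialized_size_in_bytes). Not a Kudu type.
TabletUpdate = Tuple[str, int]


def split_into_batches(
    updates: Iterable[TabletUpdate], max_batch_bytes: int
) -> Iterator[List[TabletUpdate]]:
    """Group per-tablet updates into batches of at most max_batch_bytes.

    An update that is larger than the limit on its own is still emitted,
    alone in its own batch, so that no update is silently dropped.
    """
    batch: List[TabletUpdate] = []
    batch_bytes = 0
    for update in updates:
        _, size = update
        # Close the current batch if adding this update would exceed the limit.
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(update)
        batch_bytes += size
    if batch:
        yield batch


# Illustrative numbers only: ten updates of 400 bytes each, 1000-byte limit.
updates = [(f"tablet-{i}", 400) for i in range(10)]
batches = list(split_into_batches(updates, max_batch_bytes=1000))
for i, b in enumerate(batches):
    print(i, [tablet_id for tablet_id, _ in b], sum(size for _, size in b))
```

With a scheme like this, each write to the system tablet stays under the chosen limit regardless of how many tablets a single report covers, at the cost of issuing several smaller writes instead of one large one.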