Jake Maes created SAMZA-964:
-------------------------------
Summary: Improve the performance of the continuous OFFSET
checkpointing for logged stores
Key: SAMZA-964
URL: https://issues.apache.org/jira/browse/SAMZA-964
Project: Samza
Issue Type: Bug
Reporter: Jake Maes
Assignee: Jake Maes
SAMZA-905 added the capability to write the OFFSET file on every commit().
Unfortunately, its performance was a hindrance for one of our larger jobs at
LinkedIn. The job has 10 stores, each with hundreds of partitions in its
changelog topic. The performance problem came from the
KafkaSystemAdmin.getSystemStreamMetadata() method, which:
1. Periodically refetches the topic metadata
2. Always fetches offsets twice (oldest and upcoming) for every partition
Calling this method to fetch the offsets for just a couple of tasks is wasteful.
Metadata should only be refetched if there's a problem; doing it periodically
doesn't help. The total number of offset fetches is S*2*T^2, where S is the
number of stores and T is the number of tasks/changelog partitions. Since we
only need the newest offset, this should require only S*T offset requests.
Ideally, we'd also parallelize these requests, but that will be an exercise for
another time.
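To give a feel for the gap between the two counts, here is a rough
back-of-the-envelope sketch. S = 10 comes from the job described above; T = 200
is only an assumed stand-in for "hundreds" of partitions, and the function names
are made up for illustration. The T^2 term is read here as every task fetching
offsets for every partition, which is an interpretation of the description
rather than something measured.

```python
def offset_fetches_before(stores: int, tasks: int) -> int:
    """Offset fetches per the S*2*T^2 estimate: for each store, each task
    fetches 2 offsets (oldest, upcoming) for every one of the T partitions."""
    return stores * 2 * tasks * tasks


def offset_fetches_after(stores: int, tasks: int) -> int:
    """Offset fetches when only the newest offset is fetched once per
    store and partition, i.e. S*T."""
    return stores * tasks


# S = 10 is from the issue text; T = 200 is an assumed example value.
S, T = 10, 200
before = offset_fetches_before(S, T)
after = offset_fetches_after(S, T)
print(f"before={before} after={after} reduction={before // after}x")
```

The reduction factor is 2*T regardless of S, so the saving grows with the
number of changelog partitions.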
The fix has 3 components:
1. Cache metadata more aggressively. Only expire metadata if we get a Kafka
NotLeaderForPartitionException.
2. Reduce excessive offset fetching by fetching only the newest offset.
3. Do not allow unbounded exponential backoff for offset checkpointing; if the
retries are exhausted, just skip the offset file. Exponential backoff can
balloon the commit time and stall the event loop, so we will only retry up to
3 times, for a max delay of 400ms.
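Components 1 and 3 can be sketched roughly as follows. This is not the actual
patch: the class and function names are hypothetical, NotLeaderError is a
stand-in for Kafka's NotLeaderForPartitionException, and the 100ms initial
delay with doubling is an assumption chosen so that 3 retries end at a 400ms
delay.

```python
import time
from typing import Any, Callable, Dict, Optional


class NotLeaderError(Exception):
    """Hypothetical stand-in for Kafka's NotLeaderForPartitionException."""


class MetadataCache:
    """Keeps topic metadata until a caller explicitly invalidates it,
    instead of expiring it on a timer."""

    def __init__(self, load: Callable[[str], Any]) -> None:
        self._load = load
        self._entries: Dict[str, Any] = {}

    def get(self, topic: str) -> Any:
        # Load only on a cache miss; no periodic refetch.
        if topic not in self._entries:
            self._entries[topic] = self._load(topic)
        return self._entries[topic]

    def invalidate(self, topic: str) -> None:
        self._entries.pop(topic, None)


def fetch_with_bounded_retry(
    fetch: Callable[[], Any],
    on_not_leader: Callable[[], None] = lambda: None,
    max_retries: int = 3,
    initial_delay_ms: int = 100,  # assumed value
    max_delay_ms: int = 400,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Try fetch() once plus at most max_retries retries. Returns None when
    every attempt fails, so the caller can skip writing the OFFSET file."""
    delay_ms = initial_delay_ms
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except NotLeaderError:
            on_not_leader()  # e.g. expire the cached metadata
        except Exception:
            pass  # other failures are retried without touching the cache
        if attempt < max_retries:
            sleep(min(delay_ms, max_delay_ms) / 1000.0)
            delay_ms *= 2
    return None
```

A caller would pass something like `lambda: cache.invalidate(topic)` as
`on_not_leader` and treat a `None` result as "skip the OFFSET file for this
commit" rather than blocking the event loop.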
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)