Jake Maes created SAMZA-964:
-------------------------------

             Summary: Improve the performance of the continuous OFFSET 
checkpointing for logged stores
                 Key: SAMZA-964
                 URL: https://issues.apache.org/jira/browse/SAMZA-964
             Project: Samza
          Issue Type: Bug
            Reporter: Jake Maes
            Assignee: Jake Maes


SAMZA-905 added the capability to write the OFFSET file on every commit().

Unfortunately, the performance was a hindrance for one of our larger jobs at 
LinkedIn. The job has 10 stores, each with hundreds of partitions in its 
changelog topic. The performance problem came from the 
KafkaSystemAdmin.getSystemStreamMetadata() method, which:
1. Periodically refetches the topic metadata
2. Always fetches offsets twice (oldest, upcoming) for every partition

Calling this method to fetch the offsets for just a couple of tasks is 
wasteful. Metadata should only be fetched if there's a problem. Doing it 
periodically doesn't help. The total number of offset fetches is S*2*T^2, 
where S is the number of stores and T is the number of tasks/changelog 
partitions. Since we only need the newest offset, this should require only 
S*T offset requests. Ideally, we'd also parallelize these requests, but that 
will be an exercise for another time. 
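
To make the gap concrete, here is a rough sketch of the arithmetic. It reads 
the S*2*T^2 formula as: each of T tasks, for each of S stores, fetches 2 
offsets for all T partitions. The class and method names are hypothetical, and 
T=100 is an assumed value, since the description only says "hundreds".

```java
public class OffsetFetchCount {
    // Current behavior, as read from the formula S*2*T^2:
    // every task fetches two offsets for every partition of every store.
    static long fetchesBefore(long stores, long tasks) {
        return stores * 2 * tasks * tasks;
    }

    // Target behavior: one newest-offset fetch per store per task.
    static long fetchesAfter(long stores, long tasks) {
        return stores * tasks;
    }

    public static void main(String[] args) {
        long stores = 10;  // from the issue description
        long tasks = 100;  // assumption: "hundreds" of partitions, lower bound
        System.out.println("before: " + fetchesBefore(stores, tasks)); // before: 200000
        System.out.println("after:  " + fetchesAfter(stores, tasks));  // after:  1000
    }
}
```

Under these assumptions the reduction is a factor of 2*T, so it grows with the 
number of tasks.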

The fix has 3 components:
1. Cache metadata more aggressively. Only expire metadata if we get a Kafka 
NotLeaderForPartitionException.
2. Reduce excessive offset fetching. 
3. Do not allow unbounded exponential backoff for offset checkpointing; if the 
retries fail, just skip the OFFSET file. Exponential backoff can balloon the 
commit time and stall the event loop, so we will only retry up to 3 times, for 
a max delay of 400ms.
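
The bounded backoff in item 3 could look roughly like the sketch below. This 
is not the actual patch: the BoundedRetry class, the tryFetch method, and the 
50ms initial delay are all hypothetical, and the description does not say how 
the 400ms is split across retries. The sketch only shows the idea of capping 
both the retry count and the total sleep time, then giving up so the caller 
can skip writing the OFFSET file.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

public class BoundedRetry {
    // Calls fetch once, then retries at most maxRetries times with doubling
    // delays. Returns empty if every attempt fails or if the next sleep would
    // push the total delay past maxTotalDelayMs. A null result is also
    // reported as empty.
    static <T> Optional<T> tryFetch(Callable<T> fetch, int maxRetries,
                                    long initialDelayMs, long maxTotalDelayMs) {
        long delayMs = initialDelayMs;
        long sleptMs = 0;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return Optional.ofNullable(fetch.call());
            } catch (Exception e) {
                boolean outOfRetries = attempt == maxRetries;
                boolean overBudget = sleptMs + delayMs > maxTotalDelayMs;
                if (outOfRetries || overBudget) {
                    return Optional.empty(); // caller skips the OFFSET file
                }
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return Optional.empty();
                }
                sleptMs += delayMs;
                delayMs *= 2;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulated fetch that always fails; the values 3, 50 and 400 are
        // illustrative (50 + 100 + 200 = 350ms of sleep, under the 400ms cap).
        Optional<Long> offset = tryFetch(() -> {
            throw new IllegalStateException("simulated fetch failure");
        }, 3, 50L, 400L);
        System.out.println(offset.isPresent() ? "write OFFSET file" : "skip OFFSET file");
    }
}
```

Returning an empty result instead of throwing keeps the commit path moving: a 
missed OFFSET file write can be retried on the next commit.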



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
