lakshmi-manasa-g opened a new pull request #1580:
URL: https://github.com/apache/samza/pull/1580


   **Feature:** Elasticity (SAMZA-2687) for a Samza job allows job to have more 
tasks than the number of input SystemStreamPartition(SSP). Thus, a job can 
scale up beyond its input partition count without needing the repartition the 
input stream.
   This is achieved by having elastic tasks which is the same as a task for all 
practical purposes. But an elastic task consumes only a subset of the messages 
of an SSP.
   With an elasticity factor F (integer), the number of elastic tasks will be F 
times N with N = original task count.
   The F elastic tasks per original task all consume subsets of same SSP as the 
original task. There will be F subsets (aka key bucket) per SSP and a message 
falls into an SSP bucket ā€˜i’ if its message.key.hash()%F == i.
   
   **Previous PR** = https://github.com/apache/samza/pull/1576 (but that commit 
is part of this pr too.. will get removed when first pr merged and this pr 
branch is rebased)
   
   **Changes:**
   
   1. update SSP groupers GroupByPartition and GroupBySystemStreamPartition to 
generate F X N (elastic) tasks where F = elasticity factor and N = original 
(without elasticity) task count
   2. update SamzaContainer and RunLoopFactory to pass the elasticity factor to 
RunLoop and TaskInstance
   3. Update RunLoop to use elasticity factor based keyBucket of the input SSP 
of the incoming envelope to find the (elastic) task for the envelope.
   4. Update TaskInstance to use SSP with KeyBucket for its book keeping of 
SSPs caught up and so on.
   5. Update SystemConsumers to shield all the underlying SystemConsumer of the 
input systems from the KeyBucket concept. aka SystemConsumers 
(ConsumerMultiplexer) takes the SSP with KeyBucket and passes down only the SSP 
(without keyBucket) to the system consumer.
   
   Currently, not supported when elasticity is enabled ->
   
   1. checkpoints and prev offsets (aka elasticity currently only works with 
reset offset), broadcast stream, ssp groupers other than GroupBySSP and 
GroupByPartition.
   2. Metrics - task and ssp level metrics. container might not report correct 
values (for ex, chose_object will account for all messages in the SSP but 
should correspond to only the messages with the SSP KeyBucket processed by the 
container)
   Additional PRs will be raised for this.
   
   **Tests:** <TBD> adding new tests
   
   **API Changes:** no public API changes
   **Upgrade Instructions:** N/A
   **Usage Instructions:** set the config job.elasticity.factor > 1 to enable 
elasticity for the job.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to