mynameborat opened a new pull request, #1679:
URL: https://github.com/apache/samza/pull/1679

   **Problem**: Intermediate stream partition count inference logic caps the 
partition size to 256 resulting in imbalances in work assignments to tasks
   
   **Description**: As part of the intermediate partition size inference logic, 
we currently employ the following algorithm.
   
   - partitionCount = Math.max(maxPartitionSize(inputStreams), 
maxPartitionSize(outputStreams))
   - cap the partitionCount to MAX_INFERRED_PARTITIONS defined in the 
`IntermediateStreamManager` which is 256
   - apply the inferred partition count to intermediate streams whose partition 
count is uninitialized
   
   The logic above always caps the partition size of intermediate streams to 
256 for all auto-created intermediate streams. This can prevent the job from 
scaling up uniformly as the intermediate partition assignment is capped to 256 
tasks thereby rendering other tasks imbalanced in case of number tasks > 256.
   
   **Changes**:
   - Apply the cap only for batch mode as 256 limit was introduced for batch 
mode where number of files (partition) could be large
   - Add unit tests for `IntermediateStreamManager`
   - Minor java doc fix for `DefaultTaskExecutorFactory`
   
   **Tests**: Added unit tests for the code changes
   
   **API Changes**: None
   
   **Upgrade Instructions**:
   - Jobs that are temporarily worked around this constraint by setting 
`job.intermediate.stream.partitions` should remove the configuration in order 
for samza to infer and apply the partition count as described above
   - Jobs that don't use `job.intermediate.stream.partitions` need no changes.
   
   **Usage Instructions**: Refer to upgrade instruction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to