[ https://issues.apache.org/jira/browse/SAMZA-334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062418#comment-14062418 ]

Jakob Homan commented on SAMZA-334:
-----------------------------------

bq. Chris Riccomini I'm not entirely sure SAMZA-123 would have fixed the 
problem.

It could, assuming each high-volume topic is partitioned finely enough that 
one of its SSPs is roughly equal in message volume to a low-volume topic.

What we're seeing is a job consuming a large number of topics (say 800), most 
of which are partitioned 8 ways.  However, a few (say 10%) of those topics are 
such high volume that we partition them much more finely (say 64 ways).  This 
means there are some partitions (0-7) that have 800 SSPs associated with them, 
while the rest (8-63) have only 80.  The containers assigned those 
lower-numbered partitions are therefore handling a much higher overall volume 
and are getting hammered, particularly during peak times.  Because the 
container config is symmetric, to make the job perform correctly we have to 
overprovision the containers handling partitions 8-63.
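
To make the arithmetic concrete (the numbers below are just the illustrative 
figures above, not measurements from a real job):

{code:java}
public class SspSkewExample {
  public static void main(String[] args) {
    int totalTopics = 800;
    int highVolumeTopics = totalTopics / 10;                // ~10%, partitioned 64 ways
    int lowVolumeTopics = totalTopics - highVolumeTopics;   // partitioned 8 ways

    // Partitions 0-7 exist in every topic, so each picks up one SSP per topic.
    int sspsPerLowNumberedPartition = lowVolumeTopics + highVolumeTopics;  // 800
    // Partitions 8-63 exist only in the high-volume topics.
    int sspsPerHighNumberedPartition = highVolumeTopics;                   // 80

    System.out.println("SSPs on each of partitions 0-7:  " + sspsPerLowNumberedPartition);
    System.out.println("SSPs on each of partitions 8-63: " + sspsPerHighNumberedPartition);
  }
}
{code}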

With SAMZA-123, we can write custom SSPGroupers that know, a priori or via 
some mechanism, the relative volume of each SSP and group accordingly.  
There's still no provision for then scheduling those groups onto machines as 
optimally as possible (though this is behind a non-public interface and so 
could be done with relatively little work).  A rough sketch of such a grouper 
follows.
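
This is only a hypothetical sketch; it assumes the SystemStreamPartitionGrouper 
interface from SAMZA-123 takes roughly the shape 
group(Set<SystemStreamPartition>) -> Map<TaskName, Set<SystemStreamPartition>>. 
The class name, the per-stream volume map, and the greedy packing policy are 
all made up for illustration:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.samza.container.TaskName;
import org.apache.samza.container.grouper.stream.SystemStreamPartitionGrouper;
import org.apache.samza.system.SystemStreamPartition;

/**
 * Hypothetical volume-aware grouper: instead of grouping strictly by
 * partition id, SSPs are greedily packed into a fixed number of groups so
 * that the estimated message volume per group is roughly balanced.
 */
public class VolumeAwareSSPGrouper implements SystemStreamPartitionGrouper {
  private final int numGroups;
  // Estimated messages/sec per partition, keyed by stream name.  How these
  // estimates are obtained (static config, metrics, etc.) is out of scope.
  private final Map<String, Long> perPartitionVolume;
  private final long defaultVolume;

  public VolumeAwareSSPGrouper(int numGroups,
                               Map<String, Long> perPartitionVolume,
                               long defaultVolume) {
    this.numGroups = numGroups;
    this.perPartitionVolume = perPartitionVolume;
    this.defaultVolume = defaultVolume;
  }

  @Override
  public Map<TaskName, Set<SystemStreamPartition>> group(Set<SystemStreamPartition> ssps) {
    long[] load = new long[numGroups];
    List<Set<SystemStreamPartition>> buckets =
        new ArrayList<Set<SystemStreamPartition>>(numGroups);
    for (int i = 0; i < numGroups; i++) {
      buckets.add(new HashSet<SystemStreamPartition>());
    }

    for (SystemStreamPartition ssp : ssps) {
      Long estimate = perPartitionVolume.get(ssp.getStream());
      long volume = (estimate == null) ? defaultVolume : estimate.longValue();

      // Greedy: drop this SSP into whichever bucket is currently lightest.
      int lightest = 0;
      for (int i = 1; i < numGroups; i++) {
        if (load[i] < load[lightest]) {
          lightest = i;
        }
      }
      buckets.get(lightest).add(ssp);
      load[lightest] += volume;
    }

    Map<TaskName, Set<SystemStreamPartition>> groups =
        new HashMap<TaskName, Set<SystemStreamPartition>>();
    for (int i = 0; i < numGroups; i++) {
      groups.put(new TaskName("VolumeGroup " + i), buckets.get(i));
    }
    return groups;
  }
}
{code}

In practice a grouper like this would also need to produce the same grouping 
deterministically across restarts, so that TaskNames (and any state keyed off 
them) stay stable.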

> Need for asymmetric container config
> ------------------------------------
>
>                 Key: SAMZA-334
>                 URL: https://issues.apache.org/jira/browse/SAMZA-334
>             Project: Samza
>          Issue Type: Improvement
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Chinmay Soman
>
> The current (and upcoming) partitioning scheme(s) suggest that there might be 
> a skew in the amount of data ingested and computation performed across 
> different containers for a given Samza job. This directly affects the amount 
> of resources required by a container - which today are completely symmetric.
>
> Case A] Partitioning on Kafka partitions
> For instance, consider a partitioner job which reads data from different 
> Kafka topics (having different partition layouts). In this case, it's possible 
> that a lot of topics have a smaller number of Kafka partitions. Consequently, 
> the containers processing these lower-numbered partitions (shared by many more 
> topics) would need more resources than those responsible for the 
> higher-numbered partitions.
>
> Case B] Partitioning based on Kafka topics
> Even in this case, it's very easy for some containers to be doing more work 
> than others - leading to a skew in resource requirements.
>
> Today, the container config is based on the requirements of the worst-case 
> (busiest) container. Needless to say, this leads to resource wastage. A better 
> approach needs to consider the true requirement per container (instead of per 
> job).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
