[jira] [Commented] (SAMZA-82) Not use maximum number of partitions when initializing streams

Chris Riccomini (JIRA) Sat, 09 Nov 2013 21:12:50 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-82?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818360#comment-13818360
 ]


Chris Riccomini commented on SAMZA-82:
--------------------------------------

Thinking on this a bit more. One way to fix this would be to remove the 
SAMZA_PARTITION_IDS environment variable in ShellCommandConfig, and replace it 
with something like SAMZA_STREAM_PARTITIONS, or something. The 
SAMZA_PARTITION_IDS just defines which partitions a SamzaContainer is 
responsible for, but this is actually not really what we want: we want the 
stream/partition parts. If we set something like this:

{noformat}
SAMZA_STREAM_PARTITIONS=foo.bar:0,foo.bar:2,foo.baz:0
{noformat}

Then the SamzaContainer would know exactly which partitions AND streams it was 
responsible for. This would allow us to prevent TaskInstance from accidentally 
registering non-existent partitions for a topic (e.g foo.baz:2).

The partition assignment is done in two spots: SamzaAppMasterTaskManager.scala, 
and LocalJobFactory.scala. In both cases, we'd have to switch to building a set 
of all individual SystemStreamPartitions, and then do a projection of that set 
when building the SAMZA_STREAM_PARTITIONS value.

> Not use maximum number of partitions when initializing streams
> --------------------------------------------------------------
>
>                 Key: SAMZA-82
>                 URL: https://issues.apache.org/jira/browse/SAMZA-82
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.7.0
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>             Fix For: 0.7.0
>
>
> Util.scala:
> {code}  /**
>    * Uses config to create SystemAdmin classes for all input stream systems to
>    * get each input stream's partition count, then returns the maximum count.
>    * An input stream with two partitions, and a second input stream with four
>    * partitions would result in this method returning 4.
>    */
>   def getMaxInputStreamPartitions(config: Config) = {
> {code}
> This approach works if all the streams have the same number of partitions, 
> but is inefficient for other cases and, where the underlying system gets 
> cranky about being asked about non-existing partitions, fails.  We should 
> eagerly figure out the correct number of partitions for each topic and pass 
> that information from there.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (SAMZA-82) Not use maximum number of partitions when initializing streams

Reply via email to