[ 
https://issues.apache.org/jira/browse/SAMZA-82?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820718#comment-13820718
 ] 

Jakob Homan commented on SAMZA-82:
----------------------------------

Passing something like SAMZA_STREAM_PARTITIONS seems like the best approach.  
I'd rather we knew as soon as possible what the actual topic-partitions we're 
dealing with are, rather than having a huge set of potential topic-partition 
pairs floating through the code paths.  The sooner we determine the actual work 
to be done, the better, particularly as we do more work in the job assignment 
phase.
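
To make the difference concrete, here is a minimal sketch (not Samza code) contrasting the two enumerations. It assumes the per-stream partition counts are already available as a plain Map; the object and method names are hypothetical and only for illustration.

```scala
object PartitionEnumeration {
  // Hypothetical input: stream name -> partition count, however it was looked up.

  /** Max-based approach: every stream is assumed to have `max` partitions. */
  def potentialPairs(counts: Map[String, Int]): Set[(String, Int)] = {
    val max = if (counts.isEmpty) 0 else counts.values.max
    counts.keys.toSeq.flatMap(stream => (0 until max).map(p => (stream, p))).toSet
  }

  /** Exact approach: only the partitions each stream actually has. */
  def actualPairs(counts: Map[String, Int]): Set[(String, Int)] =
    counts.toSeq.flatMap { case (stream, n) => (0 until n).map(p => (stream, p)) }.toSet
}
```

With one stream of two partitions and another of four, the max-based set contains two pairs that do not exist, which are exactly the ones an underlying system may reject.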

I was concerned about how large the SAMZA_STREAM_PARTITIONS env variable 
would be for jobs with large numbers of topics and/or partitions, but there 
doesn't seem to actually be a [practical limit on their 
size|http://stackoverflow.com/questions/1078031/what-is-the-maximum-size-of-an-environment-variable-value].
  Just the same, it may be best to do some type of RLE on the variable, e.g.
{noformat}SAMZA_STREAM_PARTITIONS=foo.bar:0,2,foo.baz:0{noformat}
or
{noformat}SAMZA_STREAM_PARTITIONS=foo.(bar:0,2)(.baz:0){noformat}
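
As a rough sketch of how the first form might be produced and read back, assuming that a comma-separated token containing ':' starts a new stream and that a bare number is a further partition id of the preceding stream (reading "0,2" as partitions 0 and 2, not as a range). The StreamPartitionsCodec object is hypothetical, not an existing Samza class.

```scala
object StreamPartitionsCodec {
  /** Encode e.g. Seq("foo.bar" -> Seq(0, 2), "foo.baz" -> Seq(0)) as "foo.bar:0,2,foo.baz:0". */
  def encode(partitions: Seq[(String, Seq[Int])]): String =
    partitions.map { case (stream, ids) =>
      require(ids.nonEmpty, s"stream '$stream' has no partitions")
      s"$stream:${ids.mkString(",")}"
    }.mkString(",")

  /** Decode the same form back into stream -> partition ids, preserving order. */
  def decode(value: String): Seq[(String, Seq[Int])] =
    value.split(",").filter(_.nonEmpty)
      .foldLeft(Vector.empty[(String, Vector[Int])]) { (acc, token) =>
        token.lastIndexOf(':') match {
          case -1 =>
            // Bare number: another partition of the most recent stream.
            require(acc.nonEmpty, s"partition '$token' appears before any stream name")
            val (stream, ids) = acc.last
            acc.init :+ (stream -> (ids :+ token.trim.toInt))
          case idx =>
            // "name:partition": start a new stream entry.
            acc :+ (token.substring(0, idx).trim -> Vector(token.substring(idx + 1).trim.toInt))
        }
      }
}
```

A container could then read the value with something like sys.env.get("SAMZA_STREAM_PARTITIONS").map(StreamPartitionsCodec.decode). The second, prefix-factored form would need a different parser and is not covered here.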
 

> Not use maximum number of partitions when initializing streams
> --------------------------------------------------------------
>
>                 Key: SAMZA-82
>                 URL: https://issues.apache.org/jira/browse/SAMZA-82
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.7.0
>            Reporter: Jakob Homan
>            Assignee: Jakob Homan
>             Fix For: 0.7.0
>
>
> Util.scala:
> {code}  /**
>    * Uses config to create SystemAdmin classes for all input stream systems to
>    * get each input stream's partition count, then returns the maximum count.
>    * An input stream with two partitions, and a second input stream with four
>    * partitions would result in this method returning 4.
>    */
>   def getMaxInputStreamPartitions(config: Config) = {
> {code}
> This approach works if all the streams have the same number of partitions, 
> but is inefficient for other cases and, where the underlying system gets 
> cranky about being asked about non-existing partitions, fails.  We should 
> eagerly figure out the correct number of partitions for each topic and pass 
> that information from there.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
