[jira] [Commented] (BEAM-3542) SamzaPipelineOptions probably shouldn't need maxSourceParallelism

2018-02-01 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349084#comment-16349084
 ] 

Kenneth Knowles commented on BEAM-3542:
---

[~chamikara] was commenting on the PR, and this is a good place to continue the 
discussion. Have you seen the thread about "hints" on the dev@ list? This 
sounds like it might be one of the use cases for that sort of thing.

> SamzaPipelineOptions probably shouldn't need maxSourceParallelism
> -
>
> Key: BEAM-3542
> URL: https://issues.apache.org/jira/browse/BEAM-3542
> Project: Beam
> Issue Type: Bug
> Components: runner-samza
> Reporter: Kenneth Knowles
> Priority: Minor
>
> Let's continue to examine and make sure the runner is using unbounded sources 
> in a Beam-ish consistent way. If it is necessary, that is OK too, but it 
> seemed there might be things to clarify since the code review.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3542) SamzaPipelineOptions probably shouldn't need maxSourceParallelism

2018-02-01 Thread Xinyu Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349013#comment-16349013
 ] 

Xinyu Liu commented on BEAM-3542:
-

Let me add more details: maxSourceParallelism is currently passed directly as 
the desiredNumSplits when splitting an UnboundedSource (similar to 
FlinkPipelineOptions.parallelism). Samza creates one task per split, and the 
number of tasks is the logical parallelism in Samza. The runtime parallelism in 
Samza is determined by the container count, the thread pool size, and the max 
concurrency within a task, so I picked the name "maxSourceParallelism" to 
distinguish it from the runtime parallelism. I used "max" as the prefix because 
the actual number of splits is equal to or lower than this number; e.g. 
splitting a Kafka topic of 10 partitions with a desiredNumSplits of 1000 will 
still yield 10 splits. I am not sure whether the name is misleading. We can call 
it desiredParallelism or maxParallelism if that's better.
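
To illustrate the "max" semantics described above, here is a minimal sketch in 
plain Java. It is not the Samza runner's or KafkaIO's actual code: the class 
SourceSplitSketch, the method split, and the round-robin assignment of 
partitions to splits are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not taken from the Samza runner or from KafkaIO. */
final class SourceSplitSketch {

  /**
   * Groups partition ids 0..partitionCount-1 into at most desiredNumSplits
   * splits. The number of splits is capped by the number of partitions, so a
   * very large desiredNumSplits never produces empty splits.
   */
  static List<List<Integer>> split(int partitionCount, int desiredNumSplits) {
    if (partitionCount < 1 || desiredNumSplits < 1) {
      throw new IllegalArgumentException("both arguments must be positive");
    }
    int numSplits = Math.min(desiredNumSplits, partitionCount);
    List<List<Integer>> splits = new ArrayList<>();
    for (int i = 0; i < numSplits; i++) {
      splits.add(new ArrayList<>());
    }
    // Round-robin assignment (an assumption; other strategies are possible).
    for (int partition = 0; partition < partitionCount; partition++) {
      splits.get(partition % numSplits).add(partition);
    }
    return splits;
  }

  public static void main(String[] args) {
    // 10 partitions with desiredNumSplits = 1000 still gives 10 splits.
    System.out.println(split(10, 1000).size());
    // 10 partitions with desiredNumSplits = 4 gives 4 splits.
    System.out.println(split(10, 4).size());
  }
}
```

Under this sketch, each resulting split would correspond to one Samza task, 
which is why the option acts as an upper bound rather than an exact count.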

On the other hand, we do hope Beam can support default splits without the need 
for desiredNumSplits. For most LinkedIn use cases this is very helpful, since 
users do not need to know the exact number of partitions of each of their topics 
in order to specify desiredNumSplits. They can simply rely on the default splits 
to get the expected parallelism (assuming the default number of splits for a 
Kafka source is the number of partitions). Can we support this in Beam? (I can 
create a ticket for it.)
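
As a rough sketch of what such a default could look like, the following plain 
Java models an optional desiredNumSplits that falls back to one split per 
partition when it is not set. Nothing here is an existing Beam or Samza API: 
the class DefaultSplitSketch, the method resolveNumSplits, and the use of 
OptionalInt to model "not specified" are all hypothetical.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of the requested behavior; not an existing Beam API. */
final class DefaultSplitSketch {

  /**
   * Returns the number of splits to create for a source with the given number
   * of partitions. If desiredNumSplits is absent, the default is one split per
   * partition; otherwise the desired value is capped by the partition count.
   */
  static int resolveNumSplits(int partitionCount, OptionalInt desiredNumSplits) {
    if (partitionCount < 1) {
      throw new IllegalArgumentException("partitionCount must be positive");
    }
    if (!desiredNumSplits.isPresent()) {
      return partitionCount; // assumed default: number of partitions
    }
    int desired = desiredNumSplits.getAsInt();
    if (desired < 1) {
      throw new IllegalArgumentException("desiredNumSplits must be positive");
    }
    return Math.min(desired, partitionCount);
  }

  public static void main(String[] args) {
    // No desiredNumSplits given: fall back to the partition count.
    System.out.println(resolveNumSplits(10, OptionalInt.empty()));
    // An explicit value still acts only as an upper bound.
    System.out.println(resolveNumSplits(10, OptionalInt.of(4)));
  }
}
```

With a default like this, a user would not have to look up the partition count 
of each topic before launching a pipeline.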



