[
https://issues.apache.org/jira/browse/BEAM-3542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349013#comment-16349013
]
Xinyu Liu commented on BEAM-3542:
-
Let me add more details about this: maxSourceParallelism is currently used
directly as the desiredNumSplits to split an UnboundedSource (close to
FlinkPipelineOptions.parallelism). Based on the number of splits, Samza will
create the same number of tasks, which is the logical parallelism in Samza. The
runtime parallelism in Samza is defined by container count, thread pool size
and max concurrency within a task. So I pick the name "maxSourceParallelism" to
distinguish it from the runtime. I put "max" as prefix since the actual splits
are equal or lower than this number, e.g. split a kafka topic of 10 partitions
with 1000 desiredNumSplits will still get 10 splits. Not sure whether the
naming of is misleading. We can call it desiredParallelism or maxParallelism if
that's better.
On the other hand, we do hope BEAM can support the use case of having default
splits without the need for desiredNumSplits. For most LinkedIn use cases, this
is very helpful since the user does not need to know the exact partitions of
each their own topic to specify the desiredNumSplits. They can simply rely on
the default splits to get the expected parallelism (assume the default splits
for Kafka source is the number of partitions). Can we support this in BEAM (I
can create a ticket for it)?
> SamzaPipelineOptions probably shouldn't need maxSourceParallelism
> -
>
> Key: BEAM-3542
> URL: https://issues.apache.org/jira/browse/BEAM-3542
> Project: Beam
> Issue Type: Bug
> Components: runner-samza
>Reporter: Kenneth Knowles
>Priority: Minor
>
> Let's continue to examine and make sure the runner is using unbounded sources
> in a Beam-ish consistent way. If it is necessary, that is OK too, but it
> seemed there might be things to clarify since the code review.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)