[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-31 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220089#comment-15220089
 ] 

Eugene Kirpichov commented on BEAM-68:
--

I re-duped BEAM-169 against BEAM-92.
- I agree that it won't work with empty K, however that should be relatively 
unlikely if the user has enough data that sharding it makes sense, and if their 
hash function is good; and in some sharded sink scenarios it may be desirable 
to not write empty shards.
- I'm not sure what you mean by "won't scale": individual shards have to be 
written sequentially, but they can be written in parallel with each other in 
this proposed implementation: dynamic rebalancing will separate the shard keys 
from each other.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-30 Thread Daniel Halperin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219376#comment-15219376
 ] 

Daniel Halperin commented on BEAM-68:
-

Okay, I think I'm partially wrong.

KV -> ParDo(process all elements in a single DoFn with per-K 
startBundle/endBundle/etc) is doable as a solution to BEAM-92.
   -It won't of course work with empty K, so you can't in fact guarantee 
numShards is matched.
   -It won't scale.
   -It overly restricts implementation.
but I think it works, in essence, without a model change.

Would you prefer to dupe 169 against 92? I don't see a need for more bug bloat 
here tho. Have suggested edits to the text of either bug that will fix?

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-30 Thread Daniel Halperin (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219353#comment-15219353
 ] 

Daniel Halperin commented on BEAM-68:
-

Eugene: I disagree with the premise that even the group by key trick works 
without runner support. Fusion breaks and dynamic work rebalancing violate all 
such assumptions. Model changes are all that will guarantee anything here.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-30 Thread Davor Bonaci (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219143#comment-15219143
 ] 

Davor Bonaci commented on BEAM-68:
--

The uses cases are different for sure.

However, it is unclear whether the solution is the same. Perhaps we'll end up 
with different APIs, but that is far from certain, I think. It is explicitly 
called out as a scenario here, so I don't think we need to have a separate 
tracking issue.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-30 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219126#comment-15219126
 ] 

Eugene Kirpichov commented on BEAM-68:
--

I think the two use cases are quite different and need different APIs. The 
first use case is about the expected outputs of the pipeline (i.e. 
runner-agnostic); the second is about the execution (implemented in a 
runner-specific way).

I think for the first case we just need to come up with a convenient way to 
encode the sharding needs of custom sinks (e.g. output filename for a 
file-based sink, or say, shard of a pubsub topic for a sink that outputs to 
some pubsub system), and it can probably be done as a PTransform. I think it 
can actually be done using the "data-dependent sinks" API (route the data to a 
destination shard, one sink per shard), BEAM-92.

The second case requires support in the Beam model.

I'm inclined to reopen BEAM-159, let me know what you think.

> Support for limiting parallelism of a step
> --
>
> Key: BEAM-68
> URL: https://issues.apache.org/jira/browse/BEAM-68
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Daniel Halperin
>
> Users may want to limit the parallelism of a step. Two classic uses cases are:
> - User wants to produce at most k files, so sets 
> TextIO.Write.withNumShards(k).
> - External API only supports k QPS, so user sets a limit of k/(expected 
> QPS/step) on the ParDo that makes the API call.
> Unfortunately, there is no way to do this effectively within the Beam model. 
> A GroupByKey with exactly k keys will guarantee that only k elements are 
> produced, but runners are free to break fusion in ways that each element may 
> be processed in parallel later.
> To implement this functionaltiy, I believe we need to add this support to the 
> Beam Model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)