Re: [jira] [Commented] (BEAM-755) beam-runners-core-java NeedsRunner tests not executing

2016-10-19 Thread Lukasz Cwik
At the point in time this was created, there were `NeedsRunner` tests.

On Wed, Oct 19, 2016 at 9:34 AM, Kenneth Knowles (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/BEAM-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589182#comment-15589182 ]
>
> Kenneth Knowles commented on BEAM-755:
> --
>
> Noting that the resolution is that `beam-runners-core-java` actually just
> doesn't have any `NeedsRunner` tests. We should still get the `pom` set up
> right so that if one is added it will execute.
>
> > beam-runners-core-java NeedsRunner tests not executing
> > --
> >
> > Key: BEAM-755
> > URL: https://issues.apache.org/jira/browse/BEAM-755
> > Project: Beam
> > Issue Type: Bug
> > Components: runner-core
> > Reporter: Luke Cwik
> > Assignee: Kenneth Knowles
> > Priority: Minor
> > Fix For: 0.3.0-incubating
> >
> >
> > org.apache.beam:beam-runners-core-java is not specified as an
> integration test dependency to scan within runners/pom.xml.
> > There is also a typo in runners/direct-java/pom.xml, where it is
> org.apache.beam:beam-runners-java-core and should be
> org.apache.beam:beam-runners-core-java.
> > Finally, even if these dependencies are added and the typo is fixed, when
> running the runnable-on-service integration tests, SplittableParDoTest
> (part of runners/core-java), which contains @RunnableOnService tests,
> doesn't execute.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: [jira] [Commented] (BEAM-434) When examples write output to file it creates many output files instead of one

2016-07-12 Thread Lukasz Cwik
If we go with any option that restricts the number of outputs then in the
example we should discuss what it does and why it is not considered a good
thing.

On Tue, Jul 12, 2016 at 2:11 AM, Amit Sela (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/BEAM-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15372225#comment-15372225
> ]
>
> Amit Sela commented on BEAM-434:
> 
>
> I sort of prefer 2, but by letting the user pass the numShards
> configuration (which may need a better name)
> Like I mentioned in the PR, if we want to give a simple example result on
> one hand, while keeping in the user's mind the fact that multiple shards
> are a thing to consider, we could add a --numShards option and add it to
> the examples code with a default of 1 (or 3).
> If we want the users to know about multiple output shards, why should we
> keep the examples "pure" ?
>
> How about adding an option named "--numOutputShards" with default value 1
> (or 3, I could live with 3 :) ) and adding this to the examples README,
> thus giving a better experience in terms of "seeing" the output, while
> keeping the multiple-shards "on the table" and as a bonus, the Travis CI
> tests could still run with as many shards as we want (while I wanted
> examples to be easy enough, I definitely didn't want that for Travis!)
>
> WDYT ?
>
>
> > When examples write output to file it creates many output files instead
> of one
> >
> --
> >
> > Key: BEAM-434
> > URL: https://issues.apache.org/jira/browse/BEAM-434
> > Project: Beam
> > Issue Type: Bug
> > Components: examples-java
> > Reporter: Amit Sela
> > Assignee: Amit Sela
> > Priority: Minor
> >
> > When using `TextIO.Write.to("/path/to/output")` without any
> restrictions on the number of shards, it might generate many output files
> (depending on your input). For WordCount, for example, you'll get as many
> output files as unique words in your input.
> > Since I think examples are expected to execute in a friendly manner, so
> that you can "see" what they do, rather than to optimize for performance in
> some way, I suggest using `withoutSharding()` when writing the example
> output to an output file.
> > Examples I could find that behave this way:
> > org.apache.beam.examples.WordCount
> > org.apache.beam.examples.complete.TfIdf
> > org.apache.beam.examples.cookbook.DeDupExample
>
>
>
>
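To make the `--numOutputShards` proposal quoted above concrete, here is a minimal sketch in plain Java with no Beam dependency. Everything in it is an assumption made for illustration: the class `OutputShardsSketch`, both helper methods, and the zero-padded `-SSSSS-of-NNNNN` style of file name are hypothetical stand-ins, not the examples' actual code. It only shows how a default of 1 keeps the output easy to inspect while a larger value can still be passed explicitly.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: plain Java, no Beam dependency. All names here are hypothetical. */
final class OutputShardsSketch {
  /** Default discussed in the thread: a single output file unless more are requested. */
  static final int DEFAULT_NUM_OUTPUT_SHARDS = 1;

  /** Reads --numOutputShards=N from the arguments, falling back to the default. */
  static int parseNumOutputShards(String[] args) {
    String prefix = "--numOutputShards=";
    for (String arg : args) {
      if (arg.startsWith(prefix)) {
        int n = Integer.parseInt(arg.substring(prefix.length()));
        if (n < 1) {
          throw new IllegalArgumentException("numOutputShards must be >= 1, got " + n);
        }
        return n;
      }
    }
    return DEFAULT_NUM_OUTPUT_SHARDS;
  }

  /** Lists the file names a sharded write would produce (assumed naming scheme). */
  static List<String> shardFileNames(String outputPrefix, int numShards) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < numShards; i++) {
      names.add(String.format("%s-%05d-of-%05d", outputPrefix, i, numShards));
    }
    return names;
  }

  public static void main(String[] args) {
    int numShards = parseNumOutputShards(args);
    for (String name : shardFileNames("/path/to/output", numShards)) {
      System.out.println(name);
    }
  }
}
```

Run without arguments, the sketch lists a single file; a CI job could pass a larger `--numOutputShards` value to keep exercising the multi-shard path.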


Re: [jira] [Created] (BEAM-185) XmlSink output file pattern missing "." in extension

2016-04-08 Thread Lukasz Cwik
DataflowPipelineRunner already has some logic that adds the "."
automatically; maybe this could be extended to be generic for FileBasedSource.

On Fri, Apr 8, 2016 at 4:24 PM, Scott Wegner (JIRA)  wrote:

> Scott Wegner created BEAM-185:
> -
>
> Summary: XmlSink output file pattern missing "." in extension
> Key: BEAM-185
> URL: https://issues.apache.org/jira/browse/BEAM-185
> Project: Beam
> Issue Type: Bug
> Reporter: Scott Wegner
> Priority: Minor
>
>
> The XmlSink takes as input a filename prefix and adds the shard name and
> extension automatically. However, it is missing the "." when adding the
> extension.
>
> For an XmlSink configured as:
>
> {{XmlSink.write().toFilenamePrefix("foobar");}}
>
> the fileNamePattern is {{foobar-S-of-Nxml}}. Instead, it should be
> {{foobar-S-of-N.xml}}
>
>
>
>
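One way to picture a generic fix for the missing "." described above is a small normalization step applied before the extension is appended. The sketch below is plain Java and does not use the real XmlSink or DataflowPipelineRunner code; the class `ShardFileNameSketch` and both methods are hypothetical, and the `-S-of-N` shard placeholder is copied from the issue text as written.

```java
/** Sketch only: hypothetical helper, not the actual XmlSink implementation. */
final class ShardFileNameSketch {
  /** Returns the extension with exactly one leading ".", or "" if there is none. */
  static String normalizeExtension(String extension) {
    if (extension == null || extension.isEmpty()) {
      return "";
    }
    return extension.startsWith(".") ? extension : "." + extension;
  }

  /** Builds the file name pattern from a prefix, the shard placeholder, and the extension. */
  static String fileNamePattern(String prefix, String extension) {
    return prefix + "-S-of-N" + normalizeExtension(extension);
  }

  public static void main(String[] args) {
    // Both spellings of the extension give the same pattern.
    System.out.println(fileNamePattern("foobar", "xml"));
    System.out.println(fileNamePattern("foobar", ".xml"));
  }
}
```

Normalizing in one place means callers can pass either "xml" or ".xml" without producing a doubled or missing separator.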


Re: [jira] [Created] (BEAM-178) stdout vs logging in DataflowPipelineRunner?

2016-04-08 Thread Lukasz Cwik
I think this was because we wanted to make sure the user would at least
see that they had submitted a pipeline, even if they haven't set up logging
correctly.

On Wed, Apr 6, 2016 at 2:19 PM, Daniel Halperin (JIRA) wrote:

> Daniel Halperin created BEAM-178:
> 
>
> Summary: stdout vs logging in DataflowPipelineRunner?
> Key: BEAM-178
> URL: https://issues.apache.org/jira/browse/BEAM-178
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Reporter: Daniel Halperin
> Assignee: Davor Bonaci
> Priority: Minor
>
>
> We seem to thoroughly intermingle logging and println. Is this deliberate?
>
> e.g.,
>
> {code}
> LOG.info("To access the Dataflow monitoring console, please navigate to {}",
>     MonitoringUtil.getJobMonitoringPageURL(options.getProject(),
>         jobResult.getId()));
> System.out.println("Submitted job: " + jobResult.getId());
> {code}
>
> Original genesis for this was noticing a println in a backport CL:
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/com/google/cloud/dataflow/sdk/runners/DataflowPipelineRunner.java#L451
>
>
>
>
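The rationale given in the reply, that the submission notice should reach the user even when logging is not configured, can be illustrated with a small self-contained sketch. It deliberately swaps in `java.util.logging` from the JDK for the logging framework used by the runner, and the class `SubmissionNoticeSketch`, its `announce` method, the job id, and the URL are all hypothetical values chosen for the demo.

```java
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch only: uses java.util.logging as a stand-in; all names are hypothetical. */
final class SubmissionNoticeSketch {
  /**
   * Sends the monitoring hint through the logger and the job id through a plain
   * stream, so the job id is printed regardless of how the logger is configured.
   */
  static void announce(String jobId, String monitoringUrl, Logger log, PrintStream out) {
    log.log(Level.INFO, "To access the monitoring console, please navigate to {0}", monitoringUrl);
    out.println("Submitted job: " + jobId);
  }

  public static void main(String[] args) {
    Logger log = Logger.getLogger("submission-sketch");
    // Simulate a user whose logging setup drops everything.
    log.setLevel(Level.OFF);
    announce("job-123", "https://example.invalid/jobs/job-123", log, System.out);
  }
}
```

With the logger switched off, the monitoring hint is lost but the "Submitted job" line is still printed, which is the behavior the reply describes as intentional.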


Re: [jira] [Commented] (BEAM-68) Support for limiting parallelism of a step

2016-03-31 Thread Lukasz Cwik
I believe you can guarantee the number of shards (with a more complex set
of transforms). You just need to figure out which shards are empty and
force the write operation for them. We can have two implementations of
write: one that doesn't write when there are zero elements (the default),
and one that does go through the motions of writing for zero elements.

On scaling: numShards is a parallelism limit, so it already doesn't scale.
The thing we lose most is the ability to dynamically rebalance work if
there is a straggler.

On the overly restrictive implementation: this is one of those cases where
you have a composite PTransform with a basic implementation that uses GBK
under the hood, which runners can override if they can enforce the
parallelism constraint in a better way.
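
A rough, Beam-free sketch of the empty-shard idea, in plain Java: every shard key is created up front, so empty shards are known and a write can be forced for them. The class `FixedShardsSketch`, both methods, the hash-based shard assignment, and the output strings are assumptions made for illustration, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch only: simulates grouping by shard key in memory; all names are hypothetical. */
final class FixedShardsSketch {
  /** Groups elements into exactly numShards buckets, keeping empty buckets present. */
  static Map<Integer, List<String>> groupIntoShards(List<String> elements, int numShards) {
    if (numShards < 1) {
      throw new IllegalArgumentException("numShards must be >= 1");
    }
    Map<Integer, List<String>> shards = new TreeMap<>();
    // Pre-populate every key so that empty shards are visible to the writer.
    for (int shard = 0; shard < numShards; shard++) {
      shards.put(shard, new ArrayList<>());
    }
    for (String element : elements) {
      // floorMod keeps the shard index non-negative for negative hash codes.
      shards.get(Math.floorMod(element.hashCode(), numShards)).add(element);
    }
    return shards;
  }

  /** Returns one line per shard written; empty shards are skipped unless forced. */
  static List<String> writeAll(Map<Integer, List<String>> shards, boolean writeEmptyShards) {
    List<String> written = new ArrayList<>();
    for (Map.Entry<Integer, List<String>> entry : shards.entrySet()) {
      if (entry.getValue().isEmpty() && !writeEmptyShards) {
        continue; // default behavior: nothing to write for zero elements
      }
      written.add("shard " + entry.getKey() + " of " + shards.size()
          + ": " + entry.getValue().size() + " element(s)");
    }
    return written;
  }

  public static void main(String[] args) {
    Map<Integer, List<String>> shards = groupIntoShards(List.of("a", "b"), 4);
    writeAll(shards, true).forEach(System.out::println);
  }
}
```

The `writeEmptyShards` flag mirrors the two write implementations mentioned above: the default skips empty shards, while the forcing variant produces exactly numShards outputs.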


On Wed, Mar 30, 2016 at 10:28 PM, Daniel Halperin (JIRA) wrote:

>
> [
> https://issues.apache.org/jira/browse/BEAM-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219376#comment-15219376
> ]
>
> Daniel Halperin commented on BEAM-68:
> -
>
> Okay, I think I'm partially wrong.
>
> KV<K, Iterable<V>> -> ParDo(process all elements in a single DoFn with
> per-K startBundle/endBundle/etc) is doable as a solution to BEAM-92.
> - It won't of course work with empty K, so you can't in fact guarantee
>   numShards is matched.
> - It won't scale.
> - It overly restricts implementation.
> but I think it works, in essence, without a model change.
>
> Would you prefer to dupe 169 against 92? I don't see a need for more bug
> bloat here tho. Do you have suggested edits to the text of either bug that
> would fix this?
>
> > Support for limiting parallelism of a step
> > --
> >
> > Key: BEAM-68
> > URL: https://issues.apache.org/jira/browse/BEAM-68
> > Project: Beam
> > Issue Type: New Feature
> > Components: beam-model
> > Reporter: Daniel Halperin
> >
> > Users may want to limit the parallelism of a step. Two classic use
> cases are:
> > - User wants to produce at most k files, so sets
> TextIO.Write.withNumShards(k).
> > - External API only supports k QPS, so user sets a limit of k/(expected
> QPS/step) on the ParDo that makes the API call.
> > Unfortunately, there is no way to do this effectively within the Beam
> model. A GroupByKey with exactly k keys will guarantee that only k elements
> are produced, but runners are free to break fusion in ways that each
> element may be processed in parallel later.
> > To implement this functionality, I believe we need to add this support
> to the Beam Model.
>
>
>
>