Re: Low availability on my end in the coming 3 weeks

2017-04-18 Thread Kenneth Knowles
+1 to good travels & move.

Your absence will surely be noticed for its own sake. See you when you
resurface :-)

Kenn

On Wed, Apr 12, 2017 at 6:28 AM, Jean-Baptiste Onofré 
wrote:

> Thanks for the update and your trust, Amit!
>
> Safe travel and take your time to move to your new home ;)
>
> Regards
> JB
>
>
> On 04/12/2017 03:23 PM, Amit Sela wrote:
>
>> Hi everyone,
>>
>> I will be traveling and moving in the next ~3 weeks so I will be less
>> available than usual.
>> I believe our dev community is more mature now so my absence won't be
>> noticed :-) but I still asked JB to formally take my place with anything
>> concerning the Spark runner and he kindly agreed.
>>
>> Thanks,
>> Amit
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Naming of Combine.Globally

2017-04-18 Thread Robert Bradshaw
On Tue, Apr 18, 2017 at 3:03 AM, Wesley Tanaka
 wrote:
> I believe that foldl in Haskell https://www.haskell.org/hoogle/?hoogle=foldl 
> admits a separate accumulator type from the type of the data structure being 
> "folded"
> And, well, python lets you have your way with mixing types, but this 
> certainly works as another example:python -c "print(reduce(lambda ac, elem: 
> '%s%d' % (ac,elem), [1,2,3,4,5], ''))"
> Is there anything special about the AccumT->OutputT conversion that 
> extractOutput() needs to be in the same interface as createAccumulator(), 
> addInput() and mergeAccumulators()?  If the interface were segregated such 
> that one interface managed the InputT->AccumT conversion, and the second 
> managed the AccumT->OutputT conversion, it seems like maybe the 
> AccumT->OutputT conversion could even get replaced with MapElements?  And 
> then the full current "Combine" functionality could be implemented as a 
> composition of the lower-level primitives?

It is correct that the AccumT->OutputT conversion could be implemented
as a subsequent MapElements operation. One reason that it's not is
that in practice it's tightly coupled with the other parts of the
Combine as a single semantic unit (e.g. one thinks of taking the
"mean" as a single operation). Once one moves beyond the simple
combiners (with identity extractOutput) there's often a 1:1
correspondence between the CombineFn and its output extraction that
reduces the potential value of splitting them while placing further
burden on their users (e.g. there's little use for the Quantiles
intermediate accumulator without the mapping from it to the actual
quantiles, and vice versa).

Also, letting the extractOutput be part of the CombineFn itself rather
than simply providing a separate Map allows one to use the CombineFn
uniformly for global combine, per-key combine, and combining state. It
also makes it more composable (e.g. the TupleCombineFns at
https://github.com/apache/beam/blob/release-0.6.0/sdks/python/apache_beam/transforms/combiners.py#L443
create a single CombineFn, including the output extraction, from a set
of CombineFns, applying each in parallel).
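
For concreteness, here is a minimal sketch of such a tightly coupled CombineFn
in the Java SDK (a mean, whose (sum, count) accumulator is only useful together
with extractOutput); the Accum class and coder handling are simplified for
illustration:

import java.io.Serializable;
import org.apache.beam.sdk.transforms.Combine;

// Sketch: "mean" expressed as a single Combine.CombineFn<InputT, AccumT, OutputT>.
// Depending on the pipeline, a coder for Accum may still need to be registered
// (e.g. SerializableCoder).
public class MeanFn extends Combine.CombineFn<Double, MeanFn.Accum, Double> {

  public static class Accum implements Serializable {
    double sum = 0;
    long count = 0;
  }

  @Override
  public Accum createAccumulator() {
    return new Accum();
  }

  @Override
  public Accum addInput(Accum accum, Double input) {
    accum.sum += input;
    accum.count++;
    return accum;
  }

  @Override
  public Accum mergeAccumulators(Iterable<Accum> accums) {
    Accum merged = createAccumulator();
    for (Accum a : accums) {
      merged.sum += a.sum;
      merged.count += a.count;
    }
    return merged;
  }

  @Override
  public Double extractOutput(Accum accum) {
    // The (sum, count) accumulator is of little use on its own; the "mean"
    // only exists once this extraction step runs.
    return accum.count == 0 ? Double.NaN : accum.sum / accum.count;
  }
}

The same object can then be applied as, e.g., Combine.globally(new MeanFn()),
or reused unchanged for per-key combines and combining state, which is the
uniformity argument above.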

- Robert


Re: [DISCUSSION] PAssert success/failure count validation for all runners

2017-04-18 Thread Aviem Zur
So to summarize, most seem to agree that:

1) We should verify PAssert execution occurred in all runners.
2) We should verify this using metrics in sdk-java-core for runners which
support metrics. This will save those runner writers from having to verify
this in the runner code. See:
https://issues.apache.org/jira/browse/BEAM-1763
3) Runners which do not support metrics need to verify this in another way
(or, alternatively, implement metrics support to get this functionality
automatically via (2)).

This work is tracked in the following ticket:
https://issues.apache.org/jira/browse/BEAM-2001
Runners which support this via either (2) or (3) will have their respective
subtasks marked as resolved.
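
As a rough sketch of what (2) could look like from a runner test's point of
view (not the actual implementation in the PR - the counter namespace/name and
the exact accessor names are assumptions here and may differ between Beam
versions):

import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class PAssertMetricsCheck {
  // Hypothetical helper: verify that the expected number of PAssert successes
  // was actually counted. The "PAssert"/"PAssertSuccess" names are illustrative.
  public static void verifySuccessCount(PipelineResult result, int expectedAssertions) {
    MetricQueryResults metrics =
        result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named("PAssert", "PAssertSuccess"))
                .build());

    long successes = 0;
    for (MetricResult<Long> counter : metrics.counters()) {
      successes += counter.attempted();
    }
    if (successes != expectedAssertions) {
      throw new AssertionError(
          "Expected " + expectedAssertions + " successful assertions, saw " + successes);
    }
  }
}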

On Mon, Apr 17, 2017 at 9:43 PM Pablo Estrada 
wrote:

> Hi all,
> sorry about the long silence. I was on vacation all of last week.
> Also, the implementation of my proposal is in this PR:
> https://github.com/apache/beam/pull/2417.
>
> I don't currently have plans on the work for
> https://issues.apache.org/jira/browse/BEAM-1763. My main focus as of right
> now is on the removal of aggregators from the Java SDK, which is tracked by
> https://issues.apache.org/jira/browse/BEAM-775.
>
> As per the previous discussion, it seems reasonable that I go ahead with PR
> 2417 and move on with the removal of aggregators from other parts of the
> SDK. Is this reasonable to the community?
> Best
> -P.
>
>
> On Mon, Apr 17, 2017 at 11:17 AM Dan Halperin  wrote:
>
> > (I'll also note that the bit about URNs and whatnot is decouplable -- we
> > have Pipeline surgery APIs right now, and will someday have
> > URN-with-payload-based-surgery APIs, but we can certainly do the work to
> > make PAssert more overridable now and be ready for full Runner API work
> > later).
> >
> > On Mon, Apr 17, 2017 at 11:14 AM, Dan Halperin 
> > wrote:
> >
> >> I believe Pablo's existing proposal is here:
> >>
> https://lists.apache.org/thread.html/CADJfNJBEuWYhhH1mzMwwvUL9Wv2HyFc8_E=9zybkwugt8ca...@mail.gmail.com
> >>
> >> The idea is that we'll stick with the current design -- aggregator- (but
> >> now metric)-driven validation of PAssert. Runners that do not support
> these
> >> things can override the validation step to do something different.
> >>
> >> This seems to me to satisfy all parties and unblock removal of
> >> aggregators. If a runner supports aggregators but not metrics because
> the
> >> semantics are slightly different, that runner can override the behavior.
> >>
> >> I agree that all runners doing sensible things with PAssert should be a
> >> first stable release blocker. But I do not think it's important that all
> >> runners verify them the same way. There has been no proposal that
> provides
> >> a single validation mechanism that works well with all runners.
> >>
> >> On Wed, Apr 12, 2017 at 9:24 AM, Aljoscha Krettek 
> >> wrote:
> >>
> >>> That sounds very good! Now we only have to manage to get this in before
> >>> the first stable release because I think this is a very important
> signal
> >>> for ensuring Runner correctness.
> >>>
> >>> @Pablo Do you already have plans regarding 3., i.e. stable URNs for the
> >>> assertions? And also for verifying them in a runner-agnostic way in
> >>> TestStream, i.e. https://issues.apache.org/jira/browse/BEAM-1763?
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>> > On 10. Apr 2017, at 10:10, Kenneth Knowles 
> >>> wrote:
> >>> >
> >>> > On Sat, Apr 8, 2017 at 7:00 AM, Aljoscha Krettek <
> aljos...@apache.org>
> >>> > wrote:
> >>> >
> >>> >> @kenn What’s the design you’re mentioning? (I probably missed it
> >>> because
> >>> >> I’m not completely up to data on the Jiras and ML because of Flink
> >>> Forward
> >>> >> preparations)
> >>> >>
> >>> >
> >>> > There are three parts (I hope I say this in a way that makes everyone
> >>> happy)
> >>> >
> >>> > 1. Each assertion transform is followed by a verifier transform that
> >>> fails
> >>> > if it sees a non-success (in addition to bumping metrics).
> >>> > 2. Use the same trick PAssert already uses, flatten in a dummy value
> to
> >>> > reduce the risk that the verifier transform never runs.
> >>> > 3. Stable URNs for the assertion and verifier transforms so a runner
> >>> has a
> >>> > good chance to wire custom implementations, if it helps.
> >>> >
> >>> > I think someone mentioned it earlier, but these also work better with
> >>> > metrics that overcount, since it is now about covering the verifier
> >>> > transforms rather than an absolute number of successes.
> >>> >
> >>> > Kenn
> >>> >
> >>> >
> >>> >>> On 7. Apr 2017, at 12:42, Kenneth Knowles 
> >>> >> wrote:
> >>> >>>
> >>> >>> We also have a design that improves the signal even without
> metrics,
> >>> so
> >>> >> I'm
> >>> >>> pretty happy with this.
> >>> >>>
> >>> >>> On Fri, Apr 7, 2017 at 

Re: Pipeline termination in the unified Beam model

2017-04-18 Thread Aljoscha Krettek
BEAM-593 is blocked by Flink issues:

- https://issues.apache.org/jira/browse/FLINK-2313: Change Streaming Driver 
Execution Model
- https://issues.apache.org/jira/browse/FLINK-4272: Create a JobClient for job 
control and monitoring

where the second is kind of a duplicate of the first one.

There is a pending PR that is a bit dated, but this week I’ll try to see if I
can massage it into shape for Flink 1.3, which should be released in about 1.5
months.

> On 18. Apr 2017, at 18:17, Ismaël Mejía  wrote:
> 
> +1 Having a unified termination semantics for all runners is super important.
> 
> Stas or Aviem, is it feasible to do this for the Spark runner, or is the
> timeout due to a technical limitation of Spark?
> 
> Thomas Weise, Aljoscha anything to say on this?
> 
> Aljoscha, what is the current status for the Flink runner? Is there
> any progress towards BEAM-593?
> 
> 
> On Tue, Apr 18, 2017 at 5:05 PM, Stas Levin  wrote:
>> Ted, the timeout is needed mostly for testing purposes.
>> AFAIK there is no easy way to express the fact a source is "done" in a
>> Spark native streaming application.
>> Moreover, the Spark streaming "native" flow can either "awaitTermination()"
>> or "awaitTerminationOrTimeout(...)". If you "awaitTermination" then you're
>> blocked until the execution is either stopped or has failed, so if you wish
>> to stop the app sooner, say after a certain period of time,
>> "awaitTerminationOrTimeout(...)" may be the way to go.
>> 
>> Using the unified approach discussed in this thread, when a source is
>> "done" (i.e. the watermark is +Infinity) the app (e.g. runner) would
>> gracefully stop.
>> 
>> 
>> 
>> On Tue, Apr 18, 2017 at 3:19 PM Ted Yu  wrote:
>> 
>>> Why is the timeout needed for Spark ?
>>> 
>>> Thanks
>>> 
 On Apr 18, 2017, at 3:05 AM, Etienne Chauchot 
>>> wrote:
 
 +1 on "runners really terminate in a timely manner to easily
>>> programmatically orchestrate Beam pipelines in a portable way, you do need
>>> to know whether
 the pipeline will finish without thinking about the specific runner and
>>> its options"
 
 As an example, in Nexmark, we have streaming mode tests, and for the
>>> benchmark, we need all the queries to behave the same between runners
>>> towards termination.
 
 For now, to have the consistent behavior, in this mode we need to set a
>>> timeout (a bit random and flaky) on waitUntilFinish() for spark but this
>>> timeout is not needed for direct runner.
 
 Etienne
 
> Le 02/03/2017 à 19:27, Kenneth Knowles a écrit :
> Isn't this already the case? I think semantically it is an unavoidable
> conclusion, so certainly +1 to that.
> 
> The DirectRunner and TestDataflowRunner both have this behavior already.
> I've always considered that a streaming job running forever is just
>>> [very]
> suboptimal shutdown latency :-)
> 
> Some bits of the discussion on the ticket seem to surround whether or
>>> how
> to communicate this property in a generic way. Since a runner owns its
> PipelineResult it doesn't seem necessary.
> 
> So is the bottom line just that you want to more strongly insist that
> runners really terminate in a timely manner? I'm +1 to that, too, for
> basically the reason Stas gives: In order to easily programmatically
> orchestrate Beam pipelines in a portable way, you do need to know
>>> whether
> the pipeline will finish without thinking about the specific runner and
>>> its
> options (as with our RunnableOnService tests).
> 
> Kenn
> 
> On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin
>>> 
> wrote:
> 
>> Note that even "unbounded pipeline in a streaming
>>> runner".waitUntilFinish()
>> can return, e.g., if you cancel it or terminate it. It's totally
>>> reasonable
>> for users to want to understand and handle these cases.
>> 
>> +1
>> 
>> Dan
>> 
>> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
>> wrote:
>> 
>>> +1
>>> 
>>> Good idea !!
>>> 
>>> Regards
>>> JB
>>> 
>>> 
 On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
 
 Raising this onto the mailing list from
 https://issues.apache.org/jira/browse/BEAM-849
 
 The issue came up: what does it mean for a pipeline to finish, in the
>> Beam
 model?
 
 Note that I am deliberately not talking about "batch" and "streaming"
 pipelines, because this distinction does not exist in the model.
>>> Several
 runners have batch/streaming *modes*, which implement the same
>>> semantics
 (potentially different subsets: in batch mode typically a runner will
 reject pipelines that have at least one unbounded PCollection) but
>>> in an
 operationally 

Re: Build failed in Jenkins: beam_SeedJob #214

2017-04-18 Thread Jason Kuster
Yup -- it looks like we're going to need reapproval when we change our
jobs[1].

[1]
https://github.com/jenkinsci/job-dsl-plugin/wiki/Script-Security#script-approval

On Tue, Apr 18, 2017 at 10:20 AM, Ted Yu  wrote:

> Thanks Jason for the effort.
> Looks like we hit this:
>
> ERROR: script not yet approved for use
>
>
> On Tue, Apr 18, 2017 at 10:16 AM, Jason Kuster <
> jasonkus...@google.com.invalid> wrote:
>
> > I'm looking into this currently as well; that's one of the mitigations
> I'm
> > considering too but I'm giving the evaluate thing a try[1][2] (once it
> > starts running -- executors are full currently).
> >
> > [1] https://builds.apache.org/view/Beam/job/beam_SeedJob/215/
> > [2] https://github.com/apache/beam/pull/2578
> >
> > On Tue, Apr 18, 2017 at 10:12 AM, Ted Yu  wrote:
> >
> > > To unblock the builds, how about embedding functions used by respective
> > > scripts in the scripts themselves ?
> > >
> > > e.g. buildPerformanceTest is only used by the following scripts:
> > >
> > > .test-infra/jenkins/job_beam_PerformanceTests_Dataflow.groovy:
> > >  common_job_properties.buildPerformanceTest(delegate, argMap)
> > > .test-infra/jenkins/job_beam_PerformanceTests_JDBC.groovy:
> > >  common_job_properties.buildPerformanceTest(delegate, argMap)
> > > .test-infra/jenkins/job_beam_PerformanceTests_Spark.groovy:
> > >  common_job_properties.buildPerformanceTest(delegate, argMap)
> > >
> > > On Tue, Apr 18, 2017 at 10:05 AM, Davor Bonaci 
> wrote:
> > >
> > > > Not so simple, unfortunately [1]. Ideas welcome ;-)
> > > >
> > > > Davor
> > > >
> > > > [1]
> > > > https://github.com/jenkinsci/job-dsl-plugin/wiki/Migration#
> > > > migrating-to-160
> > > >
> > > > On Tue, Apr 18, 2017 at 9:57 AM, Ted Yu  wrote:
> > > >
> > > > > I wonder if we should adopt the suggestion here (involving
> evaluate):
> > > > > http://stackoverflow.com/questions/9136328/including-a-
> > > > > groovy-script-in-another-groovy
> > > > >
> > > > > On Tue, Apr 18, 2017 at 9:45 AM, Apache Jenkins Server <
> > > > > jenk...@builds.apache.org> wrote:
> > > > >
> > > > > > See  > > > > > redirect?page=changes>
> > > > > >
> > > > > > Changes:
> > > > > >
> > > > > > [jbonofre] [BEAM-59] Register standard FileSystems wherever we
> > > register
> > > > > >
> > > > > > [iemejia] Enable flink dependency enforcement and make
> dependencies
> > > > > > explicit
> > > > > >
> > > > > > [iemejia] Fix Javadoc warnings on Flink Runner
> > > > > >
> > > > > > [iemejia] Remove flink-annotations dependency
> > > > > >
> > > > > > [iemejia] [BEAM-1993] Remove special unbounded Flink source/sink
> > > > > >
> > > > > > [tgroh] Translate PTransforms to and from Runner API Protos
> > > > > >
> > > > > > [altay] Clean up DirectRunner Clock and TransformResult
> > > > > >
> > > > > > [altay] Remove overloading of __call__ in DirectRunner
> > > > > >
> > > > > > --
> > > > > > [...truncated 202.75 KB...]
> > > > > >  x [deleted] (none) -> origin/pr/902/merge
> > > > > >  x [deleted] (none) -> origin/pr/903/head
> > > > > >  x [deleted] (none) -> origin/pr/903/merge
> > > > > >  x [deleted] (none) -> origin/pr/904/head
> > > > > >  x [deleted] (none) -> origin/pr/904/merge
> > > > > >  x [deleted] (none) -> origin/pr/905/head
> > > > > >  x [deleted] (none) -> origin/pr/905/merge
> > > > > >  x [deleted] (none) -> origin/pr/906/head
> > > > > >  x [deleted] (none) -> origin/pr/906/merge
> > > > > >  x [deleted] (none) -> origin/pr/907/head
> > > > > >  x [deleted] (none) -> origin/pr/907/merge
> > > > > >  x [deleted] (none) -> origin/pr/908/head
> > > > > >  x [deleted] (none) -> origin/pr/909/head
> > > > > >  x [deleted] (none) -> origin/pr/909/merge
> > > > > >  x [deleted] (none) -> origin/pr/91/head
> > > > > >  x [deleted] (none) -> origin/pr/91/merge
> > > > > >  x [deleted] (none) -> origin/pr/910/head
> > > > > >  x [deleted] (none) -> origin/pr/911/head
> > > > > >  x [deleted] (none) -> origin/pr/911/merge
> > > > > >  x [deleted] (none) -> origin/pr/912/head
> > > > > >  x [deleted] (none) -> origin/pr/912/merge
> > > > > >  x [deleted] (none) -> origin/pr/913/head
> > > > > >  x [deleted] (none) -> origin/pr/913/merge
> > > > > >  x [deleted] (none) -> origin/pr/914/head
> > > > > >  x [deleted] (none) -> origin/pr/914/merge
> > > > > >  x [deleted] (none) -> origin/pr/915/head
> > > > > >  x [deleted] (none) -> origin/pr/915/merge
> > > > > >  x [deleted] (none) -> origin/pr/916/head
> > > > > >  x [deleted] (none) -> 

Re: Build failed in Jenkins: beam_SeedJob #214

2017-04-18 Thread Ted Yu
Thanks Jason for the effort.
Looks like we hit this:

ERROR: script not yet approved for use


On Tue, Apr 18, 2017 at 10:16 AM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:

> I'm looking into this currently as well; that's one of the mitigations I'm
> considering too but I'm giving the evaluate thing a try[1][2] (once it
> starts running -- executors are full currently).
>
> [1] https://builds.apache.org/view/Beam/job/beam_SeedJob/215/
> [2] https://github.com/apache/beam/pull/2578
>
> On Tue, Apr 18, 2017 at 10:12 AM, Ted Yu  wrote:
>
> > To unblock the builds, how about embedding functions used by respective
> > scripts in the scripts themselves ?
> >
> > e.g. buildPerformanceTest is only used by the following scripts:
> >
> > .test-infra/jenkins/job_beam_PerformanceTests_Dataflow.groovy:
> >  common_job_properties.buildPerformanceTest(delegate, argMap)
> > .test-infra/jenkins/job_beam_PerformanceTests_JDBC.groovy:
> >  common_job_properties.buildPerformanceTest(delegate, argMap)
> > .test-infra/jenkins/job_beam_PerformanceTests_Spark.groovy:
> >  common_job_properties.buildPerformanceTest(delegate, argMap)
> >
> > On Tue, Apr 18, 2017 at 10:05 AM, Davor Bonaci  wrote:
> >
> > > Not so simple, unfortunately [1]. Ideas welcome ;-)
> > >
> > > Davor
> > >
> > > [1]
> > > https://github.com/jenkinsci/job-dsl-plugin/wiki/Migration#
> > > migrating-to-160
> > >
> > > On Tue, Apr 18, 2017 at 9:57 AM, Ted Yu  wrote:
> > >
> > > > I wonder if we should adopt the suggestion here (involving evaluate):
> > > > http://stackoverflow.com/questions/9136328/including-a-
> > > > groovy-script-in-another-groovy
> > > >
> > > > On Tue, Apr 18, 2017 at 9:45 AM, Apache Jenkins Server <
> > > > jenk...@builds.apache.org> wrote:
> > > >
> > > > > See  > > > > redirect?page=changes>
> > > > >
> > > > > Changes:
> > > > >
> > > > > [jbonofre] [BEAM-59] Register standard FileSystems wherever we
> > register
> > > > >
> > > > > [iemejia] Enable flink dependency enforcement and make dependencies
> > > > > explicit
> > > > >
> > > > > [iemejia] Fix Javadoc warnings on Flink Runner
> > > > >
> > > > > [iemejia] Remove flink-annotations dependency
> > > > >
> > > > > [iemejia] [BEAM-1993] Remove special unbounded Flink source/sink
> > > > >
> > > > > [tgroh] Translate PTransforms to and from Runner API Protos
> > > > >
> > > > > [altay] Clean up DirectRunner Clock and TransformResult
> > > > >
> > > > > [altay] Remove overloading of __call__ in DirectRunner
> > > > >
> > > > > --
> > > > > [...truncated 202.75 KB...]
> > > > >  x [deleted] (none) -> origin/pr/902/merge
> > > > >  x [deleted] (none) -> origin/pr/903/head
> > > > >  x [deleted] (none) -> origin/pr/903/merge
> > > > >  x [deleted] (none) -> origin/pr/904/head
> > > > >  x [deleted] (none) -> origin/pr/904/merge
> > > > >  x [deleted] (none) -> origin/pr/905/head
> > > > >  x [deleted] (none) -> origin/pr/905/merge
> > > > >  x [deleted] (none) -> origin/pr/906/head
> > > > >  x [deleted] (none) -> origin/pr/906/merge
> > > > >  x [deleted] (none) -> origin/pr/907/head
> > > > >  x [deleted] (none) -> origin/pr/907/merge
> > > > >  x [deleted] (none) -> origin/pr/908/head
> > > > >  x [deleted] (none) -> origin/pr/909/head
> > > > >  x [deleted] (none) -> origin/pr/909/merge
> > > > >  x [deleted] (none) -> origin/pr/91/head
> > > > >  x [deleted] (none) -> origin/pr/91/merge
> > > > >  x [deleted] (none) -> origin/pr/910/head
> > > > >  x [deleted] (none) -> origin/pr/911/head
> > > > >  x [deleted] (none) -> origin/pr/911/merge
> > > > >  x [deleted] (none) -> origin/pr/912/head
> > > > >  x [deleted] (none) -> origin/pr/912/merge
> > > > >  x [deleted] (none) -> origin/pr/913/head
> > > > >  x [deleted] (none) -> origin/pr/913/merge
> > > > >  x [deleted] (none) -> origin/pr/914/head
> > > > >  x [deleted] (none) -> origin/pr/914/merge
> > > > >  x [deleted] (none) -> origin/pr/915/head
> > > > >  x [deleted] (none) -> origin/pr/915/merge
> > > > >  x [deleted] (none) -> origin/pr/916/head
> > > > >  x [deleted] (none) -> origin/pr/916/merge
> > > > >  x [deleted] (none) -> origin/pr/917/head
> > > > >  x [deleted] (none) -> origin/pr/917/merge
> > > > >  x [deleted] (none) -> origin/pr/918/head
> > > > >  x [deleted] (none) -> origin/pr/918/merge
> > > > >  x [deleted] (none) -> origin/pr/919/head
> > > > >  x [deleted] (none) -> origin/pr/919/merge
> > > > >  x [deleted] (none)

Re: Build failed in Jenkins: beam_SeedJob #214

2017-04-18 Thread Jason Kuster
I'm looking into this currently as well; that's one of the mitigations I'm
considering too but I'm giving the evaluate thing a try[1][2] (once it
starts running -- executors are full currently).

[1] https://builds.apache.org/view/Beam/job/beam_SeedJob/215/
[2] https://github.com/apache/beam/pull/2578

On Tue, Apr 18, 2017 at 10:12 AM, Ted Yu  wrote:

> To unblock the builds, how about embedding functions used by respective
> scripts in the scripts themselves ?
>
> e.g. buildPerformanceTest is only used by the following scripts:
>
> .test-infra/jenkins/job_beam_PerformanceTests_Dataflow.groovy:
>  common_job_properties.buildPerformanceTest(delegate, argMap)
> .test-infra/jenkins/job_beam_PerformanceTests_JDBC.groovy:
>  common_job_properties.buildPerformanceTest(delegate, argMap)
> .test-infra/jenkins/job_beam_PerformanceTests_Spark.groovy:
>  common_job_properties.buildPerformanceTest(delegate, argMap)
>
> On Tue, Apr 18, 2017 at 10:05 AM, Davor Bonaci  wrote:
>
> > Not so simple, unfortunately [1]. Ideas welcome ;-)
> >
> > Davor
> >
> > [1]
> > https://github.com/jenkinsci/job-dsl-plugin/wiki/Migration#
> > migrating-to-160
> >
> > On Tue, Apr 18, 2017 at 9:57 AM, Ted Yu  wrote:
> >
> > > I wonder if we should adopt the suggestion here (involving evaluate):
> > > http://stackoverflow.com/questions/9136328/including-a-
> > > groovy-script-in-another-groovy
> > >
> > > On Tue, Apr 18, 2017 at 9:45 AM, Apache Jenkins Server <
> > > jenk...@builds.apache.org> wrote:
> > >
> > > > See  > > > redirect?page=changes>
> > > >
> > > > Changes:
> > > >
> > > > [jbonofre] [BEAM-59] Register standard FileSystems wherever we
> register
> > > >
> > > > [iemejia] Enable flink dependency enforcement and make dependencies
> > > > explicit
> > > >
> > > > [iemejia] Fix Javadoc warnings on Flink Runner
> > > >
> > > > [iemejia] Remove flink-annotations dependency
> > > >
> > > > [iemejia] [BEAM-1993] Remove special unbounded Flink source/sink
> > > >
> > > > [tgroh] Translate PTransforms to and from Runner API Protos
> > > >
> > > > [altay] Clean up DirectRunner Clock and TransformResult
> > > >
> > > > [altay] Remove overloading of __call__ in DirectRunner
> > > >
> > > > --
> > > > [...truncated 202.75 KB...]
> > > >  x [deleted] (none) -> origin/pr/902/merge
> > > >  x [deleted] (none) -> origin/pr/903/head
> > > >  x [deleted] (none) -> origin/pr/903/merge
> > > >  x [deleted] (none) -> origin/pr/904/head
> > > >  x [deleted] (none) -> origin/pr/904/merge
> > > >  x [deleted] (none) -> origin/pr/905/head
> > > >  x [deleted] (none) -> origin/pr/905/merge
> > > >  x [deleted] (none) -> origin/pr/906/head
> > > >  x [deleted] (none) -> origin/pr/906/merge
> > > >  x [deleted] (none) -> origin/pr/907/head
> > > >  x [deleted] (none) -> origin/pr/907/merge
> > > >  x [deleted] (none) -> origin/pr/908/head
> > > >  x [deleted] (none) -> origin/pr/909/head
> > > >  x [deleted] (none) -> origin/pr/909/merge
> > > >  x [deleted] (none) -> origin/pr/91/head
> > > >  x [deleted] (none) -> origin/pr/91/merge
> > > >  x [deleted] (none) -> origin/pr/910/head
> > > >  x [deleted] (none) -> origin/pr/911/head
> > > >  x [deleted] (none) -> origin/pr/911/merge
> > > >  x [deleted] (none) -> origin/pr/912/head
> > > >  x [deleted] (none) -> origin/pr/912/merge
> > > >  x [deleted] (none) -> origin/pr/913/head
> > > >  x [deleted] (none) -> origin/pr/913/merge
> > > >  x [deleted] (none) -> origin/pr/914/head
> > > >  x [deleted] (none) -> origin/pr/914/merge
> > > >  x [deleted] (none) -> origin/pr/915/head
> > > >  x [deleted] (none) -> origin/pr/915/merge
> > > >  x [deleted] (none) -> origin/pr/916/head
> > > >  x [deleted] (none) -> origin/pr/916/merge
> > > >  x [deleted] (none) -> origin/pr/917/head
> > > >  x [deleted] (none) -> origin/pr/917/merge
> > > >  x [deleted] (none) -> origin/pr/918/head
> > > >  x [deleted] (none) -> origin/pr/918/merge
> > > >  x [deleted] (none) -> origin/pr/919/head
> > > >  x [deleted] (none) -> origin/pr/919/merge
> > > >  x [deleted] (none) -> origin/pr/92/head
> > > >  x [deleted] (none) -> origin/pr/92/merge
> > > >  x [deleted] (none) -> origin/pr/920/head
> > > >  x [deleted] (none) -> origin/pr/920/merge
> > > >  x [deleted] (none) -> origin/pr/921/head
> > > >  x [deleted] (none) -> origin/pr/921/merge
> > > >  x [deleted] (none) -> 

Re: Build failed in Jenkins: beam_SeedJob #214

2017-04-18 Thread Ted Yu
To unblock the builds, how about embedding the functions used by the respective
scripts in the scripts themselves?

e.g. buildPerformanceTest is only used by the following scripts:

.test-infra/jenkins/job_beam_PerformanceTests_Dataflow.groovy:
 common_job_properties.buildPerformanceTest(delegate, argMap)
.test-infra/jenkins/job_beam_PerformanceTests_JDBC.groovy:
 common_job_properties.buildPerformanceTest(delegate, argMap)
.test-infra/jenkins/job_beam_PerformanceTests_Spark.groovy:
 common_job_properties.buildPerformanceTest(delegate, argMap)

On Tue, Apr 18, 2017 at 10:05 AM, Davor Bonaci  wrote:

> Not so simple, unfortunately [1]. Ideas welcome ;-)
>
> Davor
>
> [1]
> https://github.com/jenkinsci/job-dsl-plugin/wiki/Migration#
> migrating-to-160
>
> On Tue, Apr 18, 2017 at 9:57 AM, Ted Yu  wrote:
>
> > I wonder if we should adopt the suggestion here (involving evaluate):
> > http://stackoverflow.com/questions/9136328/including-a-
> > groovy-script-in-another-groovy
> >
> > On Tue, Apr 18, 2017 at 9:45 AM, Apache Jenkins Server <
> > jenk...@builds.apache.org> wrote:
> >
> > > See  > > redirect?page=changes>
> > >
> > > Changes:
> > >
> > > [jbonofre] [BEAM-59] Register standard FileSystems wherever we register
> > >
> > > [iemejia] Enable flink dependency enforcement and make dependencies
> > > explicit
> > >
> > > [iemejia] Fix Javadoc warnings on Flink Runner
> > >
> > > [iemejia] Remove flink-annotations dependency
> > >
> > > [iemejia] [BEAM-1993] Remove special unbounded Flink source/sink
> > >
> > > [tgroh] Translate PTransforms to and from Runner API Protos
> > >
> > > [altay] Clean up DirectRunner Clock and TransformResult
> > >
> > > [altay] Remove overloading of __call__ in DirectRunner
> > >
> > > --
> > > [...truncated 202.75 KB...]
> > >  x [deleted] (none) -> origin/pr/902/merge
> > >  x [deleted] (none) -> origin/pr/903/head
> > >  x [deleted] (none) -> origin/pr/903/merge
> > >  x [deleted] (none) -> origin/pr/904/head
> > >  x [deleted] (none) -> origin/pr/904/merge
> > >  x [deleted] (none) -> origin/pr/905/head
> > >  x [deleted] (none) -> origin/pr/905/merge
> > >  x [deleted] (none) -> origin/pr/906/head
> > >  x [deleted] (none) -> origin/pr/906/merge
> > >  x [deleted] (none) -> origin/pr/907/head
> > >  x [deleted] (none) -> origin/pr/907/merge
> > >  x [deleted] (none) -> origin/pr/908/head
> > >  x [deleted] (none) -> origin/pr/909/head
> > >  x [deleted] (none) -> origin/pr/909/merge
> > >  x [deleted] (none) -> origin/pr/91/head
> > >  x [deleted] (none) -> origin/pr/91/merge
> > >  x [deleted] (none) -> origin/pr/910/head
> > >  x [deleted] (none) -> origin/pr/911/head
> > >  x [deleted] (none) -> origin/pr/911/merge
> > >  x [deleted] (none) -> origin/pr/912/head
> > >  x [deleted] (none) -> origin/pr/912/merge
> > >  x [deleted] (none) -> origin/pr/913/head
> > >  x [deleted] (none) -> origin/pr/913/merge
> > >  x [deleted] (none) -> origin/pr/914/head
> > >  x [deleted] (none) -> origin/pr/914/merge
> > >  x [deleted] (none) -> origin/pr/915/head
> > >  x [deleted] (none) -> origin/pr/915/merge
> > >  x [deleted] (none) -> origin/pr/916/head
> > >  x [deleted] (none) -> origin/pr/916/merge
> > >  x [deleted] (none) -> origin/pr/917/head
> > >  x [deleted] (none) -> origin/pr/917/merge
> > >  x [deleted] (none) -> origin/pr/918/head
> > >  x [deleted] (none) -> origin/pr/918/merge
> > >  x [deleted] (none) -> origin/pr/919/head
> > >  x [deleted] (none) -> origin/pr/919/merge
> > >  x [deleted] (none) -> origin/pr/92/head
> > >  x [deleted] (none) -> origin/pr/92/merge
> > >  x [deleted] (none) -> origin/pr/920/head
> > >  x [deleted] (none) -> origin/pr/920/merge
> > >  x [deleted] (none) -> origin/pr/921/head
> > >  x [deleted] (none) -> origin/pr/921/merge
> > >  x [deleted] (none) -> origin/pr/922/head
> > >  x [deleted] (none) -> origin/pr/922/merge
> > >  x [deleted] (none) -> origin/pr/923/head
> > >  x [deleted] (none) -> origin/pr/924/head
> > >  x [deleted] (none) -> origin/pr/925/head
> > >  x [deleted] (none) -> origin/pr/925/merge
> > >  x [deleted] (none) -> origin/pr/926/head
> > >  x [deleted] (none) -> origin/pr/926/merge
> > >  x [deleted] (none) -> origin/pr/927/head
> > >  x [deleted] (none) -> origin/pr/927/merge
> > >  x 

Re: Build failed in Jenkins: beam_SeedJob #214

2017-04-18 Thread Davor Bonaci
Not so simple, unfortunately [1]. Ideas welcome ;-)

Davor

[1]
https://github.com/jenkinsci/job-dsl-plugin/wiki/Migration#migrating-to-160

On Tue, Apr 18, 2017 at 9:57 AM, Ted Yu  wrote:

> I wonder if we should adopt the suggestion here (involving evaluate):
> http://stackoverflow.com/questions/9136328/including-a-
> groovy-script-in-another-groovy
>
> On Tue, Apr 18, 2017 at 9:45 AM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
> > See  > redirect?page=changes>
> >
> > Changes:
> >
> > [jbonofre] [BEAM-59] Register standard FileSystems wherever we register
> >
> > [iemejia] Enable flink dependency enforcement and make dependencies
> > explicit
> >
> > [iemejia] Fix Javadoc warnings on Flink Runner
> >
> > [iemejia] Remove flink-annotations dependency
> >
> > [iemejia] [BEAM-1993] Remove special unbounded Flink source/sink
> >
> > [tgroh] Translate PTransforms to and from Runner API Protos
> >
> > [altay] Clean up DirectRunner Clock and TransformResult
> >
> > [altay] Remove overloading of __call__ in DirectRunner
> >
> > --
> > [...truncated 202.75 KB...]
> >  x [deleted] (none) -> origin/pr/902/merge
> >  x [deleted] (none) -> origin/pr/903/head
> >  x [deleted] (none) -> origin/pr/903/merge
> >  x [deleted] (none) -> origin/pr/904/head
> >  x [deleted] (none) -> origin/pr/904/merge
> >  x [deleted] (none) -> origin/pr/905/head
> >  x [deleted] (none) -> origin/pr/905/merge
> >  x [deleted] (none) -> origin/pr/906/head
> >  x [deleted] (none) -> origin/pr/906/merge
> >  x [deleted] (none) -> origin/pr/907/head
> >  x [deleted] (none) -> origin/pr/907/merge
> >  x [deleted] (none) -> origin/pr/908/head
> >  x [deleted] (none) -> origin/pr/909/head
> >  x [deleted] (none) -> origin/pr/909/merge
> >  x [deleted] (none) -> origin/pr/91/head
> >  x [deleted] (none) -> origin/pr/91/merge
> >  x [deleted] (none) -> origin/pr/910/head
> >  x [deleted] (none) -> origin/pr/911/head
> >  x [deleted] (none) -> origin/pr/911/merge
> >  x [deleted] (none) -> origin/pr/912/head
> >  x [deleted] (none) -> origin/pr/912/merge
> >  x [deleted] (none) -> origin/pr/913/head
> >  x [deleted] (none) -> origin/pr/913/merge
> >  x [deleted] (none) -> origin/pr/914/head
> >  x [deleted] (none) -> origin/pr/914/merge
> >  x [deleted] (none) -> origin/pr/915/head
> >  x [deleted] (none) -> origin/pr/915/merge
> >  x [deleted] (none) -> origin/pr/916/head
> >  x [deleted] (none) -> origin/pr/916/merge
> >  x [deleted] (none) -> origin/pr/917/head
> >  x [deleted] (none) -> origin/pr/917/merge
> >  x [deleted] (none) -> origin/pr/918/head
> >  x [deleted] (none) -> origin/pr/918/merge
> >  x [deleted] (none) -> origin/pr/919/head
> >  x [deleted] (none) -> origin/pr/919/merge
> >  x [deleted] (none) -> origin/pr/92/head
> >  x [deleted] (none) -> origin/pr/92/merge
> >  x [deleted] (none) -> origin/pr/920/head
> >  x [deleted] (none) -> origin/pr/920/merge
> >  x [deleted] (none) -> origin/pr/921/head
> >  x [deleted] (none) -> origin/pr/921/merge
> >  x [deleted] (none) -> origin/pr/922/head
> >  x [deleted] (none) -> origin/pr/922/merge
> >  x [deleted] (none) -> origin/pr/923/head
> >  x [deleted] (none) -> origin/pr/924/head
> >  x [deleted] (none) -> origin/pr/925/head
> >  x [deleted] (none) -> origin/pr/925/merge
> >  x [deleted] (none) -> origin/pr/926/head
> >  x [deleted] (none) -> origin/pr/926/merge
> >  x [deleted] (none) -> origin/pr/927/head
> >  x [deleted] (none) -> origin/pr/927/merge
> >  x [deleted] (none) -> origin/pr/928/head
> >  x [deleted] (none) -> origin/pr/929/head
> >  x [deleted] (none) -> origin/pr/93/head
> >  x [deleted] (none) -> origin/pr/930/head
> >  x [deleted] (none) -> origin/pr/930/merge
> >  x [deleted] (none) -> origin/pr/931/head
> >  x [deleted] (none) -> origin/pr/931/merge
> >  x [deleted] (none) -> origin/pr/932/head
> >  x [deleted] (none) -> origin/pr/932/merge
> >  x [deleted] (none) -> origin/pr/933/head
> >  x [deleted] (none) -> origin/pr/933/merge
> >  x [deleted] (none) -> origin/pr/934/head
> >  x [deleted] (none) -> origin/pr/934/merge
> >  x [deleted] (none) -> 

Re: Read/Write Transform Documentation

2017-04-18 Thread Stephen Sisk
Hi Andrew,

I'm excited to hear you're working on an I/O - I'd love to hear any
feedback about the docs we've got written so far. Sorry they're in a
partially completed state.

Are you looking to develop in Python or Java? There are more specific docs
for Python available in the Python SDK guide at
https://beam.apache.org/documentation/sdks/python/ (see "creating new
sources and sinks") - part of the TODO is to port that content over. :)

I would strongly recommend checking out the PTransform style guide -
https://beam.apache.org/contribute/ptransform-style-guide/ (if that didn't
catch your eye as important to read, I should probably make it more obvious
since it's got a lot of very relevant information.)

The other piece that is in draft form and not on the website yet is the
testing guide - the draft is available at
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
- if you need any clarification or have questions, feel free to add a comment
there and we'll follow up.

You may also want to comment on
https://issues.apache.org/jira/browse/BEAM-1271 since that's the Jira issue
for development of Accumulo read/write transforms. Looking at that issue, I
don't believe it's in active development.

If you've got specific questions, sending them to this mailing list will
definitely get you some help.

S

On Tue, Apr 18, 2017 at 9:20 AM Andrew Jessup 
wrote:

> Good Morning Beam Devs!
>
> I was looking through the beam dev docs and saw a TODO in the authoring IO
> section. I was wondering if this had been completed or is located elsewhere
> on the website.
> (https://beam.apache.org/documentation/io/authoring-overview/)
>
> I am working on making IO for Apache Accumulo and I just wanted to make
> sure I am going about things the right way. Lastly, if there are any direct
> POCs for this topic specifically, that information would be greatly
> appreciated.
>
> Andrew
>
> --
> From the Office of
>
> Andrew Jessup
> Laurel,MD
>


Read/Write Transform Documentation

2017-04-18 Thread Andrew Jessup
Good Morning Beam Devs!

I was looking through the beam dev docs and saw a TODO in the authoring IO
section. I was wondering if this had been completed or is located elsewhere
on the website.
(https://beam.apache.org/documentation/io/authoring-overview/)

I am working on making IO for Apache Accumulo and I just wanted to make
sure I am going about things the right way. Lastly, if there are any direct
POCs for this topic specifically, that information would be greatly
appreciated.

Andrew

-- 
From the Office of

Andrew Jessup
Laurel,MD


Re: Naming of Combine.Globally

2017-04-18 Thread Eugene Kirpichov
...Curiously enough, ReduceFn is by far the closest of all these to a
sequential fold. It is also internal (runner-facing rather than
user-facing).

On Tue, Apr 18, 2017 at 8:27 AM Dan Halperin 
wrote:

> Great discussion! As Aljoscha says, Fold, Reduce, and Combine are all
> intertwined and not quite identical as we use them.
>
> Another simple but perhaps coy answer is that if you read the MapReduce
> paper by Dean and Ghemawat that started this all, they used "Map",
> "Reduce", and "Combine" (see section 4.3:
> https://research.google.com/archive/mapreduce.html)
>
> So then it's likely just the lineage of Beam as "evolving from MapReduce"
> :). [Looking around the source tree: we have MapElements, ReduceFn, and
> Combine. And the DataflowRunner has Shuffle inside of GroupByKey. ;)]
>
> Dan
>
> On Tue, Apr 18, 2017 at 3:16 AM, Aljoscha Krettek 
> wrote:
>
> > The definition of foldl in Haskell is the same as the description I gave
> > earlier:
> >
> > foldl :: (a -> b -> a) -> a -> [b] -> a
> >
> > The function (a -> b -> a) is what I described as (T, A) -> A and it’s
> > used to fold a list of b’s into an a (the accumulator type).
> >
> > You’re right that the mapping AccumT->OutputT is not important and could
> > be delegated to a separate method. The important part of the interface is
> > mergeAccumulators() since this makes the operation distributive: we can
> > “fold” a bunch of Ts into As in parallel (even on different machines) and
> > then merge them together. This is what is missing from a functional fold.
> >
> > Best,
> > Aljoscha
> >
> >
> > > On 18. Apr 2017, at 12:03, Wesley Tanaka 
> > wrote:
> > >
> > > I believe that foldl in Haskell https://www.haskell.org/
> > hoogle/?hoogle=foldl admits a separate accumulator type from the type of
> > the data structure being "folded"
> > > And, well, python lets you have your way with mixing types, but this
> > certainly works as another example:python -c "print(reduce(lambda ac,
> elem:
> > '%s%d' % (ac,elem), [1,2,3,4,5], ''))"
> > > Is there anything special about the AccumT->OutputT conversion that
> > extractOutput() needs to be in the same interface as createAccumulator(),
> > addInput() and mergeAccumulators()?  If the interface were segregated
> such
> > that one interface managed the InputT->AccumT conversion, and the second
> > managed the AccumT->OutputT conversion, it seems like maybe the
> > AccumT->OutputT conversion could even get replaced with MapElements?  And
> > then the full current "Combine" functionality could be implemented as a
> > composition of the lower-level primitives?
> > > I haven't dug that deeply into Combine yet, so I may be missing
> > something obvious.
> > > ---
> > > Wesley Tanaka
> > > https://wtanaka.com/
> > >
> > > On Monday, April 17, 2017, 11:32:29 PM HST, Aljoscha Krettek <
> > aljos...@apache.org> wrote:Hi,
> > I think both fold and reduce fail to capture all the power of (what we
> > call) combine. Reduce requires a function of type (T, T) -> T. It
> requires
> > that the output type be the same as the input type. Fold takes a function
> > (T, A) -> A where T is the input type and A is the accumulation type.
> Here,
> > the output type can be different from the input type. However, there is
> no
> > way of combining these aggregators so the operation is not distributive,
> > i.e. we cannot hierarchically apply the operation.
> > >
> > > Combine is the generalisation of this: We have three types, T (input),
> A
> > (accumulator), O (output) and we require a function that can merge
> > accumulators. The operation is distributive, meaning we can efficiently
> > execute it and we can also have an output type that is different from the
> > input type.
> > >
> > > Quick FYI: in Flink the CombineFn is called AggregatingFunction and
> > CombiningState is AggregatingState.
> > >
> > > Best,
> > > Aljoscha
> > >> On 18. Apr 2017, at 04:29, Wesley Tanaka 
> > wrote:
> > >>
> > >> As I start to understand Combine.Globally, it seems that it is, in
> > spirit, Beam's implementation of the "fold" higher-order function
> > >> https://en.wikipedia.org/wiki/Fold_(higher-order_function)#
> > Folds_in_various_languages
> > >>
> > >> Was there a reason the word "combine" was picked instead of either
> > "fold" or "reduce"?  From the wikipedia list above, it seems as though
> > "fold" and "reduce" are in much more common usage, so either of those
> might
> > be easier for newcomers to understand.
> > >> ---
> > >> Wesley Tanaka
> > >> http://wtanaka.com/
> >
> >
>


Re: Naming of Combine.Globally

2017-04-18 Thread Dan Halperin
Great discussion! As Aljoscha says, Fold, Reduce, and Combine are all
intertwined and not quite identical as we use them.

Another simple but perhaps coy answer is that if you read the MapReduce
paper by Dean and Ghemawat that started this all, they used "Map",
"Reduce", and "Combine" (see section 4.3:
https://research.google.com/archive/mapreduce.html)

So then it's likely just the lineage of Beam as "evolving from MapReduce"
:). [Looking around the source tree: we have MapElements, ReduceFn, and
Combine. And the DataflowRunner has Shuffle inside of GroupByKey. ;)]

Dan

On Tue, Apr 18, 2017 at 3:16 AM, Aljoscha Krettek 
wrote:

> The definition of foldl in Haskell is the same as the description I gave
> earlier:
>
> foldl :: (a -> b -> a) -> a -> [b] -> a
>
> The function (a -> b -> a) is what I described as (T, A) -> A and it’s
> used to fold a list of b’s into an a (the accumulator type).
>
> You’re right that the mapping AccumT->OutputT is not important and could
> be delegated to a separate method. The important part of the interface is
> mergeAccumulators() since this makes the operation distributive: we can
> “fold” a bunch of Ts into As in parallel (even on different machines) and
> then merge them together. This is what is missing from a functional fold.
>
> Best,
> Aljoscha
>
>
> > On 18. Apr 2017, at 12:03, Wesley Tanaka 
> wrote:
> >
> > I believe that foldl in Haskell https://www.haskell.org/
> hoogle/?hoogle=foldl admits a separate accumulator type from the type of
> the data structure being "folded"
> > And, well, python lets you have your way with mixing types, but this
> certainly works as another example:python -c "print(reduce(lambda ac, elem:
> '%s%d' % (ac,elem), [1,2,3,4,5], ''))"
> > Is there anything special about the AccumT->OutputT conversion that
> extractOutput() needs to be in the same interface as createAccumulator(),
> addInput() and mergeAccumulators()?  If the interface were segregated such
> that one interface managed the InputT->AccumT conversion, and the second
> managed the AccumT->OutputT conversion, it seems like maybe the
> AccumT->OutputT conversion could even get replaced with MapElements?  And
> then the full current "Combine" functionality could be implemented as a
> composition of the lower-level primitives?
> > I haven't dug that deeply into Combine yet, so I may be missing
> something obvious.
> > ---
> > Wesley Tanaka
> > https://wtanaka.com/
> >
> > On Monday, April 17, 2017, 11:32:29 PM HST, Aljoscha Krettek <
> aljos...@apache.org> wrote:Hi,
> > I think both fold and reduce fail to capture all the power of (what we
> call) combine. Reduce requires a function of type (T, T) -> T. It requires
> that the output type be the same as the input type. Fold takes a function
> (T, A) -> A where T is the input type and A is the accumulation type. Here,
> the output type can be different from the input type. However, there is no
> way of combining these aggregators so the operation is not distributive,
> i.e. we cannot hierarchically apply the operation.
> >
> > Combine is the generalisation of this: We have three types, T (input), A
> (accumulator), O (output) and we require a function that can merge
> accumulators. The operation is distributive, meaning we can efficiently
> execute it and we can also have an output type that is different from the
> input type.
> >
> > Quick FYI: in Flink the CombineFn is called AggregatingFunction and
> CombiningState is AggregatingState.
> >
> > Best,
> > Aljoscha
> >> On 18. Apr 2017, at 04:29, Wesley Tanaka 
> wrote:
> >>
> >> As I start to understand Combine.Globally, it seems that it is, in
> spirit, Beam's implementation of the "fold" higher-order function
> >> https://en.wikipedia.org/wiki/Fold_(higher-order_function)#
> Folds_in_various_languages
> >>
> >> Was there a reason the word "combine" was picked instead of either
> "fold" or "reduce"?  From the wikipedia list above, it seems as though
> "fold" and "reduce" are in much more common usage, so either of those might
> be easier for newcomers to understand.
> >> ---
> >> Wesley Tanaka
> >> http://wtanaka.com/
>
>


Re: Pipeline termination in the unified Beam model

2017-04-18 Thread Stas Levin
Ted, the timeout is needed mostly for testing purposes.
AFAIK there is no easy way to express the fact that a source is "done" in a
native Spark streaming application.
Moreover, the Spark streaming "native" flow can either "awaitTermination()"
or "awaitTerminationOrTimeout(...)". If you "awaitTermination" then you're
blocked until the execution is either stopped or has failed, so if you wish
to stop the app sooner, say after a certain period of time,
"awaitTerminationOrTimeout(...)" may be the way to go.

Using the unified approach discussed in this thread, when a source is
"done" (i.e. the watermark is +Infinity) the app (i.e. the runner) would
stop gracefully.
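
For reference, the native Spark Streaming pattern being described is roughly
the following (a sketch only; the socket source and the one-minute timeout are
placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class AwaitWithTimeoutExample {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("await-example").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Placeholder job so the context has something to run.
    jssc.socketTextStream("localhost", 9999).print();

    jssc.start();
    // Block for at most one minute; awaitTermination() alone would block until
    // the job is stopped or fails, since a streaming source is never "done".
    boolean terminated = jssc.awaitTerminationOrTimeout(60_000);
    if (!terminated) {
      jssc.stop(true, true); // stopSparkContext = true, stopGracefully = true
    }
  }
}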



On Tue, Apr 18, 2017 at 3:19 PM Ted Yu  wrote:

> Why is the timeout needed for Spark ?
>
> Thanks
>
> > On Apr 18, 2017, at 3:05 AM, Etienne Chauchot 
> wrote:
> >
> > +1 on "runners really terminate in a timely manner to easily
> programmatically orchestrate Beam pipelines in a portable way, you do need
> to know whether
> > the pipeline will finish without thinking about the specific runner and
> its options"
> >
> > As an example, in Nexmark, we have streaming mode tests, and for the
> benchmark, we need all the queries to behave the same between runners
> towards termination.
> >
> > For now, to have the consistent behavior, in this mode we need to set a
> timeout (a bit random and flaky) on waitUntilFinish() for spark but this
> timeout is not needed for direct runner.
> >
> > Etienne
> >
> >> Le 02/03/2017 à 19:27, Kenneth Knowles a écrit :
> >> Isn't this already the case? I think semantically it is an unavoidable
> >> conclusion, so certainly +1 to that.
> >>
> >> The DirectRunner and TestDataflowRunner both have this behavior already.
> >> I've always considered that a streaming job running forever is just
> [very]
> >> suboptimal shutdown latency :-)
> >>
> >> Some bits of the discussion on the ticket seem to surround whether or
> how
> >> to communicate this property in a generic way. Since a runner owns its
> >> PipelineResult it doesn't seem necessary.
> >>
> >> So is the bottom line just that you want to more strongly insist that
> >> runners really terminate in a timely manner? I'm +1 to that, too, for
> >> basically the reason Stas gives: In order to easily programmatically
> >> orchestrate Beam pipelines in a portable way, you do need to know
> whether
> >> the pipeline will finish without thinking about the specific runner and
> its
> >> options (as with our RunnableOnService tests).
> >>
> >> Kenn
> >>
> >> On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin
> 
> >> wrote:
> >>
> >>> Note that even "unbounded pipeline in a streaming
> runner".waitUntilFinish()
> >>> can return, e.g., if you cancel it or terminate it. It's totally
> reasonable
> >>> for users to want to understand and handle these cases.
> >>>
> >>> +1
> >>>
> >>> Dan
> >>>
> >>> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
> >>> wrote:
> >>>
>  +1
> 
>  Good idea !!
> 
>  Regards
>  JB
> 
> 
> > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> >
> > Raising this onto the mailing list from
> > https://issues.apache.org/jira/browse/BEAM-849
> >
> > The issue came up: what does it mean for a pipeline to finish, in the
> >>> Beam
> > model?
> >
> > Note that I am deliberately not talking about "batch" and "streaming"
> > pipelines, because this distinction does not exist in the model.
> Several
> > runners have batch/streaming *modes*, which implement the same
> semantics
> > (potentially different subsets: in batch mode typically a runner will
> > reject pipelines that have at least one unbounded PCollection) but
> in an
> > operationally different way. However we should define pipeline
> >>> termination
> > at the level of the unified model, and then make sure that all
> runners
> >>> in
> > all modes implement that properly.
> >
> > One natural way is to say "a pipeline terminates when the output
> > watermarks
> > of all of its PCollection's progress to +infinity". (Note: this can
> be
> > generalized, I guess, to having partial executions of a pipeline: if
> > you're
> > interested in the full contents of only some collections, then you
> wait
> > until only the watermarks of those collections progress to infinity)
> >
> > A typical "batch" runner mode does not implement watermarks - we can
> >>> think
> > of it as assigning watermark -infinity to an output of a transform
> that
> > hasn't started executing yet, and +infinity to output of a transform
> >>> that
> > has finished executing. This is consistent with how such runners
> >>> implement
> > termination in practice.
> >
> > Dataflow streaming runner additionally implements such termination
> for
> > pipeline drain operation: it has 2 parts: 1) stop 

Re: Pipeline termination in the unified Beam model

2017-04-18 Thread Ted Yu
Why is the timeout needed for Spark ?

Thanks

> On Apr 18, 2017, at 3:05 AM, Etienne Chauchot  wrote:
> 
> +1 on "runners really terminate in a timely manner to easily programmatically 
> orchestrate Beam pipelines in a portable way, you do need to know whether
> the pipeline will finish without thinking about the specific runner and its 
> options"
> 
> As an example, in Nexmark, we have streaming mode tests, and for the 
> benchmark, we need all the queries to behave the same between runners towards 
> termination.
> 
> For now, to have the consistent behavior, in this mode we need to set a 
> timeout (a bit random and flaky) on waitUntilFinish() for spark but this 
> timeout is not needed for direct runner.
> 
> Etienne
> 
>> Le 02/03/2017 à 19:27, Kenneth Knowles a écrit :
>> Isn't this already the case? I think semantically it is an unavoidable
>> conclusion, so certainly +1 to that.
>> 
>> The DirectRunner and TestDataflowRunner both have this behavior already.
>> I've always considered that a streaming job running forever is just [very]
>> suboptimal shutdown latency :-)
>> 
>> Some bits of the discussion on the ticket seem to surround whether or how
>> to communicate this property in a generic way. Since a runner owns its
>> PipelineResult it doesn't seem necessary.
>> 
>> So is the bottom line just that you want to more strongly insist that
>> runners really terminate in a timely manner? I'm +1 to that, too, for
>> basically the reason Stas gives: In order to easily programmatically
>> orchestrate Beam pipelines in a portable way, you do need to know whether
>> the pipeline will finish without thinking about the specific runner and its
>> options (as with our RunnableOnService tests).
>> 
>> Kenn
>> 
>> On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin 
>> wrote:
>> 
>>> Note that even "unbounded pipeline in a streaming runner".waitUntilFinish()
>>> can return, e.g., if you cancel it or terminate it. It's totally reasonable
>>> for users to want to understand and handle these cases.
>>> 
>>> +1
>>> 
>>> Dan
>>> 
>>> On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
>>> wrote:
>>> 
 +1
 
 Good idea !!
 
 Regards
 JB
 
 
> On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> 
> Raising this onto the mailing list from
> https://issues.apache.org/jira/browse/BEAM-849
> 
> The issue came up: what does it mean for a pipeline to finish, in the
>>> Beam
> model?
> 
> Note that I am deliberately not talking about "batch" and "streaming"
> pipelines, because this distinction does not exist in the model. Several
> runners have batch/streaming *modes*, which implement the same semantics
> (potentially different subsets: in batch mode typically a runner will
> reject pipelines that have at least one unbounded PCollection) but in an
> operationally different way. However we should define pipeline
>>> termination
> at the level of the unified model, and then make sure that all runners
>>> in
> all modes implement that properly.
> 
> One natural way is to say "a pipeline terminates when the output
> watermarks
> of all of its PCollection's progress to +infinity". (Note: this can be
> generalized, I guess, to having partial executions of a pipeline: if
> you're
> interested in the full contents of only some collections, then you wait
> until only the watermarks of those collections progress to infinity)
> 
> A typical "batch" runner mode does not implement watermarks - we can
>>> think
> of it as assigning watermark -infinity to an output of a transform that
> hasn't started executing yet, and +infinity to output of a transform
>>> that
> has finished executing. This is consistent with how such runners
>>> implement
> termination in practice.
> 
> Dataflow streaming runner additionally implements such termination for
> pipeline drain operation: it has 2 parts: 1) stop consuming input from
>>> the
> sources, and 2) wait until all watermarks progress to infinity.
> 
> Let us fill the gap by making this part of the Beam model and declaring
> that all runners should implement this behavior. This will give nice
> properties, e.g.:
> - A pipeline that has only bounded collections can be run by any runner
>>> in
> any mode, with the same results and termination behavior (this is
>>> actually
> my motivating example for raising this issue is: I was running
>>> Splittable
> DoFn tests
>  src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
> with the streaming Dataflow runner - these tests produce only bounded
> collections - and noticed that they wouldn't terminate even though all
> data
> was processed)
> - It will be possible to implement pipelines that 

Re: Naming of Combine.Globally

2017-04-18 Thread Wesley Tanaka
I believe that foldl in Haskell https://www.haskell.org/hoogle/?hoogle=foldl 
admits a separate accumulator type from the type of the data structure being 
"folded"
And, well, Python lets you have your way with mixing types, but this certainly 
works as another example: python -c "print(reduce(lambda ac, elem: '%s%d' % 
(ac,elem), [1,2,3,4,5], ''))"
Is there anything special about the AccumT->OutputT conversion that requires 
extractOutput() to be in the same interface as createAccumulator(), 
addInput() and mergeAccumulators()?  If the interface were segregated such that 
one interface managed the InputT->AccumT conversion, and the second managed the 
AccumT->OutputT conversion, it seems like maybe the AccumT->OutputT conversion 
could even get replaced with MapElements?  And then the full current "Combine" 
functionality could be implemented as a composition of the lower-level 
primitives?
I haven't dug that deeply into Combine yet, so I may be missing something 
obvious.
---
Wesley Tanaka
https://wtanaka.com/

On Monday, April 17, 2017, 11:32:29 PM HST, Aljoscha Krettek 
 wrote:

Hi,
I think both fold and reduce fail to capture all the power of (what we call) 
combine. Reduce requires a function of type (T, T) -> T. It requires that the 
output type be the same as the input type. Fold takes a function (T, A) -> A 
where T is the input type and A is the accumulation type. Here, the output type 
can be different from the input type. However, there is no way of combining 
these aggregators so the operation is not distributive, i.e. we cannot 
hierarchically apply the operation.

Combine is the generalisation of this: We have three types, T (input), A 
(accumulator), O (output) and we require a function that can merge 
accumulators. The operation is distributive, meaning we can efficiently execute 
it and we can also have an output type that is different from the input type.

Quick FYI: in Flink the CombineFn is called AggregatingFunction and 
CombiningState is AggregatingState.

Best,
Aljoscha
> On 18. Apr 2017, at 04:29, Wesley Tanaka  wrote:
> 
> As I start to understand Combine.Globally, it seems that it is, in spirit, 
> Beam's implementation of the "fold" higher-order function
> https://en.wikipedia.org/wiki/Fold_(higher-order_function)#Folds_in_various_languages
> 
> Was there a reason the word "combine" was picked instead of either "fold" or 
> "reduce"?  From the wikipedia list above, it seems as though "fold" and 
> "reduce" are in much more common usage, so either of those might be easier 
> for newcomers to understand.
> ---
> Wesley Tanaka
> http://wtanaka.com/


Re: Pipeline termination in the unified Beam model

2017-04-18 Thread Etienne Chauchot
+1 on "runners really terminate in a timely manner to easily 
programmatically orchestrate Beam pipelines in a portable way, you do 
need to know whether
the pipeline will finish without thinking about the specific runner and 
its options"


As an example, in Nexmark we have streaming-mode tests, and for the 
benchmark we need all the queries to behave the same way across runners 
with respect to termination.


For now, to get consistent behavior in this mode we need to set a timeout 
(somewhat arbitrary and flaky) on waitUntilFinish() for the Spark runner, but 
this timeout is not needed for the direct runner.
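
Concretely, the per-runner special casing looks more or less like this (a 
simplified sketch: the method name, the flag and the 5 minute value are 
invented for the example; the real Nexmark code is organized differently):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.joda.time.Duration;

static void runAndWait(Pipeline pipeline, boolean streamingSparkMode) {
  PipelineResult result = pipeline.run();
  if (streamingSparkMode) {
    // Arbitrary timeout: without it, waitUntilFinish() never returns here.
    result.waitUntilFinish(Duration.standardMinutes(5));
  } else {
    // The direct runner terminates on its own once the bounded input is done.
    result.waitUntilFinish();
  }
}

With the termination semantics proposed below, the branch would go away and 
waitUntilFinish() alone would be enough on every runner.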


Etienne

On 02/03/2017 at 19:27, Kenneth Knowles wrote:

Isn't this already the case? I think semantically it is an unavoidable
conclusion, so certainly +1 to that.

The DirectRunner and TestDataflowRunner both have this behavior already.
I've always considered that a streaming job running forever is just [very]
suboptimal shutdown latency :-)

Some bits of the discussion on the ticket seem to surround whether or how
to communicate this property in a generic way. Since a runner owns its
PipelineResult it doesn't seem necessary.

So is the bottom line just that you want to more strongly insist that
runners really terminate in a timely manner? I'm +1 to that, too, for
basically the reason Stas gives: In order to easily programmatically
orchestrate Beam pipelines in a portable way, you do need to know whether
the pipeline will finish without thinking about the specific runner and its
options (as with our RunnableOnService tests).

Kenn

On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin 
wrote:


Note that even "unbounded pipeline in a streaming runner".waitUntilFinish()
can return, e.g., if you cancel it or terminate it. It's totally reasonable
for users to want to understand and handle these cases.

+1

Dan

On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
wrote:


+1

Good idea !!

Regards
JB


On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:


Raising this onto the mailing list from
https://issues.apache.org/jira/browse/BEAM-849

The issue came up: what does it mean for a pipeline to finish, in the Beam
model?

Note that I am deliberately not talking about "batch" and "streaming"
pipelines, because this distinction does not exist in the model. Several
runners have batch/streaming *modes*, which implement the same semantics
(potentially different subsets: in batch mode typically a runner will
reject pipelines that have at least one unbounded PCollection) but in an
operationally different way. However we should define pipeline termination
at the level of the unified model, and then make sure that all runners in
all modes implement that properly.

One natural way is to say "a pipeline terminates when the output watermarks
of all of its PCollection's progress to +infinity". (Note: this can be
generalized, I guess, to having partial executions of a pipeline: if you're
interested in the full contents of only some collections, then you wait
until only the watermarks of those collections progress to infinity)

A typical "batch" runner mode does not implement watermarks - we can think
of it as assigning watermark -infinity to an output of a transform that
hasn't started executing yet, and +infinity to output of a transform that
has finished executing. This is consistent with how such runners implement
termination in practice.

Dataflow streaming runner additionally implements such termination for
pipeline drain operation: it has 2 parts: 1) stop consuming input from the
sources, and 2) wait until all watermarks progress to infinity.

Let us fill the gap by making this part of the Beam model and declaring
that all runners should implement this behavior. This will give nice
properties, e.g.:
- A pipeline that has only bounded collections can be run by any runner in
any mode, with the same results and termination behavior (this is actually
my motivating example for raising this issue is: I was running Splittable
DoFn tests with the streaming Dataflow runner - these tests produce only
bounded collections - and noticed that they wouldn't terminate even though
all data was processed)
- It will be possible to implement pipelines that stream data for a while
and then eventually successfully terminate based on some condition. E.g. a
pipeline that watches a continuously growing file until it is marked
read-only, or a pipeline that reads a Kafka topic partition until it
receives a "poison pill" message. This seems handy.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





Re: Naming of Combine.Globally

2017-04-18 Thread Aljoscha Krettek
Hi,
I think both fold and reduce fail to capture all the power of (what we call) 
combine. Reduce requires a function of type (T, T) -> T. It requires that the 
output type be the same as the input type. Fold takes a function (T, A) -> A 
where T is the input type and A is the accumulation type. Here, the output type 
can be different from the input type. However, there is no way of combining 
these aggregators so the operation is not distributive, i.e. we cannot 
hierarchically apply the operation.

Combine is the generalisation of this: We have three types, T (input), A 
(accumulator), O (output) and we require a function that can merge 
accumulators. The operation is distributive, meaning we can efficiently execute 
it and we can also have an output type that is different from the input type.
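
To make the three types concrete, here is a small illustrative CombineFn 
(the MeanWordLengthFn name is made up for this example) where all three 
differ: T = String, A = KV<sum of lengths, count>, O = Double (the average 
word length):

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.values.KV;

class MeanWordLengthFn extends Combine.CombineFn<String, KV<Long, Long>, Double> {
  @Override public KV<Long, Long> createAccumulator() { return KV.of(0L, 0L); }

  @Override public KV<Long, Long> addInput(KV<Long, Long> acc, String word) {
    return KV.of(acc.getKey() + word.length(), acc.getValue() + 1);
  }

  // This is what plain fold/reduce are missing: merging partial aggregates,
  // which is what allows hierarchical/distributed execution.
  @Override public KV<Long, Long> mergeAccumulators(Iterable<KV<Long, Long>> accs) {
    long sum = 0;
    long count = 0;
    for (KV<Long, Long> acc : accs) {
      sum += acc.getKey();
      count += acc.getValue();
    }
    return KV.of(sum, count);
  }

  @Override public Double extractOutput(KV<Long, Long> acc) {
    return acc.getValue() == 0 ? 0.0 : ((double) acc.getKey()) / acc.getValue();
  }
}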

Quick FYI: in Flink the CombineFn is called AggregatingFunction and 
CombiningState is AggregatingState.

Best,
Aljoscha
> On 18. Apr 2017, at 04:29, Wesley Tanaka  wrote:
> 
> As I start to understand Combine.Globally, it seems that it is, in spirit, 
> Beam's implementation of the "fold" higher-order function
> https://en.wikipedia.org/wiki/Fold_(higher-order_function)#Folds_in_various_languages
> 
> Was there a reason the word "combine" was picked instead of either "fold" or 
> "reduce"?  From the wikipedia list above, it seems as though "fold" and 
> "reduce" are in much more common usage, so either of those might be easier 
> for newcomers to understand.
> ---
> Wesley Tanaka
> http://wtanaka.com/



Jenkins build is still unstable: beam_Release_NightlySnapshot #392

2017-04-18 Thread Apache Jenkins Server
See 




Build failed in Jenkins: beam_SeedJob #213

2017-04-18 Thread Apache Jenkins Server
See 


Changes:

[altay] Add no-else return to pylintrc

[chamikara] Update assertions of source_test_utils from camelcase to

[dhalperi] Only compile HIFIO ITs when compiling with java 8.

[tgroh] Set the Project of a Table Reference at Runtime

[altay] clean up description for sdk_location

[dhalperi] [BEAM-1991] Sum.SumDoubleFn => Sum.ofDoubles

--
[...truncated 271.37 KB...]
 * [new ref] refs/pull/2472/head -> origin/pr/2472/head
 * [new ref] refs/pull/2472/merge -> origin/pr/2472/merge
 * [new ref] refs/pull/2473/head -> origin/pr/2473/head
 * [new ref] refs/pull/2473/merge -> origin/pr/2473/merge
 * [new ref] refs/pull/2474/head -> origin/pr/2474/head
 * [new ref] refs/pull/2474/merge -> origin/pr/2474/merge
 * [new ref] refs/pull/2475/head -> origin/pr/2475/head
 * [new ref] refs/pull/2475/merge -> origin/pr/2475/merge
 * [new ref] refs/pull/2476/head -> origin/pr/2476/head
 * [new ref] refs/pull/2476/merge -> origin/pr/2476/merge
 * [new ref] refs/pull/2477/head -> origin/pr/2477/head
 * [new ref] refs/pull/2477/merge -> origin/pr/2477/merge
 * [new ref] refs/pull/2478/head -> origin/pr/2478/head
 * [new ref] refs/pull/2478/merge -> origin/pr/2478/merge
 * [new ref] refs/pull/2479/head -> origin/pr/2479/head
 * [new ref] refs/pull/2479/merge -> origin/pr/2479/merge
 * [new ref] refs/pull/2480/head -> origin/pr/2480/head
 * [new ref] refs/pull/2480/merge -> origin/pr/2480/merge
 * [new ref] refs/pull/2481/head -> origin/pr/2481/head
 * [new ref] refs/pull/2481/merge -> origin/pr/2481/merge
 * [new ref] refs/pull/2482/head -> origin/pr/2482/head
 * [new ref] refs/pull/2482/merge -> origin/pr/2482/merge
 * [new ref] refs/pull/2483/head -> origin/pr/2483/head
 * [new ref] refs/pull/2483/merge -> origin/pr/2483/merge
 * [new ref] refs/pull/2484/head -> origin/pr/2484/head
 * [new ref] refs/pull/2484/merge -> origin/pr/2484/merge
 * [new ref] refs/pull/2485/head -> origin/pr/2485/head
 * [new ref] refs/pull/2485/merge -> origin/pr/2485/merge
 * [new ref] refs/pull/2486/head -> origin/pr/2486/head
 * [new ref] refs/pull/2487/head -> origin/pr/2487/head
 * [new ref] refs/pull/2487/merge -> origin/pr/2487/merge
 * [new ref] refs/pull/2488/head -> origin/pr/2488/head
 * [new ref] refs/pull/2488/merge -> origin/pr/2488/merge
 * [new ref] refs/pull/2489/head -> origin/pr/2489/head
 * [new ref] refs/pull/2489/merge -> origin/pr/2489/merge
 * [new ref] refs/pull/2490/head -> origin/pr/2490/head
 * [new ref] refs/pull/2490/merge -> origin/pr/2490/merge
 * [new ref] refs/pull/2491/head -> origin/pr/2491/head
 * [new ref] refs/pull/2492/head -> origin/pr/2492/head
 * [new ref] refs/pull/2492/merge -> origin/pr/2492/merge
 * [new ref] refs/pull/2493/head -> origin/pr/2493/head
 * [new ref] refs/pull/2493/merge -> origin/pr/2493/merge
 * [new ref] refs/pull/2494/head -> origin/pr/2494/head
 * [new ref] refs/pull/2495/head -> origin/pr/2495/head
 * [new ref] refs/pull/2495/merge -> origin/pr/2495/merge
 * [new ref] refs/pull/2496/head -> origin/pr/2496/head
 * [new ref] refs/pull/2496/merge -> origin/pr/2496/merge
 * [new ref] refs/pull/2497/head -> origin/pr/2497/head
 * [new ref] refs/pull/2497/merge -> origin/pr/2497/merge
 * [new ref] refs/pull/2498/head -> origin/pr/2498/head
 * [new ref] refs/pull/2498/merge -> origin/pr/2498/merge
 * [new ref] refs/pull/2499/head -> origin/pr/2499/head
 * [new ref] refs/pull/2499/merge -> origin/pr/2499/merge
 * [new ref] refs/pull/2500/head -> origin/pr/2500/head
 * [new ref] refs/pull/2501/head -> origin/pr/2501/head
 * [new ref] refs/pull/2501/merge -> origin/pr/2501/merge
 * [new ref] refs/pull/2502/head -> origin/pr/2502/head
 * [new ref] refs/pull/2502/merge -> origin/pr/2502/merge
 * [new ref] refs/pull/2503/head -> origin/pr/2503/head
 * [new ref] refs/pull/2504/head -> origin/pr/2504/head
 * [new ref] refs/pull/2505/head -> origin/pr/2505/head
 * [new ref] refs/pull/2505/merge -> origin/pr/2505/merge
 * [new ref] refs/pull/2506/head -> origin/pr/2506/head
 * [new ref] refs/pull/2506/merge -> origin/pr/2506/merge
 * [new ref] refs/pull/2507/head -> origin/pr/2507/head
 * [new ref] refs/pull/2507/merge -> origin/pr/2507/merge
 * [new ref] refs/pull/2508/head -> origin/pr/2508/head
 * [new ref] refs/pull/2508/merge -> origin/pr/2508/merge
 * [new ref] refs/pull/2509/head -> origin/pr/2509/head
 * [new ref]