Re: [VOTE] Release 2.1.0, release candidate #2

2017-08-07 Thread Jean-Baptiste Onofré

Hi Kenn,

As said, I just gave an extra couple of days to Stas and I to try to fix the 
issue. However, we didn't fix it yet, and I'm still struggling to find the exact 
cause as we have different tests failures.


So, I will cut RC3 as it is and we will fix the tests issue for 2.2.0 that we 
can release pretty quickly.

We are holding the release for too long (roughly a month).

Regards
JB

On 08/08/2017 01:27 AM, Kenneth Knowles wrote:

I agree with Eugene's proposal.

Suppose it takes  days to grok and fix CreateStreamTest. If we compare
delaying 2.1.0 versus releasing it immediately and starting 2.2.0:

- Users get 2.1.0 ASAP and then 2.2.0 in  days
- Users get 2.1.0 in  days

The now-failing tests were flaky, and we have some confidence that the
changes that caused the failing are good. So if this is an apparent
regression for a user, it is likely that they are in danger already.

A third alternative is that users get 2.1.0 ASAP, 2.2.0 ASAP after that to
keep the cadence going, and 2.3.0 after  days if we can't sort this
quickly. This is consistent with treating it as an existing and ongoing
bug, which it likely is.

Kenn

On Mon, Aug 7, 2017 at 4:02 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


If https://issues.apache.org/jira/browse/BEAM-2671 is a 2.1.0 blocker then
https://issues.apache.org/jira/browse/BEAM-1868 also should be, because
it's a failure of another method in the same test and I suppose it
indicates brokenness to the same extent. Or both shouldn't.

Given the progress so far, the chances of resolving the JIRA quickly are
looking bleak to me now, and the release has been going on for almost 1
month, and many large improvements have been added to Beam HEAD since the
first RC was cut.

I'm still in favor of:
1) cutting 2.1.0 RC3 immediately, and acknowledging that streaming in Spark
runner in cluster mode is still (potentially) broken in this release - to
the same or smaller extent than in 2.0.0, so this is not a regression. The
extent is still not clear to me; I asked on the JIRA.
2) immediately or very soon after this 2.1.0, start cutting 2.2.0, and
target these issues to 2.2.0.

My argument is:
- 2.1.0 contains 2.5 months worth of new features, and releasing them will
benefit a lot of existing Beam users
- I don't think there are that many users for whom it's critically
important whether the first release with working Spark streaming will be
2.1.0 or 2.2.0, especially if we start cutting 2.2.0 very soon. This is
speculation though
- (subjective personal feeling) The release process requires participation
and momentum from community members, and letting it drag on for too long
loses that momentum.

We should anyway pursue resolving the issues asap, and users who were
eagerly waiting for Spark streaming to work properly can run Beam at HEAD
in the window between when they are first resolved and when 2.2.0 is
released.

What do you think?

On Sat, Aug 5, 2017 at 9:31 PM Jean-Baptiste Onofré 
wrote:


Another quick update.

Aviem updated the Jira as he and his team wants to take a look. I'm also
doing a
new bisect on my side. I've given an extra day to move forward. If we
don't have
clear statement tonight, then, I will cut the RC3 tonight or tomorrow
morning
(my time).

Regards
JB

On 08/05/2017 02:37 AM, Eugene Kirpichov wrote:

I did some more investigation on that JIRA
https://issues.apache.org/jira/browse/BEAM-2671 and my conclusion is:

We need to postpone that JIRA to 2.2.0 and finalize release 2.1.0

as-is.


The TL;DR of my investigation is that:
- We have some confidence that Spark runner in 2.1.0 generally works
properly: it passes ValidatesRunner tests, and there's been some amount

of

manual testing.
- Release 2.0.0 does not contain a critical fix and, if I understand
correctly, Spark runner at 2.0.0 was basically unusable in streaming
cluster mode.
- So, even if the JIRA signals that there is something wrong in the

Spark

runner at 2.1.0, it's definitely better than 2.0.0 so there is no
regression for the user.

I moved the JIRA to 2.2.0 so there are no blocking issues remaining for
2.1.0. JB - the next step is for you to proceed with cutting the RC,
correct?

Thanks.

On Thu, Aug 3, 2017 at 7:04 AM Jean-Baptiste Onofré 

wrote:



Another quick update. Regarding BEAM-2671, I asked help from Stas and
Aviem on
this one. It's our high priority as it's the main blocking issue

before

cutting RC3.

At some point, if we are not able to move fast on this one, I would
propose to
cut RC3 as it is.

Regards
JB

On 08/02/2017 08:52 PM, Jean-Baptiste Onofré wrote:

Hi,

Thanks Eugene for the sumup.

BEAM-2708 is now fixed.

The last blocking issue for RC3 is BEAM-2671. I spent time today on

this

one,

investigating the different issues.

Agree that help from Aviem and Kenn would help for sure.

Aviem already started to kindly take a look on the Jira today.

Clearly, it would be great to fix BEAM-2671 in the coming 36 

Re: [PROPOSAL] Merge gearpump-runner to master

2017-08-07 Thread Paul Findlay
Cheers team.

I have found a compilation issue on master (in
CreateGearpumpPCollectionView), attached is a small patch

Kind regards,

Paul

On Tue, Aug 8, 2017 at 4:10 PM, Manu Zhang  wrote:

> Thanks Kenn!!! Thanks everyone!!! It's a great achievement for us.
>
> On Tue, Aug 8, 2017 at 7:54 AM Kenneth Knowles 
> wrote:
>
> > Done!
> >
> > On Fri, Jul 21, 2017 at 11:08 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > +1
> > >
> > > Regards
> > > JB
> > >
> > > On Jul 22, 2017, 05:06, at 05:06, Kenneth Knowles
>  > >
> > > wrote:
> > > >+1 to this!
> > > >
> > > >I really want to call out the longevity of contribution behind this,
> > > >following many changes in both Beam and Gearpump for over a year.
> > > >Here's
> > > >the first commit on the branch:
> > > >
> > > >commit 9478f4117de3a2d0ea40614ed4cb801918610724 (github/pr/323)
> > > >Author: manuzhang 
> > > >Date:   Tue Mar 15 16:15:16 2016 +0800
> > > >
> > > >And here are some numbers, FWIW: 163 non-merge commits, 203 total. So
> > > >that's a PR and review every couple of weeks.
> > > >
> > > >The ValidatesRunner capability coverage is very good. The only skipped
> > > >tests are state/timers, metrics, and TestStream, which many runners
> > > >have
> > > >partial or no support for.
> > > >
> > > >I'll save practical TODOs like moving ValidatesRunner execution to
> > > >postcommit, etc. Pending the results of this discussion, of course.
> > > >
> > > >Kenn
> > > >
> > > >
> > > >On Fri, Jul 21, 2017 at 12:02 AM, Manu Zhang  >
> > > >wrote:
> > > >
> > > >> Guys,
> > > >>
> > > >> On behalf of the gearpump team, I'd like to propose to merge the
> > > >> gearpump-runner branch into master, which will give it more
> > > >visibility to
> > > >> other contributors and users. The runner satisfies the following
> > > >criteria
> > > >> outlined in contribution guide [1].
> > > >>
> > > >>
> > > >>1. Have at least 2 contributors interested in maintaining it, and
> > > >1
> > > >>committer interested in supporting it: *Both Huafeng and me have
> > > >been
> > > >>making contributions[2] and we will continue to maintain it. Kenn
> > > >and JB
> > > >>have been supporting the runner (Thank you, guys!)*
> > > >>2. Provide both end-user and developer-facing documentation*:
> They
> > > >are
> > > >>already on the website ([3] and [4]).*
> > > >>3. Have at least a basic level of unit test coverage: *We do.*
> > > >*[5]*
> > > >>4. Run all existing applicable integration tests with other Beam
> > > >>components and create additional tests as appropriate:
> > > >*gearpump-runner
> > > >>passes ValidatesRunner tests.*
> > > >>
> > > >>
> > > >> Additionally, as a runner,
> > > >>
> > > >>
> > > >>1. Be able to handle a subset of the model that address a
> > > >significant
> > > >>set of use cases (aka. ‘traditional batch’ or ‘processing time
> > > >> streaming’): *gearpump
> > > >>runner is able to handle event time streaming *
> > > >>2. Update the capability matrix with the current status: *[4]*
> > > >>3. Add a webpage under documentation/runners: *[3]*
> > > >>
> > > >>
> > > >> The PR for the merge: https://github.com/apache/beam/pull/3611
> > > >>
> > > >> Thanks,
> > > >> Manu
> > > >>
> > > >>
> > > >> [1]
> > > >http://beam.apache.org/contribute/contribution-guide/
> #feature-branches
> > > >> [2] https://issues.apache.org/jira/browse/BEAM-79
> > > >> [3] https://beam.apache.org/documentation/runners/gearpump/
> > > >> [4] https://beam.apache.org/documentation/runners/
> capability-matrix/
> > > >> [5]
> > > >> https://github.com/apache/beam/tree/gearpump-runner/
> > > >> runners/gearpump/src/test/java/org/apache/beam/runners/gearpump
> > > >>
> > >
> >
>
Index: runners/gearpump/src/main/java/org/apache/beam/runners/gearpump/translators/CreateStreamingGearpumpView.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
--- runners/gearpump/src/main/java/org/apache/beam/runners/gearpump/translators/CreateStreamingGearpumpView.java	(revision 89236e3b588cd8ab5092667b1fe155738c69dbcf)
+++ runners/gearpump/src/main/java/org/apache/beam/runners/gearpump/translators/CreateStreamingGearpumpView.java	(revision )
@@ -18,11 +18,6 @@
 package org.apache.beam.runners.gearpump.translators;
 
 import com.google.common.collect.Iterables;
-
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Map;
-
 import org.apache.beam.runners.core.construction.ReplacementOutputs;
 import org.apache.beam.sdk.coders.Coder;
 import org.apache.beam.sdk.coders.CoderRegistry;
@@ -37,6 +32,10 @@
 import org.apache.beam.sdk.values.PValue;
 import org.apache.beam.sdk.values.TupleTag;
 
+import java.util.ArrayList;
+import java.util.List;
+import 

Re: [PROPOSAL] Merge gearpump-runner to master

2017-08-07 Thread Manu Zhang
Thanks Kenn!!! Thanks everyone!!! It's a great achievement for us.

On Tue, Aug 8, 2017 at 7:54 AM Kenneth Knowles 
wrote:

> Done!
>
> On Fri, Jul 21, 2017 at 11:08 PM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > Regards
> > JB
> >
> > On Jul 22, 2017, 05:06, at 05:06, Kenneth Knowles  >
> > wrote:
> > >+1 to this!
> > >
> > >I really want to call out the longevity of contribution behind this,
> > >following many changes in both Beam and Gearpump for over a year.
> > >Here's
> > >the first commit on the branch:
> > >
> > >commit 9478f4117de3a2d0ea40614ed4cb801918610724 (github/pr/323)
> > >Author: manuzhang 
> > >Date:   Tue Mar 15 16:15:16 2016 +0800
> > >
> > >And here are some numbers, FWIW: 163 non-merge commits, 203 total. So
> > >that's a PR and review every couple of weeks.
> > >
> > >The ValidatesRunner capability coverage is very good. The only skipped
> > >tests are state/timers, metrics, and TestStream, which many runners
> > >have
> > >partial or no support for.
> > >
> > >I'll save practical TODOs like moving ValidatesRunner execution to
> > >postcommit, etc. Pending the results of this discussion, of course.
> > >
> > >Kenn
> > >
> > >
> > >On Fri, Jul 21, 2017 at 12:02 AM, Manu Zhang 
> > >wrote:
> > >
> > >> Guys,
> > >>
> > >> On behalf of the gearpump team, I'd like to propose to merge the
> > >> gearpump-runner branch into master, which will give it more
> > >visibility to
> > >> other contributors and users. The runner satisfies the following
> > >criteria
> > >> outlined in contribution guide [1].
> > >>
> > >>
> > >>1. Have at least 2 contributors interested in maintaining it, and
> > >1
> > >>committer interested in supporting it: *Both Huafeng and me have
> > >been
> > >>making contributions[2] and we will continue to maintain it. Kenn
> > >and JB
> > >>have been supporting the runner (Thank you, guys!)*
> > >>2. Provide both end-user and developer-facing documentation*: They
> > >are
> > >>already on the website ([3] and [4]).*
> > >>3. Have at least a basic level of unit test coverage: *We do.*
> > >*[5]*
> > >>4. Run all existing applicable integration tests with other Beam
> > >>components and create additional tests as appropriate:
> > >*gearpump-runner
> > >>passes ValidatesRunner tests.*
> > >>
> > >>
> > >> Additionally, as a runner,
> > >>
> > >>
> > >>1. Be able to handle a subset of the model that address a
> > >significant
> > >>set of use cases (aka. ‘traditional batch’ or ‘processing time
> > >> streaming’): *gearpump
> > >>runner is able to handle event time streaming *
> > >>2. Update the capability matrix with the current status: *[4]*
> > >>3. Add a webpage under documentation/runners: *[3]*
> > >>
> > >>
> > >> The PR for the merge: https://github.com/apache/beam/pull/3611
> > >>
> > >> Thanks,
> > >> Manu
> > >>
> > >>
> > >> [1]
> > >http://beam.apache.org/contribute/contribution-guide/#feature-branches
> > >> [2] https://issues.apache.org/jira/browse/BEAM-79
> > >> [3] https://beam.apache.org/documentation/runners/gearpump/
> > >> [4] https://beam.apache.org/documentation/runners/capability-matrix/
> > >> [5]
> > >> https://github.com/apache/beam/tree/gearpump-runner/
> > >> runners/gearpump/src/test/java/org/apache/beam/runners/gearpump
> > >>
> >
>


Re: Proposed API for a Whole File IO

2017-08-07 Thread Chris Hebert
Hello again,

I created tickets to capture these requests:
https://issues.apache.org/jira/browse/BEAM-2750
https://issues.apache.org/jira/browse/BEAM-2751

I've started working on the Write part.

Robert, after some time working on this, I'm unable to see how these
objectives can be "entirely done in the context of a DoFn". Could you lend
a hint?

I assume you didn't mean for me to append to my pipeline a ParDo.of(new
DoFn... that manually writes out to some file location, did you? That would
lose all the benefits of the IO/Sink classes.

That said, I've found the "sharding" logic to be deeply embedded in all the
FileBaseSink derivatives, and my attempt at sidestepping this logic isn't
going very well. I managed to write a FileBasedSink that writes Byte[] out
(and that works correctly), but figuring out how to get the FIlenamePolicy
to be different for each element written out seems counter to the intent of
much of these classes.

Chris

On Wed, Aug 2, 2017 at 10:23 AM, Reuven Lax 
wrote:

> On Wed, Aug 2, 2017 at 7:49 AM, Chris Hebert <
> chris.hebert-...@digitalreasoning.com> wrote:
>
> > Thanks for the feedback!
> >
> > Aggregated thoughts:
> >
> >1. Warn users about large files (like 5GB large)
> >
>
> I would set the threshold smaller. Also remember, that while you may warn,
> some runners might simply fail to process the record causing pipelines to
> either get stuck or fail all together.
>
>
> >2. Filenames can stick with contents via PCollection >contents>>
> >3. InputStreams can't be encoded directly, but could be referenced in
> a
> >FileWrapper object
> >4. Be mindful of sink race conditions with multiple workers; make sure
> >failed workers cleanup incompletely written files
> >5. File systems often do better with few large files than many small
> > ones
> >6. Most/all of this can be done in the context of a DoFn
> >
> > # Regarding point 1 and point 2
> > Yes!
> >
> > # Regarding point 3:
> >
> > ## Approach A:
> > When the FileWrapper is encoded, it must somehow encode a reference to
> the
> > InputStream it is associated with, so that when the FileWrapper is
> decoded
> > it can pick up that InputStream again. My Java knowledge isn't deep
> enough
> > to know how one would do that with hashcodes and object lookup and such,
> > but I could sidestep that entirely by simply encoding the filepath with
> the
> > FileWrapper, then open up a new InputStream on that file path every time
> > the FileWrapper is decoded.
> >
> > ## Approach B:
> > An alternative to the above technique is to simply pass a byte[] array,
> > like so:
> > PCollection> fileNamesAndBytes = p.apply("Read",
> > WholeFileIO.read().from("/path/to/input/dir/*"));
> >
> > That would solve the encoding problem, allow users to get whatever they
> > want out of it with a ByteArrayInputStream, AND put a hard limit on the
> > size of incoming files at just below 2 GB (if my math is right). (This is
> > large enough for my use case, at present.)
> >
> >
> > # Regarding point 4:
> >
> > Any examples or guidance I could pull from to protect against this
> > properly?
> >
> >
> > # Regarding point 5:
> >
> > TextIO can read and write with different compressions. Would it be
> feasible
> > for this WholeFileIO to read and write these many files to compressed zip
> > files also? (I envision this as a stretch feature that would be added
> after
> > the initial iteration anyway.)
> >
> >
> > # Regarding point 6:
> >
> > The only prebuilt IO thing I've found find in Beam that uses DoFn is
> > WriteFiles. Do you have any examples to point towards to enlighten me on
> > the use of DoFn in this context? Unfortunately, we all know the
> "Authoring
> > I/O Transforms" documentation is sparse.
> >
> >
> > Keep it coming,
> > Chris
> >
> > On Tue, Aug 1, 2017 at 3:55 PM, Robert Bradshaw
> >  > > wrote:
> >
> > > On Tue, Aug 1, 2017 at 1:42 PM, Eugene Kirpichov <
> > > kirpic...@google.com.invalid> wrote:
> > >
> > > > Hi,
> > > > As mentioned on the PR - I support the creation of such an IO (both
> > read
> > > > and write) with the caveats that Reuven mentioned; we can refine the
> > > naming
> > > > during code review.
> > > > Note that you won't be able to create a PCollection
> > because
> > > > elements of a PCollection must have a coder and it's not possible to
> > > > provide a coder for InputStream.
> > >
> > >
> > > Well, it's possible, but the fact that InputStream is mutable may cause
> > > issues (e.g. if there's fusion, or when estimating its size).
> > >
> > > I would probably let the API consume/produce a PCollection > > contents>>. Alternatively, a FileWrapper object of some kind could
> > provide
> > > accessors to InputStream (or otherwise facilitate lazy reading).
> > >
> > > Note for the sink one must take care there's no race in case multiple
> > > workers are attempting to process the same 

Build failed in Jenkins: beam_SeedJob #376

2017-08-07 Thread Apache Jenkins Server
See 

--
GitHub pull request #3668 of commit 9441564998f8dfc82e39d51fff1db75af256cad6, 
no merge conflicts.
Setting status of 9441564998f8dfc82e39d51fff1db75af256cad6 to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/376/ and message: 'Build started 
sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam6 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/3668/*:refs/remotes/origin/pr/3668/*
 > git rev-parse refs/remotes/origin/pr/3668/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/3668/merge^{commit} # timeout=10
Checking out Revision 5681116b2dd7375e0017eeaf94c83606eac5fb32 
(refs/remotes/origin/pr/3668/merge)
Commit message: "Merge 9441564998f8dfc82e39d51fff1db75af256cad6 into 
c9abd15e520cd2e6abd99a82b74ed93436bf0409"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5681116b2dd7375e0017eeaf94c83606eac5fb32
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
Processing DSL script job_beam_PerformanceTests_Dataflow.groovy
Processing DSL script job_beam_PerformanceTests_JDBC.groovy
ERROR: (job_beam_PerformanceTests_JDBC.groovy, line 38) No signature of method: 
static common_job_properties.enablePhraseTriggeringFromPullRequest() is 
applicable for argument types: (javaposse.jobdsl.dsl.jobs.FreeStyleJob, 
java.lang.String) values: [javaposse.jobdsl.dsl.jobs.FreeStyleJob@51a40f11, Run 
JDBC Performance Test]
Possible solutions: enablePhraseTriggeringFromPullRequest(java.lang.Object, 
java.lang.String, java.lang.String)
Not sending mail to unregistered user kirpic...@google.com


Re: [VOTE] Release 2.1.0, release candidate #2

2017-08-07 Thread Eugene Kirpichov
If https://issues.apache.org/jira/browse/BEAM-2671 is a 2.1.0 blocker then
https://issues.apache.org/jira/browse/BEAM-1868 also should be, because
it's a failure of another method in the same test and I suppose it
indicates brokenness to the same extent. Or both shouldn't.

Given the progress so far, the chances of resolving the JIRA quickly are
looking bleak to me now, and the release has been going on for almost 1
month, and many large improvements have been added to Beam HEAD since the
first RC was cut.

I'm still in favor of:
1) cutting 2.1.0 RC3 immediately, and acknowledging that streaming in Spark
runner in cluster mode is still (potentially) broken in this release - to
the same or smaller extent than in 2.0.0, so this is not a regression. The
extent is still not clear to me; I asked on the JIRA.
2) immediately or very soon after this 2.1.0, start cutting 2.2.0, and
target these issues to 2.2.0.

My argument is:
- 2.1.0 contains 2.5 months worth of new features, and releasing them will
benefit a lot of existing Beam users
- I don't think there are that many users for whom it's critically
important whether the first release with working Spark streaming will be
2.1.0 or 2.2.0, especially if we start cutting 2.2.0 very soon. This is
speculation though
- (subjective personal feeling) The release process requires participation
and momentum from community members, and letting it drag on for too long
loses that momentum.

We should anyway pursue resolving the issues asap, and users who were
eagerly waiting for Spark streaming to work properly can run Beam at HEAD
in the window between when they are first resolved and when 2.2.0 is
released.

What do you think?

On Sat, Aug 5, 2017 at 9:31 PM Jean-Baptiste Onofré  wrote:

> Another quick update.
>
> Aviem updated the Jira as he and his team wants to take a look. I'm also
> doing a
> new bisect on my side. I've given an extra day to move forward. If we
> don't have
> clear statement tonight, then, I will cut the RC3 tonight or tomorrow
> morning
> (my time).
>
> Regards
> JB
>
> On 08/05/2017 02:37 AM, Eugene Kirpichov wrote:
> > I did some more investigation on that JIRA
> > https://issues.apache.org/jira/browse/BEAM-2671 and my conclusion is:
> >
> > We need to postpone that JIRA to 2.2.0 and finalize release 2.1.0 as-is.
> >
> > The TL;DR of my investigation is that:
> > - We have some confidence that Spark runner in 2.1.0 generally works
> > properly: it passes ValidatesRunner tests, and there's been some amount
> of
> > manual testing.
> > - Release 2.0.0 does not contain a critical fix and, if I understand
> > correctly, Spark runner at 2.0.0 was basically unusable in streaming
> > cluster mode.
> > - So, even if the JIRA signals that there is something wrong in the Spark
> > runner at 2.1.0, it's definitely better than 2.0.0 so there is no
> > regression for the user.
> >
> > I moved the JIRA to 2.2.0 so there are no blocking issues remaining for
> > 2.1.0. JB - the next step is for you to proceed with cutting the RC,
> > correct?
> >
> > Thanks.
> >
> > On Thu, Aug 3, 2017 at 7:04 AM Jean-Baptiste Onofré 
> wrote:
> >
> >> Another quick update. Regarding BEAM-2671, I asked help from Stas and
> >> Aviem on
> >> this one. It's our high priority as it's the main blocking issue before
> >> cutting RC3.
> >>
> >> At some point, if we are not able to move fast on this one, I would
> >> propose to
> >> cut RC3 as it is.
> >>
> >> Regards
> >> JB
> >>
> >> On 08/02/2017 08:52 PM, Jean-Baptiste Onofré wrote:
> >>> Hi,
> >>>
> >>> Thanks Eugene for the sumup.
> >>>
> >>> BEAM-2708 is now fixed.
> >>>
> >>> The last blocking issue for RC3 is BEAM-2671. I spent time today on
> this
> >> one,
> >>> investigating the different issues.
> >>>
> >>> Agree that help from Aviem and Kenn would help for sure.
> >>>
> >>> Aviem already started to kindly take a look on the Jira today.
> >>>
> >>> Clearly, it would be great to fix BEAM-2671 in the coming 36 hours. I
> >> would like
> >>> to submit RC3 to vote tomorrow or the day after (my time).
> >>>
> >>> Thanks !
> >>> Regards
> >>> JB
> >>>
> >>> On 08/02/2017 08:24 PM, Eugene Kirpichov wrote:
>  We're down to 2 issues.
> 
>  BEAM-2670 has been fixed.
>  https://issues.apache.org/jira/browse/BEAM-2708 has a fix in review
> 
>  https://issues.apache.org/jira/browse/BEAM-2671 is the nasty one and
> we
>  don't understand it nor have a fix. Help is needed; some people who
> >> could
>  help are +Kenn Knowles  and +Aviem Zur <
> >> aviem...@gmail.com>
> .
> 
>  On Thu, Jul 27, 2017 at 6:41 AM Jean-Baptiste Onofré  >
>  wrote:
> 
> > Hi guys,
> >
> > We have three open issues for the 2.1.0 that we need to fix before I
> >> will
> > be
> > able to cut RC3:
> >
> > https://issues.apache.org/jira/projects/BEAM/versions/12340528
> >
> > I'm working on BEAM-2671.
> 

Re: Proposal : An extension for sketch-based statistics

2017-08-07 Thread Kenneth Knowles
This is a great development! I have wanted Beam to have a library of
sketches.

What Eugene is referring to is the fact that you can write
Combine.perKey(combineFn) to use these in a transform but also
StateSpecs.combiningState(combineFn) to use them in a stateful ParDo. So it
is good to make the CombineFn public and refine their constructors to be
user-friendly.

Kenn

On Fri, Aug 4, 2017 at 7:45 AM, Arnaud Fournier  wrote:

> Thanks for your comments, that is very encouraging !
>
> I have created a Jira : https://issues.apache.org/jira/browse/BEAM-2728
> and a PR : https://github.com/apache/beam/pull/3686
>
> Eugene and Lucas I saw that you already have some ideas so I put you as
> reviewers,
> I look forward to hear more from you.
>
> With Ismael and JB, we already thought about using some of these indicators
> as metric cells,
> as it can be useful for some kinds of monitoring.
> But I have never heard about state cells, is it something like the
> QuantileState in ApproximateQuantiles ?
>
>
>
> 2017-08-04 3:14 GMT+02:00 Anand Iyer :
>
> > This is awesome!! Very exciting to see the addition of statistical and
> > data-mining algorithms to Apache Beam.
> >
> > On Thu, Aug 3, 2017 at 2:32 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > +1, Very exciting! I have some suggestions on the exact API to expose
> > (e.g.
> > > I think it makes sense to expose the CombineFn's directly, so that they
> > can
> > > also be used for combining state cells and not just as PTransforms),
> but
> > > that can be handled during regular code review.
> > >
> > > On Thu, Aug 3, 2017 at 2:23 PM Sourabh Bajaj
> > >  wrote:
> > >
> > > > +1 to this.
> > > >
> > > > On Thu, Aug 3, 2017 at 6:28 AM Lukasz Cwik  >
> > > > wrote:
> > > >
> > > > > I'm most interested in the frequency / cardinality tools as it
> could
> > be
> > > > > used to help improve performance automatically for combiners by
> > > detecting
> > > > > the few keys case or automatically handle hot keys without needing
> > > users
> > > > to
> > > > > specify the hints when they use a combiner.
> > > > >
> > > > > On Thu, Aug 3, 2017 at 5:35 AM, Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> > > > > wrote:
> > > > >
> > > > > > Nice work Arnaud ;)
> > > > > >
> > > > > > Happy to have been able to help.
> > > > > >
> > > > > > Let's see what the others will think about this.
> > > > > >
> > > > > > Regards
> > > > > > JB
> > > > > >
> > > > > >
> > > > > > On 08/03/2017 02:32 PM, Arnaud Fournier wrote:
> > > > > >
> > > > > >> Hello everyone,
> > > > > >>
> > > > > >> My name is Arnaud Fournier and I am a CS student. I am currently
> > > doing
> > > > > an
> > > > > >> internship at Talend.
> > > > > >>
> > > > > >> With the support of Jean-Baptiste Onofre and Ismaël Mejia, I
> have
> > > been
> > > > > >> working on statistical analysis of streams with Beam, using
> > > > > probabilistic
> > > > > >> data structures like HyperLogLog.
> > > > > >>
> > > > > >> I would like to share this work with the community, but I wanted
> > > first
> > > > > to
> > > > > >> show you my work in progress and ask you if this humble
> > contribution
> > > > > could
> > > > > >> be interesting as an extension.
> > > > > >>
> > > > > >> I have made a little doc with more details about what I have
> done
> > in
> > > > > case
> > > > > >> you are interested and want to give me some feedback :
> > > > > >> *https://docs.google.com/document/d/1Xy6g5RPBYX_HadpIr_2WrUe
> > > > > >> usiwL0Jo2ACI5PEOP1kc/edit*
> > > > > >>  > > > > >> usiwL0Jo2ACI5PEOP1kc/edit>
> > > > > >>
> > > > > >> You can also find the current work implementation in progress
> here
> > > :
> > > > > >>
> > > > > >> https://github.com/ArnaudFnr/beam/tree/sketching/sdks/java/e
> > > > > >> xtensions/sketching
> > > > > >>
> > > > > >>
> > > > > >>  > > > > >> extensions/sketching>
> > > > > >>
> > > > > >> Thanks !
> > > > > >>
> > > > > >> Arnaud
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > > Jean-Baptiste Onofré
> > > > > > jbono...@apache.org
> > > > > > http://blog.nanthrax.net
> > > > > > Talend - http://www.talend.com
> > > > > >
> > > > >
> > > >
> > >
> >
>


Jenkins build is back to normal : beam_SeedJob #377

2017-08-07 Thread Apache Jenkins Server
See 



Re: [VOTE] Release 2.1.0, release candidate #2

2017-08-07 Thread Kenneth Knowles
I agree with Eugene's proposal.

Suppose it takes  days to grok and fix CreateStreamTest. If we compare
delaying 2.1.0 versus releasing it immediately and starting 2.2.0:

   - Users get 2.1.0 ASAP and then 2.2.0 in  days
   - Users get 2.1.0 in  days

The now-failing tests were flaky, and we have some confidence that the
changes that caused the failing are good. So if this is an apparent
regression for a user, it is likely that they are in danger already.

A third alternative is that users get 2.1.0 ASAP, 2.2.0 ASAP after that to
keep the cadence going, and 2.3.0 after  days if we can't sort this
quickly. This is consistent with treating it as an existing and ongoing
bug, which it likely is.

Kenn

On Mon, Aug 7, 2017 at 4:02 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> If https://issues.apache.org/jira/browse/BEAM-2671 is a 2.1.0 blocker then
> https://issues.apache.org/jira/browse/BEAM-1868 also should be, because
> it's a failure of another method in the same test and I suppose it
> indicates brokenness to the same extent. Or both shouldn't.
>
> Given the progress so far, the chances of resolving the JIRA quickly are
> looking bleak to me now, and the release has been going on for almost 1
> month, and many large improvements have been added to Beam HEAD since the
> first RC was cut.
>
> I'm still in favor of:
> 1) cutting 2.1.0 RC3 immediately, and acknowledging that streaming in Spark
> runner in cluster mode is still (potentially) broken in this release - to
> the same or smaller extent than in 2.0.0, so this is not a regression. The
> extent is still not clear to me; I asked on the JIRA.
> 2) immediately or very soon after this 2.1.0, start cutting 2.2.0, and
> target these issues to 2.2.0.
>
> My argument is:
> - 2.1.0 contains 2.5 months worth of new features, and releasing them will
> benefit a lot of existing Beam users
> - I don't think there are that many users for whom it's critically
> important whether the first release with working Spark streaming will be
> 2.1.0 or 2.2.0, especially if we start cutting 2.2.0 very soon. This is
> speculation though
> - (subjective personal feeling) The release process requires participation
> and momentum from community members, and letting it drag on for too long
> loses that momentum.
>
> We should anyway pursue resolving the issues asap, and users who were
> eagerly waiting for Spark streaming to work properly can run Beam at HEAD
> in the window between when they are first resolved and when 2.2.0 is
> released.
>
> What do you think?
>
> On Sat, Aug 5, 2017 at 9:31 PM Jean-Baptiste Onofré 
> wrote:
>
> > Another quick update.
> >
> > Aviem updated the Jira as he and his team wants to take a look. I'm also
> > doing a
> > new bisect on my side. I've given an extra day to move forward. If we
> > don't have
> > clear statement tonight, then, I will cut the RC3 tonight or tomorrow
> > morning
> > (my time).
> >
> > Regards
> > JB
> >
> > On 08/05/2017 02:37 AM, Eugene Kirpichov wrote:
> > > I did some more investigation on that JIRA
> > > https://issues.apache.org/jira/browse/BEAM-2671 and my conclusion is:
> > >
> > > We need to postpone that JIRA to 2.2.0 and finalize release 2.1.0
> as-is.
> > >
> > > The TL;DR of my investigation is that:
> > > - We have some confidence that Spark runner in 2.1.0 generally works
> > > properly: it passes ValidatesRunner tests, and there's been some amount
> > of
> > > manual testing.
> > > - Release 2.0.0 does not contain a critical fix and, if I understand
> > > correctly, Spark runner at 2.0.0 was basically unusable in streaming
> > > cluster mode.
> > > - So, even if the JIRA signals that there is something wrong in the
> Spark
> > > runner at 2.1.0, it's definitely better than 2.0.0 so there is no
> > > regression for the user.
> > >
> > > I moved the JIRA to 2.2.0 so there are no blocking issues remaining for
> > > 2.1.0. JB - the next step is for you to proceed with cutting the RC,
> > > correct?
> > >
> > > Thanks.
> > >
> > > On Thu, Aug 3, 2017 at 7:04 AM Jean-Baptiste Onofré 
> > wrote:
> > >
> > >> Another quick update. Regarding BEAM-2671, I asked help from Stas and
> > >> Aviem on
> > >> this one. It's our high priority as it's the main blocking issue
> before
> > >> cutting RC3.
> > >>
> > >> At some point, if we are not able to move fast on this one, I would
> > >> propose to
> > >> cut RC3 as it is.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 08/02/2017 08:52 PM, Jean-Baptiste Onofré wrote:
> > >>> Hi,
> > >>>
> > >>> Thanks Eugene for the sumup.
> > >>>
> > >>> BEAM-2708 is now fixed.
> > >>>
> > >>> The last blocking issue for RC3 is BEAM-2671. I spent time today on
> > this
> > >> one,
> > >>> investigating the different issues.
> > >>>
> > >>> Agree that help from Aviem and Kenn would help for sure.
> > >>>
> > >>> Aviem already started to kindly take a look on the Jira today.
> > >>>
> > >>> Clearly, it would be great to fix BEAM-2671 

Re: [PROPOSAL] Merge gearpump-runner to master

2017-08-07 Thread Kenneth Knowles
Done!

On Fri, Jul 21, 2017 at 11:08 PM, Jean-Baptiste Onofré 
wrote:

> +1
>
> Regards
> JB
>
> On Jul 22, 2017, 05:06, at 05:06, Kenneth Knowles 
> wrote:
> >+1 to this!
> >
> >I really want to call out the longevity of contribution behind this,
> >following many changes in both Beam and Gearpump for over a year.
> >Here's
> >the first commit on the branch:
> >
> >commit 9478f4117de3a2d0ea40614ed4cb801918610724 (github/pr/323)
> >Author: manuzhang 
> >Date:   Tue Mar 15 16:15:16 2016 +0800
> >
> >And here are some numbers, FWIW: 163 non-merge commits, 203 total. So
> >that's a PR and review every couple of weeks.
> >
> >The ValidatesRunner capability coverage is very good. The only skipped
> >tests are state/timers, metrics, and TestStream, which many runners
> >have
> >partial or no support for.
> >
> >I'll save practical TODOs like moving ValidatesRunner execution to
> >postcommit, etc. Pending the results of this discussion, of course.
> >
> >Kenn
> >
> >
> >On Fri, Jul 21, 2017 at 12:02 AM, Manu Zhang 
> >wrote:
> >
> >> Guys,
> >>
> >> On behalf of the gearpump team, I'd like to propose to merge the
> >> gearpump-runner branch into master, which will give it more
> >visibility to
> >> other contributors and users. The runner satisfies the following
> >criteria
> >> outlined in contribution guide [1].
> >>
> >>
> >>1. Have at least 2 contributors interested in maintaining it, and
> >1
> >>committer interested in supporting it: *Both Huafeng and me have
> >been
> >>making contributions[2] and we will continue to maintain it. Kenn
> >and JB
> >>have been supporting the runner (Thank you, guys!)*
> >>2. Provide both end-user and developer-facing documentation*: They
> >are
> >>already on the website ([3] and [4]).*
> >>3. Have at least a basic level of unit test coverage: *We do.*
> >*[5]*
> >>4. Run all existing applicable integration tests with other Beam
> >>components and create additional tests as appropriate:
> >*gearpump-runner
> >>passes ValidatesRunner tests.*
> >>
> >>
> >> Additionally, as a runner,
> >>
> >>
> >>1. Be able to handle a subset of the model that address a
> >significant
> >>set of use cases (aka. ‘traditional batch’ or ‘processing time
> >> streaming’): *gearpump
> >>runner is able to handle event time streaming *
> >>2. Update the capability matrix with the current status: *[4]*
> >>3. Add a webpage under documentation/runners: *[3]*
> >>
> >>
> >> The PR for the merge: https://github.com/apache/beam/pull/3611
> >>
> >> Thanks,
> >> Manu
> >>
> >>
> >> [1]
> >http://beam.apache.org/contribute/contribution-guide/#feature-branches
> >> [2] https://issues.apache.org/jira/browse/BEAM-79
> >> [3] https://beam.apache.org/documentation/runners/gearpump/
> >> [4] https://beam.apache.org/documentation/runners/capability-matrix/
> >> [5]
> >> https://github.com/apache/beam/tree/gearpump-runner/
> >> runners/gearpump/src/test/java/org/apache/beam/runners/gearpump
> >>
>


Jenkins build is still unstable: beam_SeedJob #375

2017-08-07 Thread Apache Jenkins Server
See 



Jenkins build is still unstable: beam_SeedJob #374

2017-08-07 Thread Apache Jenkins Server
See 




Re: Should Pipeline wait till all processing time timers fire before exit?

2017-08-07 Thread Kenneth Knowles
On Thu, Aug 3, 2017 at 7:05 AM, Lukasz Cwik 
wrote:

> On Wed, Jul 26, 2017 at 12:23 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > On Wed, Jul 26, 2017 at 7:45 AM, Lukasz Cwik 
> > wrote:
> > > Robert, in your case where output is being produced based upon a
> > heartbeat,
> > > either the watermark on the output went to infinity and all that data
> > being
> > > produced is droppable at which point the timer becomes droppable
> >
> > But why are these timers *more* droppable than the ones that were
> > scheduled previously (as per the original question)?
>
>
> Thinking about this some more in the context of DONE, drain, and update. In
> the drain case, the output watermark goes to infinity and I believe we
> should drop all timers. In the update case, the timers should be preserved
> like state from one pipeline to the next. Finally, this should never happen
> for the DONE case as this means we fired all timers and processed all data
> (which is different from drain since we know we may have dropped timers). I
> don't believe we will have a case where a timer watermark can be
> independent of its timestamp which effectively holds the output watermark.
> If there was a case where the output watermark could be controlled
> independently of the timers then I could see that you would still fire
> existing timers but prevent scheduling new timers since new timers should
> be allowed to be dropped.
>
>
In the details of https://issues.apache.org/jira/browse/BEAM-2535 is the
fact that the timer loopback input channel needs a watermark separate from
the element input channel. The timestamp associated with the timer
(independent of its delivery logic) holds the watermark on that channel,
which automatically holds the output watermark. So it won't prevent new
timers from being set (I went through the same logic in my head, too).

If I read this discussion correctly, normal termination and update "just
work", and it seems like the most unresolved bit is whether or not we want
users to write loops equivalent to "while(true) { ... }" that we break when
we issue a drain command. The semantics are clearer if we just treat these
as infinite loops, but then naive uses of timers create undrainable
pipelines. But if drain can drop timers, making those loops drainable, it
seems like a situation the user has not considered, so possibly resulting
in data loss.

Kenn