Re: AfterWatermarkEarlyAndLate

2017-04-20 Thread Wesley Tanaka
Thanks for parsing the formatting on my original mail.  :)
What else is required for that change that I didn't outline?  I think there's a 
state machine class in runners core that would need to be deleted?  Is there 
anything else like a registry of triggers that might need to be edited?
Do you have a preference for the name of the final merged class?  Would you 
want to just make the top level class a Trigger subclass directly with private 
constructors? 
 
  On Thu, Apr 20, 2017 at 15:19, Kenneth Knowles 
wrote:   Hi Wesley,

I agree with getting rid of FromEndOfWindow and just having AfterWatermark.
I have wanted to do this for a while, so if someone else does it that would
be great.

This comes from the separation of OnceTrigger (triggers that fire just
once) and Trigger (triggers that may fire multiple times). But this
separation is not that important and a little fake - if you use
FromEndOfWindow and nonzero allowed lateness you can get multiple panes due
to final panes. The more important distinction is actually between triggers
that finish and triggers that don't finish...

Anyhow, even if we care about once vs not-once, we can just do analysis on
the constructed trigger instead of making the types and classes complex.

Kenn


On Thu, Apr 20, 2017 at 12:59 AM, Wesley Tanaka 
wrote:

> AfterWatermarkEarlyAndLate has:
>    public AfterWatermarkEarlyAndLate withEarlyFirings(OnceTrigger
> earlyTrigger)
>    public AfterWatermarkEarlyAndLate withLateFirings(OnceTrigger
> lateTrigger)
> FromEndOfWindow has:
>    public AfterWatermarkEarlyAndLate withEarlyFirings(OnceTrigger
> earlyFirings)  public AfterWatermarkEarlyAndLate
> withLateFirings(OnceTrigger lateFirings)
> As a means of trying to understand the trigger API, I am curious if it
> might make sense conceptually to:
> 1. rename AfterWatermarkEarlyAndLate => EarlyAndLate since it is already
> an inner class of AfterWatermark2. get rid of FromEndOfWindow by having
> AfterWatermark.pastEndOfWindow() return a new 
> AfterWatermarkEarlyAndLate(Never.ever(),
> null);
> Or is there a value to having FromEndOfWindow be separate that I am not
> understanding?
>
> ---
> Wesley Tanaka
> https://wtanaka.com/
  


Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-20 Thread Jean-Baptiste Onofré

No problem ;)

Happy to review if needed ;)

Regards
JB

On 04/21/2017 07:50 AM, Eugene Kirpichov wrote:

Guys, apologies, but I already have Kinesis in review, and Pubsub ready for
review. I'm afraid there's not much left for volunteers to take on right
now.

On Thu, Apr 20, 2017 at 10:47 PM Jean-Baptiste Onofré 
wrote:


Cool, I gonna take a look on PubSub later today (I would like to finish
CassandraIO, HDFS refactoring and Spark 2 support first ;)).

Regards
JB

On 04/21/2017 06:03 AM, tarush grover wrote:

Hi,

I can take kinesis one.

Regards,
Tarush


On Thu, 20 Apr 2017 at 11:18 AM, Jean-Baptiste Onofré 
wrote:


Gonna take a look on the pending IOs.

Thanks !
Regards
JB

On 04/19/2017 10:05 PM, Eugene Kirpichov wrote:

A few more knocked down
- I finished Map/FlatMap, XML, TFRecordIO
- I'm working on CountingInput; it's nontrivial.
- Reuven is working on Text/Avro
- @peay is working on removing coders from KafkaIO

Kinesis and PubsubIO remain; of these, Kinesis is the easier one.

Any takers?

On Fri, Apr 7, 2017 at 10:47 PM Jean-Baptiste Onofré 
wrote:


Hi Eugene,

thanks for the update. I'm volunteer to tackle some those IOs (and

make

them
conform with PTransform style guide). I'm pretty sure other people

will

jump on ;)

Regards
JB

On 04/08/2017 12:20 AM, Eugene Kirpichov wrote:

Hey all,

More progress has been made and we're nearing completion. ParDo,

BigQueryIO

and Window are fixed; Map/FlatMapElements are in review.

The remaining unclaimed ones are all IOs of some form, and here's a

list.

I've marked them all as "starter" in JIRA.

XML - https://issues.apache.org/jira/browse/BEAM-1914
TFRecordIO (Tensorflow) -

https://issues.apache.org/jira/browse/BEAM-1913

KinesisIO - https://issues.apache.org/jira/browse/BEAM-1428
PubsubIO - https://issues.apache.org/jira/browse/BEAM-1415
CountingInput - https://issues.apache.org/jira/browse/BEAM-1414

https://github.com/apache/beam/pull/2149 , which fixes BigQueryIO,

is

a

good model to follow when taking these on, as well as e.g.
https://github.com/apache/beam/pull/1927 (TextIO)

These are all actually easy to fix, but need volunteers (I do not

have

time

to fix all of these myself, but happy to be a reviewer - @jkff).
Let's finish this up in time for the first Beam stable release, so

Beam's

stable API surface is consistent and polished!







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-20 Thread tarush grover
No worries.

Regards,
Tarush

On Fri, 21 Apr 2017 at 11:20 AM, Eugene Kirpichov
 wrote:

> Guys, apologies, but I already have Kinesis in review, and Pubsub ready for
> review. I'm afraid there's not much left for volunteers to take on right
> now.
>
> On Thu, Apr 20, 2017 at 10:47 PM Jean-Baptiste Onofré 
> wrote:
>
> > Cool, I gonna take a look on PubSub later today (I would like to finish
> > CassandraIO, HDFS refactoring and Spark 2 support first ;)).
> >
> > Regards
> > JB
> >
> > On 04/21/2017 06:03 AM, tarush grover wrote:
> > > Hi,
> > >
> > > I can take kinesis one.
> > >
> > > Regards,
> > > Tarush
> > >
> > >
> > > On Thu, 20 Apr 2017 at 11:18 AM, Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > >> Gonna take a look on the pending IOs.
> > >>
> > >> Thanks !
> > >> Regards
> > >> JB
> > >>
> > >> On 04/19/2017 10:05 PM, Eugene Kirpichov wrote:
> > >>> A few more knocked down
> > >>> - I finished Map/FlatMap, XML, TFRecordIO
> > >>> - I'm working on CountingInput; it's nontrivial.
> > >>> - Reuven is working on Text/Avro
> > >>> - @peay is working on removing coders from KafkaIO
> > >>>
> > >>> Kinesis and PubsubIO remain; of these, Kinesis is the easier one.
> > >>>
> > >>> Any takers?
> > >>>
> > >>> On Fri, Apr 7, 2017 at 10:47 PM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >>> wrote:
> > >>>
> >  Hi Eugene,
> > 
> >  thanks for the update. I'm volunteer to tackle some those IOs (and
> > make
> >  them
> >  conform with PTransform style guide). I'm pretty sure other people
> > will
> >  jump on ;)
> > 
> >  Regards
> >  JB
> > 
> >  On 04/08/2017 12:20 AM, Eugene Kirpichov wrote:
> > > Hey all,
> > >
> > > More progress has been made and we're nearing completion. ParDo,
> >  BigQueryIO
> > > and Window are fixed; Map/FlatMapElements are in review.
> > >
> > > The remaining unclaimed ones are all IOs of some form, and here's a
> > >> list.
> > > I've marked them all as "starter" in JIRA.
> > >
> > > XML - https://issues.apache.org/jira/browse/BEAM-1914
> > > TFRecordIO (Tensorflow) -
> >  https://issues.apache.org/jira/browse/BEAM-1913
> > > KinesisIO - https://issues.apache.org/jira/browse/BEAM-1428
> > > PubsubIO - https://issues.apache.org/jira/browse/BEAM-1415
> > > CountingInput - https://issues.apache.org/jira/browse/BEAM-1414
> > >
> > > https://github.com/apache/beam/pull/2149 , which fixes BigQueryIO,
> > is
> > >> a
> > > good model to follow when taking these on, as well as e.g.
> > > https://github.com/apache/beam/pull/1927 (TextIO)
> > >
> > > These are all actually easy to fix, but need volunteers (I do not
> > have
> >  time
> > > to fix all of these myself, but happy to be a reviewer - @jkff).
> > > Let's finish this up in time for the first Beam stable release, so
> > >> Beam's
> > > stable API surface is consistent and polished!
> > >
> > 
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-20 Thread Eugene Kirpichov
Guys, apologies, but I already have Kinesis in review, and Pubsub ready for
review. I'm afraid there's not much left for volunteers to take on right
now.

On Thu, Apr 20, 2017 at 10:47 PM Jean-Baptiste Onofré 
wrote:

> Cool, I gonna take a look on PubSub later today (I would like to finish
> CassandraIO, HDFS refactoring and Spark 2 support first ;)).
>
> Regards
> JB
>
> On 04/21/2017 06:03 AM, tarush grover wrote:
> > Hi,
> >
> > I can take kinesis one.
> >
> > Regards,
> > Tarush
> >
> >
> > On Thu, 20 Apr 2017 at 11:18 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Gonna take a look on the pending IOs.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On 04/19/2017 10:05 PM, Eugene Kirpichov wrote:
> >>> A few more knocked down
> >>> - I finished Map/FlatMap, XML, TFRecordIO
> >>> - I'm working on CountingInput; it's nontrivial.
> >>> - Reuven is working on Text/Avro
> >>> - @peay is working on removing coders from KafkaIO
> >>>
> >>> Kinesis and PubsubIO remain; of these, Kinesis is the easier one.
> >>>
> >>> Any takers?
> >>>
> >>> On Fri, Apr 7, 2017 at 10:47 PM Jean-Baptiste Onofré 
> >>> wrote:
> >>>
>  Hi Eugene,
> 
>  thanks for the update. I'm volunteer to tackle some those IOs (and
> make
>  them
>  conform with PTransform style guide). I'm pretty sure other people
> will
>  jump on ;)
> 
>  Regards
>  JB
> 
>  On 04/08/2017 12:20 AM, Eugene Kirpichov wrote:
> > Hey all,
> >
> > More progress has been made and we're nearing completion. ParDo,
>  BigQueryIO
> > and Window are fixed; Map/FlatMapElements are in review.
> >
> > The remaining unclaimed ones are all IOs of some form, and here's a
> >> list.
> > I've marked them all as "starter" in JIRA.
> >
> > XML - https://issues.apache.org/jira/browse/BEAM-1914
> > TFRecordIO (Tensorflow) -
>  https://issues.apache.org/jira/browse/BEAM-1913
> > KinesisIO - https://issues.apache.org/jira/browse/BEAM-1428
> > PubsubIO - https://issues.apache.org/jira/browse/BEAM-1415
> > CountingInput - https://issues.apache.org/jira/browse/BEAM-1414
> >
> > https://github.com/apache/beam/pull/2149 , which fixes BigQueryIO,
> is
> >> a
> > good model to follow when taking these on, as well as e.g.
> > https://github.com/apache/beam/pull/1927 (TextIO)
> >
> > These are all actually easy to fix, but need volunteers (I do not
> have
>  time
> > to fix all of these myself, but happy to be a reviewer - @jkff).
> > Let's finish this up in time for the first Beam stable release, so
> >> Beam's
> > stable API surface is consistent and polished!
> >
> 
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Let's make Beam transforms comply with PTransform Style Guide

2017-04-20 Thread Jean-Baptiste Onofré
Cool, I gonna take a look on PubSub later today (I would like to finish 
CassandraIO, HDFS refactoring and Spark 2 support first ;)).


Regards
JB

On 04/21/2017 06:03 AM, tarush grover wrote:

Hi,

I can take kinesis one.

Regards,
Tarush


On Thu, 20 Apr 2017 at 11:18 AM, Jean-Baptiste Onofré 
wrote:


Gonna take a look on the pending IOs.

Thanks !
Regards
JB

On 04/19/2017 10:05 PM, Eugene Kirpichov wrote:

A few more knocked down
- I finished Map/FlatMap, XML, TFRecordIO
- I'm working on CountingInput; it's nontrivial.
- Reuven is working on Text/Avro
- @peay is working on removing coders from KafkaIO

Kinesis and PubsubIO remain; of these, Kinesis is the easier one.

Any takers?

On Fri, Apr 7, 2017 at 10:47 PM Jean-Baptiste Onofré 
wrote:


Hi Eugene,

thanks for the update. I'm volunteer to tackle some those IOs (and make
them
conform with PTransform style guide). I'm pretty sure other people will
jump on ;)

Regards
JB

On 04/08/2017 12:20 AM, Eugene Kirpichov wrote:

Hey all,

More progress has been made and we're nearing completion. ParDo,

BigQueryIO

and Window are fixed; Map/FlatMapElements are in review.

The remaining unclaimed ones are all IOs of some form, and here's a

list.

I've marked them all as "starter" in JIRA.

XML - https://issues.apache.org/jira/browse/BEAM-1914
TFRecordIO (Tensorflow) -

https://issues.apache.org/jira/browse/BEAM-1913

KinesisIO - https://issues.apache.org/jira/browse/BEAM-1428
PubsubIO - https://issues.apache.org/jira/browse/BEAM-1415
CountingInput - https://issues.apache.org/jira/browse/BEAM-1414

https://github.com/apache/beam/pull/2149 , which fixes BigQueryIO, is

a

good model to follow when taking these on, as well as e.g.
https://github.com/apache/beam/pull/1927 (TextIO)

These are all actually easy to fix, but need volunteers (I do not have

time

to fix all of these myself, but happy to be a reviewer - @jkff).
Let's finish this up in time for the first Beam stable release, so

Beam's

stable API surface is consistent and polished!







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Towards a spec for robust streaming SQL, Part 1

2017-04-20 Thread Tyler Akidau
Hello Beam, Calcite, and Flink dev lists!

Apologies for the big cross post, but I thought this might be something all
three communities would find relevant.

Beam is finally making progress on a SQL DSL utilizing Calcite, thanks to
Mingmin Xu. As you can imagine, we need to come to some conclusion about
how to elegantly support the full suite of streaming functionality in the
Beam model in via Calcite SQL. You folks in the Flink community have been
pushing on this (e.g., adding windowing constructs, amongst others, thank
you! :-), but from my understanding we still don't have a full spec for how
to support robust streaming in SQL (including but not limited to, e.g., a
triggers analogue such as EMIT).

I've been spending a lot of time thinking about this and have some opinions
about how I think it should look that I've already written down, so I
volunteered to try to drive forward agreement on a general streaming SQL
spec between our three communities (well, technically I volunteered to do
that w/ Beam and Calcite, but I figured you Flink folks might want to join
in since you're going that direction already anyway and will have useful
insights :-).

My plan was to do this by sharing two docs:

   1. The Beam Model : Streams & Tables - This one is for context, and
   really only mentions SQL in passing. But it describes the relationship
   between the Beam Model and the "streams & tables" way of thinking, which
   turns out to be useful in understanding what robust streaming in SQL might
   look like. Many of you probably already know some or all of what's in here,
   but I felt it was necessary to have it all written down in order to justify
   some of the proposals I wanted to make in the second doc.

   2. A streaming SQL spec for Calcite - The goal for this doc is that it
   would become a general specification for what robust streaming SQL in
   Calcite should look like. It would start out as a basic proposal of what
   things *could* look like (combining both what things look like now as well
   as a set of proposed changes for the future), and we could all iterate on
   it together until we get to something we're happy with.

At this point, I have doc #1 ready, and it's a bit of a monster, so I
figured I'd share it and let folks hack at it with comments if they have
any, while I try to get the second doc ready in the meantime. As part of
getting doc #2 ready, I'll be starting a separate thread to try to gather
input on what things are already in flight for streaming SQL across the
various communities, to make sure the proposal captures everything that's
going on as accurately as it can.

If you have any questions or comments, I'm interested to hear them.
Otherwise, here's doc #1, "The Beam Model : Streams & Tables":

  http://s.apache.org/beam-streams-tables

-Tyler


Re: Can application specify how watermarks should be generated?

2017-04-20 Thread Kenneth Knowles
You want to use an existing source but just change the watermark tracking?
You can't do this in your pipeline right now, but you could probably easily
wrap a source and proxy every method except getWatermark, though I have
never tried.

The general feature that might address this is discussed a little on
https://issues.apache.org/jira/browse/BEAM-644

There are also related ideas in the discussions about Splittable DoFn.

Kenn

On Thu, Apr 20, 2017 at 1:42 PM, Shen Li  wrote:

> Hi,
>
> Can application developers provide classes/methods to specify how to
> generate watermarks from sources, and how to aggregate watermarks from
> multiple input PCollections? Say, emit at most 1 watermark per second, or
> create watermarks that are 5 seconds older than the latest tuple's
> timestamp?
>
> Thanks,
>
> Shen
>


Jenkins build is back to normal : beam_SeedJob #228

2017-04-20 Thread Apache Jenkins Server
See 



Can application specify how watermarks should be generated?

2017-04-20 Thread Shen Li
Hi,

Can application developers provide classes/methods to specify how to
generate watermarks from sources, and how to aggregate watermarks from
multiple input PCollections? Say, emit at most 1 watermark per second, or
create watermarks that are 5 seconds older than the latest tuple's
timestamp?

Thanks,

Shen


Re: Should you always have a separate PTransform class for a new transform?

2017-04-20 Thread Eugene Kirpichov
The discussion has been reignited on
https://github.com/apache/beam/pull/2603 .
I can see the pretty strong argument for not replicating all the features
of Combine on the hypothetical helper transform like Count.Globally.

I guess, then, I like Robert's option of having an interface like
"CombiningPTransform" with all the builder methods like withoutDefaults
etc., and returning that from transforms like Count.globally(), so that for
now one can simply return Combine.globally() from it (which would of course
implement this interface) but if later the implementation changes, it can
change to a composite transform builder which would also implement this
interface.

Notes on the PR itself:
- The change above would be a backward-compatible change so it wouldn't
necessarily have to go in before Beam stable release.
- However, if the current PR is submitted in the current form (returning a
Combine.Globally), then afterwards this change *would* be incompatible and
would have to be done before stable release.
So I'd suggest to hold off the PR until we reach consensus here.

On Wed, Feb 8, 2017 at 2:09 PM Eugene Kirpichov 
wrote:

> I think the value in having Mean.perKey() in addition to Mean.combineFn()
> is that using Mean.perKey() does not require knowledge of the combine
> concept, so easier for users. Generally, when using Beam, to simply compute
> a count or a mean, you should not need to know about combine.
>
> On Wed, Feb 8, 2017 at 2:06 PM Robert Bradshaw 
> wrote:
>
>> On Wed, Feb 8, 2017 at 1:27 PM, Eugene Kirpichov <
>> kirpic...@google.com.invalid> wrote:
>>
>> > So... Would it be fair to say that everybody would be satisfied if we
>> > treated the "glorified combine" transforms (Sum, Count, Mean, Sample,
>> > Latest) the following way:
>> > - For each case, SDK must expose the relevant CombineFn as a static
>> factory
>> > function: e.g. Sum.ofIntegers(), Latest.of(), etc. [it may make sense to
>> > discuss the naming of these CombineFn factory functions so they are not
>> > confused with neighboring PTransform factory functions]
>> >
>>
>> Or just expose the CombineFn class itself, rather than a factory method to
>> construct it? As for naming (if we go for factories), maybe
>> Mean.combineFn().
>>
>>
>> > - Expose also a PTransform factory function, returning either a
>> > PTransform (if there's nothing to configure on the
>> > transform), or its concrete subclass (if there is something to
>> configure).
>> > Transition from PTransform to subclass is always possible in a
>> > backward-compatible way, so it's safe to err on the side of returning
>> > PTransform.
>> > - Do *not* return concrete types from the PTransform factory function
>> such
>> > as Combine.Globally - instead, if the user has an advanced use case and
>> > wants to configure the combine, they should apply Combine.globally()
>> > themselves to your exposed CombineFn.
>> >
>> > In particular, this means:
>> > - Sum, Mean stay unchanged
>> > - Count, Sample, Latest should additionally expose their CombineFn's:
>> > Count.of()? or how should we name them?
>> > - Count.globally() and Count.perKey() should be changed from returning
>> > Combine.Globally and Combine.PerKey to returning the more general type
>> > PTransform<..., ...>. Cases where the user relies on them returning a
>> > Combine should be changed to applying the Combine manually.
>> >
>> > Makes sense?
>> >
>>
>> So the value in Mean.perKey() is solely in the fact that it's pithier for
>> Combine.perKey(Mean.combineFn())? Or do we assume that the former could
>> possibly get a new implementation, but the latter should be used if
>> additional configuration is needed?
>>
>> On Tue, Feb 7, 2017 at 10:50 PM Dan Halperin > >
>> > wrote:
>> >
>> > > I am generally persuaded to at least change my number to something
>> like 0
>> > > :). These are pretty reasonable perspectives, especially pointing out
>> > that
>> > > withSideInputs is pretty useless in Count ;)
>> > >
>> > > On Tue, Feb 7, 2017 at 10:04 PM, Kenneth Knowles
>> > > >
>> > > wrote:
>> > >
>> > > >  On Tue, Feb 7, 2017 at 8:43 PM, Eugene Kirpichov <
>> > > >
>> > > > > kirpic...@google.com.invalid> wrote:
>> > > > > I must admit I didn't quite
>> > > > > understand the option of "implements CombiningTransform".
>> > > > >
>> > > >
>> > > > On Tue, Feb 7, 2017 at 9:04 PM, Robert Bradshaw
>> > > > > > > > > wrote:
>> > > >
>> > > > > Sorry, I'll try to clarify. ... <>...
>> > > > >
>> > > >
>> > > > FWIW this is also what I meant by my 3b "HasACombineInsideIt". The
>> > > > difference between my suggestion and Robert's is that I used a self
>> > type,
>> > > > which is really not worth the trouble (in fact, it blocked me from
>> > > bringing
>> > > > this to fruition last time we had this same conversation).
>> > > >
>> > > > Kenn
>> > > >
>> > >
>> >
>>
>


AfterWatermarkEarlyAndLate

2017-04-20 Thread Wesley Tanaka
AfterWatermarkEarlyAndLate has:
   public AfterWatermarkEarlyAndLate withEarlyFirings(OnceTrigger earlyTrigger)
   public AfterWatermarkEarlyAndLate withLateFirings(OnceTrigger lateTrigger)
FromEndOfWindow has:
   public AfterWatermarkEarlyAndLate withEarlyFirings(OnceTrigger earlyFirings) 
  public AfterWatermarkEarlyAndLate withLateFirings(OnceTrigger lateFirings)
As a means of trying to understand the trigger API, I am curious if it might 
make sense conceptually to:
1. rename AfterWatermarkEarlyAndLate => EarlyAndLate since it is already an 
inner class of AfterWatermark2. get rid of FromEndOfWindow by having 
AfterWatermark.pastEndOfWindow() return a new 
AfterWatermarkEarlyAndLate(Never.ever(), null);
Or is there a value to having FromEndOfWindow be separate that I am not 
understanding?

---
Wesley Tanaka
https://wtanaka.com/