Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Jean-Baptiste Onofré
Good idea.

Regards
JB

On Oct 25, 2016, at 17:57, Aljoscha Krettek wrote:
>I think we might need to update the capability matrix with some of the
>new features that have popped up. Immediate things that come to mind are:
> * Timer/State API for user DoFns (coupled with new-style DoFn) (not yet
>completely in master)
> * SplittableDoFn
>
>This would allow tracking the progress on each of these for each runner
>and would not require hunting for that information in email threads.
>
>On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré 
>wrote:
>
>> +1. For me it's one of the most important points for the new website.
>> We should give a clear and exhaustive list of what we have, both for
>> runners and IOs (with supported features).
>>
>> Regards
>> JB
>>
>> On Oct 24, 2016, 21:52, at 21:52, "Ismaël Mejía" 
>> wrote:
>> >Hello,
>> >
>> >I am really happy to see new runners been contributed to our
>community
>> >(e.g. GearPump and Apex recently). We have not discussed a lot about
>> >the
>> >current capabilities of both runners.
>> >
>> >Following the recent discussion about making ongoing work more
>explicit
>> >in
>> >the mailing list, I would like to ask the people involved about the
>> >current
>> >status of them, I think it is important to discuss this (apart of
>> >creating
>> >the given JIRAs + updating the capability matrix docs) because more
>> >people
>> >can eventually jump and give a hand on open issues.
>> >
>> >I remember there was a google doc for the  capabilities of each
>runner,
>> >is
>> >this doc still available (sorry I lost the link). I suppose that
>once
>> >these
>> >ongoing runners mature we can add this doc also to the website.
>> >https://beam.apache.org/learn/runners/capability-matrix/
>> >
>> >Regards,
>> >Ismaël
>> >
>> >ps. @Amit, given that the spark 2 (Dataset based) runner has also a
>> >feature
>> >branch, if you consider it worth, can you please share a bit about
>that
>> >work too.
>> >
>> >ps2. Can anyone please share the link to the google doc I was
>talking
>> >about, I can't find it after the recent changes to the website.
>>


Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Manu Zhang
We usually keep docs together with the source code so that each
release has its own versioned docs. If the capability matrix is treated like
the rest of the code, we can update it as we add new features. The same applies
to other docs, such as those for a new IO. We could make it a requirement in the
PR template.

Thanks,
Manu


On Wed, Oct 26, 2016 at 7:24 AM Thomas Weise  wrote:

> I'm planning to take up the discussion about Apex runner current state and
> proposed next steps in a separate thread.
>
> Thanks,
> Thomas
>
>
> On Tue, Oct 25, 2016 at 10:32 AM, Amit Sela  wrote:
>
> > SparkRunner status:
> >
> > V1 (Spark 1.6.x - DStream/RDD API):
> > *Batch*: Full model support for batch; a continuous ROS testing setup is in
> > progress so that CI will validate it constantly.
> > *Streaming*: Support for UnboundedSource is in review; work on triggers
> > and accumulation modes is starting now.
> >
> > V2 (Spark 2.x - Dataset API):
> > This is on hold for now, as the Spark 2.0 Dataset API for streaming (AKA
> > "Structured Streaming") is marked Alpha.
> > In addition, some basic properties that will be required to properly
> > support Beam are still missing from the Dataset API:
> >
> >- Stateful operators.
> >- Encoders (Spark's new schema-based coders) optimization support for
> >classes that are a bit more sophisticated than POJO's (generics, inner
> >classes, etc.).
> >
> > Also waiting to see if Watermarks and purging late/stale data will be
> > introduced in 2.1 (currently the Dataset grows indefinitely which is not
> > something acceptable for production applications).
> > Once this becomes clearer (2.1 release?) I will get back to working on
> > this because in general the Dataset API is preferred as it is actually a
> > real unified API for batch and streaming (and the schema-based
> > optimizations are also interesting).
> >
> > I hope this gives a clear view of the SparkRunner status. Feel free to
> > ping me for more details on the user/dev list or Slack.
> >
> > Thanks,
> > Amit
> >
> > On Tue, Oct 25, 2016 at 6:57 PM Aljoscha Krettek 
> > wrote:
> >
> > > I think we might need to update the capability matrix with some of the
> > new
> > > features that have popped up. Immediate things that come to mind are:
> > >  * Timer/State API for user DoFns (coupled with new-style DoFn) (not
> yet
> > > completely in master)
> > >  * SplittableDoFn
> > >
> > > This would allow tracking the process in each of these for each runner
> > and
> > > would not require hunting for that information in email threads.
> > >
> > > On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré 
> > wrote:
> > >
> > > > +1. For me it's one of the most important point for the new website.
> We
> > > > should give a clear and exhaustive list of what we have, both for
> > runners
> > > > and IOs (with supported features).
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On Oct 24, 2016, 21:52, at 21:52, "Ismaël Mejía" 
> > > > wrote:
> > > > >Hello,
> > > > >
> > > > >I am really happy to see new runners been contributed to our
> community
> > > > >(e.g. GearPump and Apex recently). We have not discussed a lot about
> > > > >the
> > > > >current capabilities of both runners.
> > > > >
> > > > >Following the recent discussion about making ongoing work more
> > explicit
> > > > >in
> > > > >the mailing list, I would like to ask the people involved about the
> > > > >current
> > > > >status of them, I think it is important to discuss this (apart of
> > > > >creating
> > > > >the given JIRAs + updating the capability matrix docs) because more
> > > > >people
> > > > >can eventually jump and give a hand on open issues.
> > > > >
> > > > >I remember there was a google doc for the  capabilities of each
> > runner,
> > > > >is
> > > > >this doc still available (sorry I lost the link). I suppose that
> once
> > > > >these
> > > > >ongoing runners mature we can add this doc also to the website.
> > > > >https://beam.apache.org/learn/runners/capability-matrix/
> > > > >
> > > > >Regards,
> > > > >Ismaël
> > > > >
> > > > >ps. @Amit, given that the spark 2 (Dataset based) runner has also a
> > > > >feature
> > > > >branch, if you consider it worth, can you please share a bit about
> > that
> > > > >work too.
> > > > >
> > > > >ps2. Can anyone please share the link to the google doc I was
> talking
> > > > >about, I can't find it after the recent changes to the website.
> > > >
> > >
> >
>


Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Amit Sela
SparkRunner status:

V1 (Spark 1.6.x - DStream/RDD API):
*Batch*: Full model support for batch; a continuous ROS testing setup is in
progress so that CI will validate it constantly.
*Streaming*: Support for UnboundedSource is in review; work on triggers and
accumulation modes is starting now.

V2 (Spark 2.x - Dataset API):
This is on hold for now, as the Spark 2.0 Dataset API for streaming (AKA
"Structured Streaming") is marked Alpha.
In addition, some basic properties that will be required to properly support
Beam are still missing from the Dataset API:

   - Stateful operators.
   - Encoders (Spark's new schema-based coders) optimization support for
   classes that are a bit more sophisticated than POJO's (generics, inner
   classes, etc.).

I am also waiting to see whether watermarks and purging of late/stale data will
be introduced in 2.1 (currently the Dataset grows indefinitely, which is not
acceptable for production applications).
Once this becomes clearer (2.1 release?) I will get back to working on
this, because in general the Dataset API is preferred: it is actually a
real unified API for batch and streaming (and the schema-based
optimizations are also interesting).

I hope this gives a clear view of the SparkRunner status. Feel free to ping
me for more details on the user/dev list or Slack.

Thanks,
Amit

On Tue, Oct 25, 2016 at 6:57 PM Aljoscha Krettek 
wrote:

> I think we might need to update the capability matrix with some of the new
> features that have popped up. Immediate things that come to mind are:
>  * Timer/State API for user DoFns (coupled with new-style DoFn) (not yet
> completely in master)
>  * SplittableDoFn
>
> This would allow tracking the process in each of these for each runner and
> would not require hunting for that information in email threads.
>
> On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré  wrote:
>
> > +1. For me it's one of the most important point for the new website. We
> > should give a clear and exhaustive list of what we have, both for runners
> > and IOs (with supported features).
> >
> > Regards
> > JB
> >
> > On Oct 24, 2016, 21:52, at 21:52, "Ismaël Mejía" 
> > wrote:
> > >Hello,
> > >
> > >I am really happy to see new runners been contributed to our community
> > >(e.g. GearPump and Apex recently). We have not discussed a lot about
> > >the
> > >current capabilities of both runners.
> > >
> > >Following the recent discussion about making ongoing work more explicit
> > >in
> > >the mailing list, I would like to ask the people involved about the
> > >current
> > >status of them, I think it is important to discuss this (apart of
> > >creating
> > >the given JIRAs + updating the capability matrix docs) because more
> > >people
> > >can eventually jump and give a hand on open issues.
> > >
> > >I remember there was a google doc for the  capabilities of each runner,
> > >is
> > >this doc still available (sorry I lost the link). I suppose that once
> > >these
> > >ongoing runners mature we can add this doc also to the website.
> > >https://beam.apache.org/learn/runners/capability-matrix/
> > >
> > >Regards,
> > >Ismaël
> > >
> > >ps. @Amit, given that the spark 2 (Dataset based) runner has also a
> > >feature
> > >branch, if you consider it worth, can you please share a bit about that
> > >work too.
> > >
> > >ps2. Can anyone please share the link to the google doc I was talking
> > >about, I can't find it after the recent changes to the website.
> >
>


Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Aljoscha Krettek
I think we might need to update the capability matrix with some of the new
features that have popped up. Immediate things that come to mind are:
 * Timer/State API for user DoFns (coupled with new-style DoFn) (not yet
completely in master)
 * SplittableDoFn

This would allow tracking the progress on each of these for each runner and
would not require hunting for that information in email threads.

On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré  wrote:

> +1. For me it's one of the most important point for the new website. We
> should give a clear and exhaustive list of what we have, both for runners
> and IOs (with supported features).
>
> Regards
> JB
>
> On Oct 24, 2016, 21:52, at 21:52, "Ismaël Mejía" 
> wrote:
> >Hello,
> >
> >I am really happy to see new runners been contributed to our community
> >(e.g. GearPump and Apex recently). We have not discussed a lot about
> >the
> >current capabilities of both runners.
> >
> >Following the recent discussion about making ongoing work more explicit
> >in
> >the mailing list, I would like to ask the people involved about the
> >current
> >status of them, I think it is important to discuss this (apart of
> >creating
> >the given JIRAs + updating the capability matrix docs) because more
> >people
> >can eventually jump and give a hand on open issues.
> >
> >I remember there was a google doc for the  capabilities of each runner,
> >is
> >this doc still available (sorry I lost the link). I suppose that once
> >these
> >ongoing runners mature we can add this doc also to the website.
> >https://beam.apache.org/learn/runners/capability-matrix/
> >
> >Regards,
> >Ismaël
> >
> >ps. @Amit, given that the spark 2 (Dataset based) runner has also a
> >feature
> >branch, if you consider it worth, can you please share a bit about that
> >work too.
> >
> >ps2. Can anyone please share the link to the google doc I was talking
> >about, I can't find it after the recent changes to the website.
>


Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Jean-Baptiste Onofré
+1. For me it's one of the most important points for the new website. We should
give a clear and exhaustive list of what we have, both for runners and IOs
(with supported features).

Regards
JB

On Oct 24, 2016, at 21:52, "Ismaël Mejía" wrote:
>Hello,
>
>I am really happy to see new runners been contributed to our community
>(e.g. GearPump and Apex recently). We have not discussed a lot about
>the
>current capabilities of both runners.
>
>Following the recent discussion about making ongoing work more explicit
>in
>the mailing list, I would like to ask the people involved about the
>current
>status of them, I think it is important to discuss this (apart of
>creating
>the given JIRAs + updating the capability matrix docs) because more
>people
>can eventually jump and give a hand on open issues.
>
>I remember there was a google doc for the  capabilities of each runner,
>is
>this doc still available (sorry I lost the link). I suppose that once
>these
>ongoing runners mature we can add this doc also to the website.
>https://beam.apache.org/learn/runners/capability-matrix/
>
>Regards,
>Ismaël
>
>ps. @Amit, given that the spark 2 (Dataset based) runner has also a
>feature
>branch, if you consider it worth, can you please share a bit about that
>work too.
>
>ps2. Can anyone please share the link to the google doc I was talking
>about, I can't find it after the recent changes to the website.


Re: [DISCUSS] Current ongoing work on runners

2016-10-24 Thread Robert Bradshaw
I think it would be worth publishing a compatibility matrix, if not on
the main site, as part of the branch itself.

Even better would be if the compatibility matrix was automatically
deduced based on a suite of tests that each runner could (attempt to)
pass.
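
A minimal sketch of how such a matrix might be deduced from test results,
assuming each runner reports (runner, capability, passed) records from a shared
suite. The runner names, the record format, and the yes/partial/no statuses are
all hypothetical and are not taken from any existing Beam tooling:

```python
from collections import defaultdict

# Hypothetical test outcomes: (runner, capability, passed). In practice these
# would come from each runner's run of a shared test suite.
RESULTS = [
    ("RunnerA", "ParDo", True),
    ("RunnerA", "SplittableDoFn", False),
    ("RunnerB", "ParDo", True),
    ("RunnerB", "ParDo", False),
]


def build_matrix(results):
    """Map capability -> runner -> 'yes', 'partial' or 'no'."""
    counts = defaultdict(lambda: [0, 0])  # (capability, runner) -> [passed, total]
    for runner, capability, passed in results:
        entry = counts[(capability, runner)]
        entry[0] += int(passed)
        entry[1] += 1
    matrix = defaultdict(dict)
    for (capability, runner), (ok, total) in counts.items():
        # 'yes' only if every test passed, 'no' if none did.
        status = "yes" if ok == total else "partial" if ok else "no"
        matrix[capability][runner] = status
    return {capability: dict(row) for capability, row in matrix.items()}


def render(matrix, runners):
    """Return the matrix as plain-text rows, one per capability."""
    lines = ["capability | " + " | ".join(runners)]
    for capability in sorted(matrix):
        # A runner with no recorded tests for a capability shows as 'untested'.
        cells = [matrix[capability].get(r, "untested") for r in runners]
        lines.append(capability + " | " + " | ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(build_matrix(RESULTS), ["RunnerA", "RunnerB"]))
```

Keeping "untested" distinct from "no" would let the matrix show where a runner
has not attempted a capability at all, rather than where it tried and failed.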

On Mon, Oct 24, 2016 at 12:52 PM, Ismaël Mejía  wrote:
> Hello,
>
> I am really happy to see new runners being contributed to our community
> (e.g. GearPump and Apex recently). We have not discussed much about the
> current capabilities of both runners.
>
> Following the recent discussion about making ongoing work more explicit on
> the mailing list, I would like to ask the people involved about their current
> status. I think it is important to discuss this (apart from creating
> the given JIRAs + updating the capability matrix docs) because more people
> can eventually jump in and give a hand on open issues.
>
> I remember there was a Google doc for the capabilities of each runner. Is
> this doc still available? (Sorry, I lost the link.) I suppose that once these
> ongoing runners mature we can also add this doc to the website.
> https://beam.apache.org/learn/runners/capability-matrix/
>
> Regards,
> Ismaël
>
> ps. @Amit, given that the Spark 2 (Dataset-based) runner also has a feature
> branch, if you consider it worthwhile, can you please share a bit about that
> work too?
>
> ps2. Can anyone please share the link to the google doc I was talking
> about, I can't find it after the recent changes to the website.
>