Re: Starter issue

2017-01-11 Thread Davor Bonaci
Welcome Tim -- it's great to have you join our community!

I found your name in JIRA and assigned the issue to you. Thanks for your
(future) contribution.

On Wed, Jan 11, 2017 at 11:27 PM, Jean-Baptiste Onofré 
wrote:

> Hi Tim
>
> What's your Jira id ?
>
> Thanks
> Regards
> JB
>
> On Jan 12, 2017, 06:48, at 06:48, Tim Taschke  wrote:
> >Hi,
> >
> >I would like to get started with contributing and thought I'd start
> >with this, if that is ok:
> >https://issues.apache.org/jira/browse/BEAM-1056
> >
> >Could somebody please assign it to me?
> >
> >Best regards,
> >Tim
>


Re: Starter issue

2017-01-11 Thread Jean-Baptiste Onofré
Hi Tim

What's your Jira id ?

Thanks
Regards
JB

On Jan 12, 2017, 06:48, at 06:48, Tim Taschke  wrote:
>Hi,
>
>I would like to get started with contributing and thought I'd start
>with this, if that is ok:
>https://issues.apache.org/jira/browse/BEAM-1056
>
>Could somebody please assign it to me?
>
>Best regards,
>Tim


Starter issue

2017-01-11 Thread Tim Taschke
Hi,

I would like to get started with contributing and thought I'd start
with this, if that is ok:
https://issues.apache.org/jira/browse/BEAM-1056

Could somebody please assign it to me?

Best regards,
Tim


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Kenneth Knowles
I think WindowMappingFn (https://issues.apache.org/jira/browse/BEAM-260 /
https://s.apache.org/beam-windowmappingfn-1-pager) is a good fit for this.
There are details to shake out.

One big thing it does not address well (because it is focused only on GC
thresholds) is specifically which windows need their state accessible from
which others, hence how much parallelism is available and how much
communication there is between windows. Today it is somewhat moot because
we don't use that parallelism.
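
To make the window-mapping idea concrete, here is a minimal sketch in plain
Python. It is not Beam's WindowMappingFn API: the names IntervalWindow,
fixed_window, map_to_state_window and state_gc_time are hypothetical, and the
choice of "the state window that contains the last instant of the main
window" is an assumption made for illustration only.

```python
from dataclasses import dataclass

HOUR = 3600
DAY = 24 * HOUR


@dataclass(frozen=True)
class IntervalWindow:
    start: int  # inclusive, seconds since epoch
    end: int    # exclusive, seconds since epoch


def fixed_window(timestamp: int, size: int) -> IntervalWindow:
    """Return the fixed window of the given size that contains timestamp."""
    start = timestamp - (timestamp % size)
    return IntervalWindow(start, start + size)


def map_to_state_window(main_window: IntervalWindow,
                        state_size: int = DAY) -> IntervalWindow:
    """Hypothetical mapping: pick the state window that contains the
    last instant of the main-input window."""
    return fixed_window(main_window.end - 1, state_size)


def state_gc_time(state_window: IntervalWindow,
                  allowed_lateness: int = 0) -> int:
    """Earliest time at which the state could be dropped in this sketch."""
    return state_window.end + allowed_lateness


# An element at 10:15 falls in the hourly 10:00-11:00 main window,
# which maps to the daily state window in this example.
main = fixed_window(10 * HOUR + 15 * 60, HOUR)
state = map_to_state_window(main)
print(main, state, state_gc_time(state, allowed_lateness=HOUR))
```

The mapping answers the GC question (the state lives as long as its own
window plus lateness), but as noted above it says nothing about which other
windows may read that state concurrently.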

On Wed, Jan 11, 2017 at 10:03 AM, Lukasz Cwik 
wrote:

> Bundle processing order is indeterminate; wouldn't accessing user state of
> a different window lead to indeterminate state information? This seems to
> be even weaker than what you get from side inputs that are triggered
> multiple times.
>
> On Wed, Jan 11, 2017 at 10:01 AM, Tyler Akidau  wrote:
>
> > On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw
> > 
> > wrote:
> >
> > > On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik  >
> > > wrote:
> > > > I was under the impression that user state was scoped to a ParDo and
> > was
> > > > not shareable across multiple ParDos. Wouldn't rewindowing require
> the
> > > > usage of multiple ParDos and hence not allow for state to be shared?
> > >
> > > No, you'd do something like
> > >
> > > > pc.apply(WindowInto(grouping_windowing))
> > > >   .apply(GroupByKey())
> > > >   .apply(WindowInto(state_windowing))
> > > >   .apply(ParDo(state_using_dofn))
> > >
> > > You could reify the window after GroupByKey if you need to inspect it.
> > >
> > > However, I'm liking the idea of being able to associate different
> > > WindowFns with particular state tags similar to side inputs (though
> > > the default would be the windowing of the main input).
> > >
> >
> > Can you expand upon what you mean by this? I'm not sure I understand what
> > you're getting at yet.
> >
> > -Tyler
> >
> >
> > >
> > > > On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
> > > > rober...@google.com.invalid> wrote:
> > > >
> > > >> Possibly this could be handled by rewindowing and the current
> > > semantics. If
> > > >> not, maybe treat state like a side input with its own windowing and
> > > window
> > > >> mapping fn.
> > > >>
> > > >> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)" 
> > wrote:
> > > >>
> > > >> > Ben Chambers created BEAM-1261:
> > > >> > --
> > > >> >
> > > >> >  Summary: State API should allow state to be managed
> in
> > > >> > different windows
> > > >> >  Key: BEAM-1261
> > > >> >  URL: https://issues.apache.org/
> > jira/browse/BEAM-1261
> > > >> >  Project: Beam
> > > >> >   Issue Type: Bug
> > > >> >   Components: beam-model, sdk-java-core
> > > >> > Reporter: Ben Chambers
> > > >> > Assignee: Kenneth Knowles
> > > >> >
> > > >> >
> > > >> > For example, even if the elements are being processed in fixed
> > > windows of
> > > >> > an hour, it may be desirable for the state to "roll over" between
> > > windows
> > > >> > (or be available to all windows).
> > > >> >
> > > >> > It will also be necessary to figure out when this state should be
> > > deleted
> > > >> > (TTL? maximum retention?)
> > > >> >
> > > >> > Another problem is how to deal with out of order data. If data
> comes
> > > in
> > > >> > from the 10:00 AM window, should its state changes be visible to
> the
> > > data
> > > >> > in the 9:00 AM window?
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > This message was sent by Atlassian JIRA
> > > >> > (v6.3.4#6332)
> > > >> >
> > > >>
> > >
> >
>


Re: Beam ML

2017-01-11 Thread Suneel Marthi
Mahout will have CSR support in the upcoming release as well as hybrid
GPU/CPU execution depending on the workload.

This discussion is better moved to dev@mahout, since the Mahout project has
long abstracted out Spark, Flink and H2O for distributed linear algebra,
and we are now adding swappable math backends (ViennaCL, cuBLAS, etc.).




On Wed, Jan 11, 2017 at 3:49 PM, Kam Kasravi  wrote:

> Thanks Andrew -
>
> Since you're quite familiar with how Mahout backends (Flink, Spark, H2O)
> bind and enable DRM and the API Mahout/Samsara exposes -
> I think the end goal would be to surface a JAVA/Python API as well as
> outline a declarative syntax that the various runners can adhere to
> (perhaps via some kind of execution plan).
>
> This would be added to the current design document within a 'low-level /
> linear algebra' section.
>
> What are your thoughts (and Mahout's thoughts) on SystemML's approach to
> DRM and Declarative Machine Learning (DML) in general?
> http://www.vldb.org/pvldb/vol9/p960-elgohary.pdf
> Does Mahout plan on supporting SystemML compressed linear algebra (CLA)?
>
> Would you be willing to help (along with others including Vladisav) in
> providing a common ML design document for Beam?
>
>
>
> On Wed, Jan 11, 2017 at 12:07 PM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
> > That's right; what other info do you think would be useful?
> >
> > On Tue, Jan 10, 2017 at 11:09 AM, Kam Kasravi 
> > wrote:
> >
> > > Thanks Andrew
> > > I think more information about the DRM operations and how persistence
> > > would be done at the runner level. It looks like HDFS or spark caching
> is
> > > currently being used?
> > >
> > > On Monday, January 9, 2017 6:04 PM, Andrew Musselman <
> a...@apache.org
> > >
> > > wrote:
> > >
> > >
> > >  Hello Beam Team,
> > >
> > > Thought you might be interested in the work we've been doing on Mahout,
> > > such as the distributed linear algebra DSL/front-end that can use
> > multiple
> > > back-ends for compute (Spark, Flink, H2O now). See
> > > https://mahout.apache.org/users/environment/out-of-core-reference.html
> > for
> > > an intro.
> > >
> > > We also are working on native CPU/GPU hybrid support and we're close to
> > an
> > > initial release. Let us know if you'd like to know more.
> > >
> > > Thanks and best of luck!
> > >
> > > Best
> > > Andrew Musselman
> > >
> > > On 2017-01-09 12:00 (-0800), Kam Kasravi  wrote:
> > > > Hi Vladisav
> > > >
> > > > I'm the author of the design document. An area we stalled on was
> > creating
> > > a
> > > > common low level linear algebra library that would also include
> > > > optimizations like MKL but across platforms and GPUs.
> > > > Additionally there are efforts underway that provide a scoring API
> vs a
> > > > training API.
> > > >
> > > >- PredictionIO http://predictionio.incubator.apache.org/ (now
> part
> > of
> > > >Salesforce)
> > > >- MLeap https://github.com/combust/mleap
> > > >- PFA - Portable Format for Analytics http://dmg.org/pfa/
> > > >
> > > > Any ML effort needs to also include deep learning and the ability to
> > > > integrate various types of neural networks. Apache has several early
> > > > efforts in this regard (mxnet, singa).
> > > >
> > > > Thanks
> > > > Kam
> > > >
> > > > On Fri, Jan 6, 2017 at 7:07 AM, Vladisav Jelisavcic <
> > vladis...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > what is the current status on BEAM-478 and BEAM-303 (machine learning
> > > > > DSL and related functions)?
> > > > > I would like to start contributing in this direction.
> > > > >
> > > > > I found this design document:
> > > > > https://docs.google.com/document/d/17cRZk_
> > > yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > > > ECHb-xA/edit#heading=h.n51rhya8bv4f
> > > > >
> > > > > Are there any other docs/advances related to this?
> > > > >
> > > > >
> > > > > Best regards,
> > > > > Vladisav
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>


Re: Beam ML

2017-01-11 Thread Kam Kasravi
Thanks Andrew -

Since you're quite familiar with how Mahout backends (Flink, Spark, H2O)
bind and enable DRM and the API Mahout/Samsara exposes -
I think the end goal would be to surface a JAVA/Python API as well as
outline a declarative syntax that the various runners can adhere to
(perhaps via some kind of execution plan).

This would be added to the current design document within a 'low-level /
linear algebra' section.

What are your thoughts (and Mahout's thoughts) on SystemML's approach to
DRM and Declarative Machine Learning (DML) in general?
http://www.vldb.org/pvldb/vol9/p960-elgohary.pdf
Does Mahout plan on supporting SystemML compressed linear algebra (CLA)?

Would you be willing to help (along with others including Vladisav) in
providing a common ML design document for Beam?



On Wed, Jan 11, 2017 at 12:07 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> That's right; what other info do you think would be useful?
>
> On Tue, Jan 10, 2017 at 11:09 AM, Kam Kasravi 
> wrote:
>
> > Thanks Andrew
> > I think more information about the DRM operations and how persistence
> > would be done at the runner level. It looks like HDFS or spark caching is
> > currently being used?
> >
> > On Monday, January 9, 2017 6:04 PM, Andrew Musselman  >
> > wrote:
> >
> >
> >  Hello Beam Team,
> >
> > Thought you might be interested in the work we've been doing on Mahout,
> > such as the distributed linear algebra DSL/front-end that can use
> multiple
> > back-ends for compute (Spark, Flink, H2O now). See
> > https://mahout.apache.org/users/environment/out-of-core-reference.html
> for
> > an intro.
> >
> > We also are working on native CPU/GPU hybrid support and we're close to
> an
> > initial release. Let us know if you'd like to know more.
> >
> > Thanks and best of luck!
> >
> > Best
> > Andrew Musselman
> >
> > On 2017-01-09 12:00 (-0800), Kam Kasravi  wrote:
> > > Hi Vladisav
> > >
> > > I'm the author of the design document. An area we stalled on was
> creating
> > a
> > > common low level linear algebra library that would also include
> > > optimizations like MKL but across platforms and GPUs.
> > > Additionally there are efforts underway that provide a scoring API vs a
> > > training API.
> > >
> > >- PredictionIO http://predictionio.incubator.apache.org/ (now part
> of
> > >Salesforce)
> > >- MLeap https://github.com/combust/mleap
> > >- PFA - Portable Format for Analytics http://dmg.org/pfa/
> > >
> > > Any ML effort needs to also include deep learning and the ability to
> > > integrate various types of neural networks. Apache has several early
> > > efforts in this regard (mxnet, singa).
> > >
> > > Thanks
> > > Kam
> > >
> > > On Fri, Jan 6, 2017 at 7:07 AM, Vladisav Jelisavcic <
> vladis...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > what is the current status on BEAM-478 and BEAM-303 (machine learning
> > > > DSL and related functions)?
> > > > I would like to start contributing in this direction.
> > > >
> > > > I found this design document:
> > > > https://docs.google.com/document/d/17cRZk_
> > yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > > ECHb-xA/edit#heading=h.n51rhya8bv4f
> > > >
> > > > Are there any other docs/advances related to this?
> > > >
> > > >
> > > > Best regards,
> > > > Vladisav
> > > >
> > >
> >
> >
> >
> >
>


Re: Beam ML

2017-01-11 Thread Andrew Musselman
That's right; what other info do you think would be useful?

On Tue, Jan 10, 2017 at 11:09 AM, Kam Kasravi  wrote:

> Thanks Andrew
> I think more information about the DRM operations and how persistence
> would be done at the runner level. It looks like HDFS or spark caching is
> currently being used?
>
> On Monday, January 9, 2017 6:04 PM, Andrew Musselman 
> wrote:
>
>
>  Hello Beam Team,
>
> Thought you might be interested in the work we've been doing on Mahout,
> such as the distributed linear algebra DSL/front-end that can use multiple
> back-ends for compute (Spark, Flink, H2O now). See
> https://mahout.apache.org/users/environment/out-of-core-reference.html for
> an intro.
>
> We also are working on native CPU/GPU hybrid support and we're close to an
> initial release. Let us know if you'd like to know more.
>
> Thanks and best of luck!
>
> Best
> Andrew Musselman
>
> On 2017-01-09 12:00 (-0800), Kam Kasravi  wrote:
> > Hi Vladisav
> >
> > I'm the author of the design document. An area we stalled on was creating
> a
> > common low level linear algebra library that would also include
> > optimizations like MKL but across platforms and GPUs.
> > Additionally there are efforts underway that provide a scoring API vs a
> > training API.
> >
> >- PredictionIO http://predictionio.incubator.apache.org/ (now part of
> >Salesforce)
> >- MLeap https://github.com/combust/mleap
> >- PFA - Portable Format for Analytics http://dmg.org/pfa/
> >
> > Any ML effort needs to also include deep learning and the ability to
> > integrate various types of neural networks. Apache has several early
> > efforts in this regard (mxnet, singa).
> >
> > Thanks
> > Kam
> >
> > On Fri, Jan 6, 2017 at 7:07 AM, Vladisav Jelisavcic  >
> > wrote:
> >
> > > Hi everyone,
> > >
> > > what is the current status on BEAM-478 and BEAM-303 (machine learning
> > > DSL and related functions)?
> > > I would like to start contributing in this direction.
> > >
> > > I found this design document:
> > > https://docs.google.com/document/d/17cRZk_
> yqHm3C0fljivjN66MbLkeKS1yjo4PB
> > > ECHb-xA/edit#heading=h.n51rhya8bv4f
> > >
> > > Are there any other docs/advances related to this?
> > >
> > >
> > > Best regards,
> > > Vladisav
> > >
> >
>
>
>
>


Re: Jenkins build is still unstable: beam_PostCommit_Java_RunnableOnService_Dataflow #2011

2017-01-11 Thread Kenneth Knowles
We are taking a few minutes to consider whether to roll forward or roll back
the improvements to PAssert.

On Wed, Jan 11, 2017 at 11:11 AM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  RunnableOnService_Dataflow/2011/>
>
>


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Lukasz Cwik
Bundle processing order is indeterminate; wouldn't accessing user state of
a different window lead to indeterminate state information? This seems to
be even weaker than what you get from side inputs that are triggered
multiple times.
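
The concern can be illustrated with a small sketch in plain Python. This is
a toy model, not Beam code: the process function, the window labels and the
dictionary standing in for user state are all hypothetical. Each element
writes to the state of its own window and reads the state of the other
window, and the two bundles are processed in every possible order.

```python
from itertools import permutations

# Each bundle is (window, value). The values are arbitrary.
bundles = [("09:00", 1), ("10:00", 10)]


def process(order):
    """Process bundles in the given order and record what each element
    observes when it reads the other window's state."""
    state = {"09:00": 0, "10:00": 0}
    observed = {}
    for window, value in order:
        other = "10:00" if window == "09:00" else "09:00"
        observed[window] = state[other]  # cross-window read
        state[window] += value           # write to own window
    return observed


# Collect the distinct observations over all processing orders.
results = {tuple(sorted(process(p).items())) for p in permutations(bundles)}
print(results)
```

With only two bundles the set already contains two different outcomes, so
what an element sees in another window depends on processing order alone.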

On Wed, Jan 11, 2017 at 10:01 AM, Tyler Akidau  wrote:

> On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw
> 
> wrote:
>
> > On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik 
> > wrote:
> > > I was under the impression that user state was scoped to a ParDo and
> was
> > > not shareable across multiple ParDos. Wouldn't rewindowing require the
> > > usage of multiple ParDos and hence not allow for state to be shared?
> >
> > No, you'd do something like
> >
> > > pc.apply(WindowInto(grouping_windowing))
> > >   .apply(GroupByKey())
> > >   .apply(WindowInto(state_windowing))
> > >   .apply(ParDo(state_using_dofn))
> >
> > You could reify the window after GroupByKey if you need to inspect it.
> >
> > However, I'm liking the idea of being able to associate different
> > WindowFns with particular state tags similar to side inputs (though
> > the default would be the windowing of the main input).
> >
>
> Can you expand upon what you mean by this? I'm not sure I understand what
> you're getting at yet.
>
> -Tyler
>
>
> >
> > > On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
> > > rober...@google.com.invalid> wrote:
> > >
> > >> Possibly this could be handled by rewindowing and the current
> > semantics. If
> > >> not, maybe treat state like a side input with its own windowing and
> > window
> > >> mapping fn.
> > >>
> > >> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)" 
> wrote:
> > >>
> > >> > Ben Chambers created BEAM-1261:
> > >> > --
> > >> >
> > >> >  Summary: State API should allow state to be managed in
> > >> > different windows
> > >> >  Key: BEAM-1261
> > >> >  URL: https://issues.apache.org/
> jira/browse/BEAM-1261
> > >> >  Project: Beam
> > >> >   Issue Type: Bug
> > >> >   Components: beam-model, sdk-java-core
> > >> > Reporter: Ben Chambers
> > >> > Assignee: Kenneth Knowles
> > >> >
> > >> >
> > >> > For example, even if the elements are being processed in fixed
> > windows of
> > >> > an hour, it may be desirable for the state to "roll over" between
> > windows
> > >> > (or be available to all windows).
> > >> >
> > >> > It will also be necessary to figure out when this state should be
> > deleted
> > >> > (TTL? maximum retention?)
> > >> >
> > >> > Another problem is how to deal with out of order data. If data comes
> > in
> > >> > from the 10:00 AM window, should its state changes be visible to the
> > data
> > >> > in the 9:00 AM window?
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > This message was sent by Atlassian JIRA
> > >> > (v6.3.4#6332)
> > >> >
> > >>
> >
>


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Tyler Akidau
On Wed, Jan 11, 2017 at 9:43 AM Robert Bradshaw 
wrote:

> On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik 
> wrote:
> > I was under the impression that user state was scoped to a ParDo and was
> > not shareable across multiple ParDos. Wouldn't rewindowing require the
> > usage of multiple ParDos and hence not allow for state to be shared?
>
> No, you'd do something like
>
> pc.apply(WindowInto(grouping_windowing))
>   .apply(GroupByKey())
>   .apply(WindowInto(state_windowing))
>   .apply(ParDo(state_using_dofn))
>
> You could reify the window after GroupByKey if you need to inspect it.
>
> However, I'm liking the idea of being able to associate different
> WindowFns with particular state tags similar to side inputs (though
> the default would be the windowing of the main input).
>

Can you expand upon what you mean by this? I'm not sure I understand what
you're getting at yet.

-Tyler


>
> > On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> >> Possibly this could be handled by rewindowing and the current
> semantics. If
> >> not, maybe treat state like a side input with its own windowing and
> window
> >> mapping fn.
> >>
> >> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)"  wrote:
> >>
> >> > Ben Chambers created BEAM-1261:
> >> > --
> >> >
> >> >  Summary: State API should allow state to be managed in
> >> > different windows
> >> >  Key: BEAM-1261
> >> >  URL: https://issues.apache.org/jira/browse/BEAM-1261
> >> >  Project: Beam
> >> >   Issue Type: Bug
> >> >   Components: beam-model, sdk-java-core
> >> > Reporter: Ben Chambers
> >> > Assignee: Kenneth Knowles
> >> >
> >> >
> >> > For example, even if the elements are being processed in fixed
> windows of
> >> > an hour, it may be desirable for the state to "roll over" between
> windows
> >> > (or be available to all windows).
> >> >
> >> > It will also be necessary to figure out when this state should be
> deleted
> >> > (TTL? maximum retention?)
> >> >
> >> > Another problem is how to deal with out of order data. If data comes
> in
> >> > from the 10:00 AM window, should its state changes be visible to the
> data
> >> > in the 9:00 AM window?
> >> >
> >> >
> >> >
> >> > --
> >> > This message was sent by Atlassian JIRA
> >> > (v6.3.4#6332)
> >> >
> >>
>


Re: Graduation!

2017-01-11 Thread Thomas Groh
This is sweet. Congratulations to everyone involved in making this happen,
and I'm excited to see where we go from here.

On Wed, Jan 11, 2017 at 5:09 AM, amarouni  wrote:

> Congratulations to everyone on this important milestone.
>
>
> On 11/01/2017 11:52, Neelesh Salian wrote:
> > Congratulations to the community. :)
> >
> > On Jan 11, 2017 3:37 PM, "Stephan Ewen"  wrote:
> >
> >> Very nice :-)
> >>
> >> Good to see this happening!
> >>
> >> On Tue, Jan 10, 2017 at 11:58 PM, Tyler Akidau
>  >> wrote:
> >>
> >>> Congrats and thanks to everyone who helped make this happen! :-D
> >>>
> >>> -Tyler
> >>>
> >>> On Tue, Jan 10, 2017 at 2:20 PM Kenneth Knowles  >
> >>> wrote:
> >>>
> >>> This is really exciting. It is such a privilege to be involved with
> this
> >>> project & community.
> >>>
> >>> Kenn
> >>>
> >>> On Tue, Jan 10, 2017 at 1:42 PM, JongYoon Lim 
> >>> wrote:
> >>>
>  Congrats to everyone involved !
> 
>  Best Regards,
>  JongYoon
> 
>  2017-01-11 7:18 GMT+13:00 Jean-Baptiste Onofré :
> 
> > Congrats to the team !!
> >
> > I'm proud and glad to humbly be part of it.
> >
> > Regards
> > JB
> >
> > On Jan 10, 2017, 19:09, at 19:09, Raghu Angadi
>  
> > wrote:
> >> Congrats to everyone involved.
> >>
> >> It has been a great experience following the rapid progress of Beam
> >>> and
> >> hard work of many. Well deserved promotion.
> >>
> >>
> >> On Tue, Jan 10, 2017 at 3:07 AM, Davor Bonaci 
> >>> wrote:
> >>> The ASF has publicly announced our graduation!
> >>>
> >>>
> >>> https://blogs.apache.org/foundation/entry/the-apache-
> >>> software-foundation-announces
> >>>
> >>> https://beam.apache.org/blog/2017/01/10/beam-graduates.html
> >>>
> >>> Graduation is a recognition of the community that we have built
> >> together. I
> >>> am humbled to be part of this group and this project, and so
> >> excited
> >> for
> >>> what we can accomplish together going forward.
> >>>
> >>> Davor
> >>>
>
>


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Robert Bradshaw
On Wed, Jan 11, 2017 at 8:59 AM, Lukasz Cwik  wrote:
> I was under the impression that user state was scoped to a ParDo and was
> not shareable across multiple ParDos. Wouldn't rewindowing require the
> usage of multiple ParDos and hence not allow for state to be shared?

No, you'd do something like

pc.apply(WindowInto(grouping_windowing))
  .apply(GroupByKey())
  .apply(WindowInto(state_windowing))
  .apply(ParDo(state_using_dofn))

You could reify the window after GroupByKey if you need to inspect it.

However, I'm liking the idea of being able to associate different
WindowFns with particular state tags similar to side inputs (though
the default would be the windowing of the main input).
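
A rough simulation of the rewindowing pipeline above, in plain Python rather
than any Beam SDK. The helper names (window_start, group_by_key_and_window,
stateful_sum) are hypothetical, hourly grouping windows and daily state
windows are assumptions chosen for the example, and groups are processed in
sorted order purely to make the output reproducible, which a real runner
does not guarantee.

```python
from collections import defaultdict

HOUR = 3600
DAY = 24 * HOUR


def window_start(timestamp, size):
    """Start of the fixed window of the given size containing timestamp."""
    return timestamp - (timestamp % size)


def group_by_key_and_window(elements, size):
    """Group (key, value, timestamp) elements per (key, grouping window)."""
    groups = defaultdict(list)
    for key, value, ts in elements:
        groups[(key, window_start(ts, size))].append(value)
    return groups


def stateful_sum(groups, grouping_size, state_size):
    """Re-window each group into a state window and keep a running sum
    per (key, state window), emitting the sum after each group."""
    state = defaultdict(int)
    out = []
    for (key, start), values in sorted(groups.items()):
        group_ts = start + grouping_size - 1  # last instant of the group
        state_key = (key, window_start(group_ts, state_size))
        state[state_key] += sum(values)
        out.append((key, start, state[state_key]))
    return out


events = [("a", 1, 10), ("a", 2, HOUR + 5), ("a", 4, DAY + 1)]
result = stateful_sum(group_by_key_and_window(events, HOUR), HOUR, DAY)
print(result)
```

The first two hourly groups share one daily state window, so the running sum
rolls over between them; the third group starts a fresh state window.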

> On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
>> Possibly this could be handled by rewindowing and the current semantics. If
>> not, maybe treat state like a side input with its own windowing and window
>> mapping fn.
>>
>> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)"  wrote:
>>
>> > Ben Chambers created BEAM-1261:
>> > --
>> >
>> >  Summary: State API should allow state to be managed in
>> > different windows
>> >  Key: BEAM-1261
>> >  URL: https://issues.apache.org/jira/browse/BEAM-1261
>> >  Project: Beam
>> >   Issue Type: Bug
>> >   Components: beam-model, sdk-java-core
>> > Reporter: Ben Chambers
>> > Assignee: Kenneth Knowles
>> >
>> >
>> > For example, even if the elements are being processed in fixed windows of
>> > an hour, it may be desirable for the state to "roll over" between windows
>> > (or be available to all windows).
>> >
>> > It will also be necessary to figure out when this state should be deleted
>> > (TTL? maximum retention?)
>> >
>> > Another problem is how to deal with out of order data. If data comes in
>> > from the 10:00 AM window, should its state changes be visible to the data
>> > in the 9:00 AM window?
>> >
>> >
>> >
>> > --
>> > This message was sent by Atlassian JIRA
>> > (v6.3.4#6332)
>> >
>>


Re: [jira] [Created] (BEAM-1261) State API should allow state to be managed in different windows

2017-01-11 Thread Lukasz Cwik
I was under the impression that user state was scoped to a ParDo and was
not shareable across multiple ParDos. Wouldn't rewindowing require the
usage of multiple ParDos and hence not allow for state to be shared?

On Tue, Jan 10, 2017 at 10:51 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> Possibly this could be handled by rewindowing and the current semantics. If
> not, maybe treat state like a side input with its own windowing and window
> mapping fn.
>
> On Jan 10, 2017 3:14 PM, "Ben Chambers (JIRA)"  wrote:
>
> > Ben Chambers created BEAM-1261:
> > --
> >
> >  Summary: State API should allow state to be managed in
> > different windows
> >  Key: BEAM-1261
> >  URL: https://issues.apache.org/jira/browse/BEAM-1261
> >  Project: Beam
> >   Issue Type: Bug
> >   Components: beam-model, sdk-java-core
> > Reporter: Ben Chambers
> > Assignee: Kenneth Knowles
> >
> >
> > For example, even if the elements are being processed in fixed windows of
> > an hour, it may be desirable for the state to "roll over" between windows
> > (or be available to all windows).
> >
> > It will also be necessary to figure out when this state should be deleted
> > (TTL? maximum retention?)
> >
> > Another problem is how to deal with out of order data. If data comes in
> > from the 10:00 AM window, should its state changes be visible to the data
> > in the 9:00 AM window?
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
> >
>


Re: splitIntoBundles vs. generateInitialSplits

2017-01-11 Thread Jean-Baptiste Onofré

Hi Eugene and Stas,

I'm just back from a couple of days off and jumping into this discussion.

I agree with Stas: it's worth creating a Jira about that. The only
"semantic" difference is unbounded vs bounded source, but the behavior
is the same.


Regards
JB

On 01/11/2017 04:26 PM, Stas Levin wrote:

Eugene, that makes a lot of sense to me.

Do you think it's worth filing a Jira ticket?

On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
 wrote:

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would have read the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".
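
The underlying concept described above can be sketched in plain Python. This
is not the Beam source API: OffsetRangeSource and its split method are
hypothetical names, and a source over a range of integers is assumed purely
for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRangeSource:
    """Hypothetical source that reads the integers in [start, end)."""
    start: int
    end: int

    def read(self) -> List[int]:
        return list(range(self.start, self.end))

    def split(self, desired_size: int) -> List["OffsetRangeSource"]:
        """Split into contiguous sub-sources of at most desired_size records."""
        if desired_size <= 0:
            raise ValueError("desired_size must be positive")
        return [
            OffsetRangeSource(s, min(s + desired_size, self.end))
            for s in range(self.start, self.end, desired_size)
        ]


source = OffsetRangeSource(0, 10)
subs = source.split(4)
print(subs)

# Reading every sub-source yields the same data as the original source.
combined = [x for sub in subs for x in sub.read()]
assert combined == source.read()
```

Because split returns ordinary sources, nothing in this shape ties it to job
startup: a sub-source that is still too large can be split again later.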

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin  wrote:


Definitely seems like the formatting got lost in translation, sorry about
that :)

I guess both cases (methods) create splits, which are essentially a list of
bounded/unbounded source instances, each responsible for reading certain
segments (physical or otherwise) of the data.

On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk 
wrote:


hi!

I think your strikethrough got lost due to this being a text-only email
list. To make sure, I think you're asking the following:
"would it be reasonable to think of splitIntoBundles as generateSplits?"
(ie, you strikethrough'd Initial)

They are very similar and I definitely also think of them as occupying the
same niche. I'll let someone else who was around for naming discuss whether
it was intentional or not. Conceptually, the way that bounded vs streaming
are handled means that they are doing slightly different things: a bounded
source is really kind of creating physical chunks of the data, whereas the
streaming source is creating conceptual divisions of the data that will be
used later. I'm not sure that's worth the confusion caused by the
differences.

One thing to clarify - splitIntoBundles does have an "Initial" aspect to
it. I don't believe there is a publicly defined/written down order the
Sources & Reader methods are called in, but a runner trying to get
efficiency would be able to use splitIntoBundles during job startup to be
able to split up the work before creating readers rather than after
creating readers and waiting to use splitAtFraction.
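
For contrast with up-front splitting, here is a toy sketch of splitting
after a reader has already started, in plain Python. It is only loosely
modelled on the idea behind splitAtFraction: the OffsetRangeReader class, its
method names and its rule that the split point must lie strictly ahead of
the next unread offset are assumptions, not the Beam reader contract.

```python
class OffsetRangeReader:
    """Hypothetical reader over [start, end) that supports dynamic splitting."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.position = start  # next offset to read

    def read_next(self):
        """Return the next offset, or None when the range is exhausted."""
        if self.position >= self.end:
            return None
        value = self.position
        self.position += 1
        return value

    def split_at_fraction(self, fraction):
        """Give away [split, end) if the split point is still ahead of the
        reader. Returns the residual range, or None if the split is refused."""
        split = self.start + int((self.end - self.start) * fraction)
        if split <= self.position or split >= self.end:
            return None  # too late, or nothing left to give away
        residual = (split, self.end)
        self.end = split  # this reader now stops at the split point
        return residual


reader = OffsetRangeReader(0, 100)
for _ in range(10):
    reader.read_next()

first = reader.split_at_fraction(0.5)    # split point 50 is still unread
second = reader.split_at_fraction(0.05)  # split point is behind the reader
print(first, second)
```

The second request is refused because the reader has already passed that
point, which is the cost of waiting: splitting before any reader exists
never runs into this.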

S

On Sun, Jan 8, 2017 at 6:06 AM Stas Levin  wrote:


Hi,

A short terminology question regarding "bundle", and
particularly splitIntoBundles vs. generateInitialSplits.

In *BoundedSource* we have:
List<? extends BoundedSource<T>> *splitIntoBundles*(...)

In *UnboundedSource* we have:
List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
*generateInitialSplits*(...)

I was wondering if the names were intentionally made different, i.e. "into
bundles" vs "into splits"?
In a way these two methods carry out a very similar task, would it be
reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
(strikethrough due to "initial" not being applicable in the case of bounded
sources)

Regards,
Stas









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: splitIntoBundles vs. generateInitialSplits

2017-01-11 Thread Stas Levin
Eugene, that makes a lot of sense to me.

Do you think it's worth filing a Jira ticket?

On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
 wrote:

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would have read the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin  wrote:

> Definitely seems like the formatting got lost in translation, sorry about
> that :)
>
> I guess both cases (methods) create splits, which are essentially a list
of
> bounded/unbounded source instances, each responsible for reading certain
> segments (physical or otherwise) of the data.
>
> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk 
> wrote:
>
> > hi!
> >
> > I think your strikethrough got lost due to this being a text-only email
> > list. To make sure, I think you're asking the following:
> > " would it be reasonable to think of splitIntoBundles as generateSplits?
> "
> > (ie, you strikethrough'd Initial)
> >
> > They are very similar and I definitely also think of them as occupying the
> > same niche. I'll let someone else who was around for naming discuss whether
> > it was intentional or not. Conceptually, the way that bounded vs streaming
> > are handled means that they are doing slightly different things: a bounded
> > source is really kind of creating physical chunks of the data, whereas the
> > streaming source is creating conceptual divisions of the data that will be
> > used later. I'm not sure that's worth the confusion caused by the
> > differences.
> >
> > One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> > it. I don't believe there is a publicly defined/written down order the
> > Sources & Reader methods are called in, but a runner trying to get
> > efficiency would be able to use splitIntoBundles during job startup to be
> > able to split up the work before creating readers rather than after
> > creating readers and waiting to use splitAtFraction.
> >
> > S
> >
> > On Sun, Jan 8, 2017 at 6:06 AM Stas Levin  wrote:
> >
> > > Hi,
> > >
> > > A short terminology question regarding "bundle", and
> > > particularly splitIntoBundles vs. generateInitialSplits.
> > >
> > > In *BoundedSource* we have:
> > > List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> > >
> > > In *UnboundedSource* we have:
> > > List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> > > *generateInitialSplits*(...)
> > >
> > > I was wondering if the names were intentionally made different, i.e. "into
> > > bundles" vs "into splits"?
> > > In a way these two methods carry out a very similar task, would it be
> > > reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
> > > (strikethrough due to "initial" not being applicable in the case of bounded
> > > sources)
> > >
> > > Regards,
> > > Stas
> > >
> >
>


Re: Graduation!

2017-01-11 Thread amarouni
Congratulations to everyone on this important milestone.


On 11/01/2017 11:52, Neelesh Salian wrote:
> Congratulations to the community. :)
>
> On Jan 11, 2017 3:37 PM, "Stephan Ewen"  wrote:
>
>> Very nice :-)
>>
>> Good to see this happening!
>>
>> On Tue, Jan 10, 2017 at 11:58 PM, Tyler Akidau wrote:
>>
>>> Congrats and thanks to everyone who helped make this happen! :-D
>>>
>>> -Tyler
>>>
>>> On Tue, Jan 10, 2017 at 2:20 PM Kenneth Knowles 
>>> wrote:
>>>
>>> This is really exciting. It is such a privilege to be involved with this
>>> project & community.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 10, 2017 at 1:42 PM, JongYoon Lim 
>>> wrote:
>>>
>>>> Congrats to everyone involved !
>>>>
>>>> Best Regards,
>>>> JongYoon
>>>>
>>>> 2017-01-11 7:18 GMT+13:00 Jean-Baptiste Onofré :
>>>>
>>>>> Congrats to the team !!
>>>>>
>>>>> I'm proud and glad to humbly be part of it.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Jan 10, 2017, 19:09, at 19:09, Raghu Angadi wrote:
>>>>>> Congrats to everyone involved.
>>>>>>
>>>>>> It has been a great experience following the rapid progress of Beam and
>>>>>> hard work of many. Well deserved promotion.
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 10, 2017 at 3:07 AM, Davor Bonaci wrote:
>>>>>>> The ASF has publicly announced our graduation!
>>>>>>>
>>>>>>>
>>>>>>> https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces
>>>>>>>
>>>>>>> https://beam.apache.org/blog/2017/01/10/beam-graduates.html
>>>>>>>
>>>>>>> Graduation is a recognition of the community that we have built together. I
>>>>>>> am humbled to be part of this group and this project, and so excited for
>>>>>>> what we can accomplish together going forward.
>>>>>>>
>>>>>>> Davor
>>>>>>>



Re: Graduation!

2017-01-11 Thread Neelesh Salian
Congratulations to the community. :)

On Jan 11, 2017 3:37 PM, "Stephan Ewen"  wrote:

> Very nice :-)
>
> Good to see this happening!
>
> On Tue, Jan 10, 2017 at 11:58 PM, Tyler Akidau wrote:
>
> > Congrats and thanks to everyone who helped make this happen! :-D
> >
> > -Tyler
> >
> > On Tue, Jan 10, 2017 at 2:20 PM Kenneth Knowles 
> > wrote:
> >
> > This is really exciting. It is such a privilege to be involved with this
> > project & community.
> >
> > Kenn
> >
> > On Tue, Jan 10, 2017 at 1:42 PM, JongYoon Lim 
> > wrote:
> >
> > > Congrats to everyone involved !
> > >
> > > Best Regards,
> > > JongYoon
> > >
> > > 2017-01-11 7:18 GMT+13:00 Jean-Baptiste Onofré :
> > >
> > > > Congrats to the team !!
> > > >
> > > > I'm proud and glad to humbly be part of it.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On Jan 10, 2017, 19:09, at 19:09, Raghu Angadi wrote:
> > > > >Congrats to everyone involved.
> > > > >
> > > > >It has been a great experience following the rapid progress of Beam and
> > > > >hard work of many. Well deserved promotion.
> > > > >
> > > > >
> > > > >On Tue, Jan 10, 2017 at 3:07 AM, Davor Bonaci wrote:
> > > > >
> > > > >> The ASF has publicly announced our graduation!
> > > > >>
> > > > >>
> > > > >> https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces
> > > > >>
> > > > >> https://beam.apache.org/blog/2017/01/10/beam-graduates.html
> > > > >>
> > > > >> Graduation is a recognition of the community that we have built together. I
> > > > >> am humbled to be part of this group and this project, and so excited for
> > > > >> what we can accomplish together going forward.
> > > > >>
> > > > >> Davor
> > > > >>
> > > >
> > >
> >
>


Re: Graduation!

2017-01-11 Thread Stephan Ewen
Very nice :-)

Good to see this happening!

On Tue, Jan 10, 2017 at 11:58 PM, Tyler Akidau 
wrote:

> Congrats and thanks to everyone who helped make this happen! :-D
>
> -Tyler
>
> On Tue, Jan 10, 2017 at 2:20 PM Kenneth Knowles 
> wrote:
>
> This is really exciting. It is such a privilege to be involved with this
> project & community.
>
> Kenn
>
> On Tue, Jan 10, 2017 at 1:42 PM, JongYoon Lim 
> wrote:
>
> > Congrats to everyone involved !
> >
> > Best Regards,
> > JongYoon
> >
> > 2017-01-11 7:18 GMT+13:00 Jean-Baptiste Onofré :
> >
> > > Congrats to the team !!
> > >
> > > I'm proud and glad to humbly be part of it.
> > >
> > > Regards
> > > JB
> > >
> > > On Jan 10, 2017, 19:09, at 19:09, Raghu Angadi wrote:
> > > >Congrats to everyone involved.
> > > >
> > > >It has been a great experience following the rapid progress of Beam and
> > > >hard work of many. Well deserved promotion.
> > > >
> > > >
> > > >On Tue, Jan 10, 2017 at 3:07 AM, Davor Bonaci wrote:
> > > >
> > > >> The ASF has publicly announced our graduation!
> > > >>
> > > >>
> > > >> https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces
> > > >>
> > > >> https://beam.apache.org/blog/2017/01/10/beam-graduates.html
> > > >>
> > > >> Graduation is a recognition of the community that we have built together. I
> > > >> am humbled to be part of this group and this project, and so excited for
> > > >> what we can accomplish together going forward.
> > > >>
> > > >> Davor
> > > >>
> > >
> >
>