Re: [DISCUSS] Beam data plane serialization tech

2016-06-17 Thread Lukasz Cwik
In the Runner API proposal doc, there are 10+ different types with several
fields each.
Is it important to have a code generator for the schema?
* simplify the SDK development process
* reduce errors due to differences in custom implementations

I'm not familiar with tool(s) which can take a JSON schema (e.g.
http://json-schema.org/) and generate code in multiple languages. Anyone?


For the Data Plane API, a Runner and SDK must be able to encode elements
such as WindowedValue and KVs in such a way that both sides can interpret
them. For example, a Runner will be required to implement GBK, so it must be
able to read the windowing information from the "bytes" transmitted.
Additionally, it will need to be able to split KV records apart and
recreate KV<K, Iterable<V>> for the SDK. Since Coders are the dominant way
of encoding things, the Data Plane API will transmit "bytes" with the
element boundaries encoded in some way. Aljoscha, I agree with you that a
good choice for transmitting bytes between VMs/languages is very important.
Even though we are still transmitting mostly "bytes", error handling &
connection handling are still important.
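
To make "element boundaries encoded in some way" concrete, here is a minimal
sketch (my own illustration, not a settled design; LengthPrefixedFraming is a
hypothetical helper, not Beam code) of length-prefixed framing: each element's
coder-encoded bytes are preceded by their size, so the receiving side can split
the stream without understanding the element encoding itself.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.List;

  public class LengthPrefixedFraming {
    // Frames already coder-encoded elements as [length][bytes][length][bytes]...
    public static byte[] frame(List<byte[]> encodedElements) throws IOException {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      for (byte[] element : encodedElements) {
        out.writeInt(element.length); // element boundary: size prefix
        out.write(element);           // opaque coder-encoded payload
      }
      out.flush();
      return bytes.toByteArray();
    }
  }
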
For example, if we were to use gRPC and proto3 with a bidirectional stream
based API, we would get:
* the Runner and SDK can push data in both directions (stream from/to GBK,
  stream from/to state)
* error handling
* code generation of client libraries
* HTTP/2

As for the encoding, any SDK can choose any serialization it wants, such as
Kryo, but interoperability with other languages would then require others to
implement parts of the Kryo serialization spec to be able to interpret the
"bytes". Thus certain types like KV & WindowedValue should be
encoded in a way which allows for this interoperability.
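
As a hedged sketch of what I mean (assuming the Java SDK's KvCoder,
StringUtf8Coder, VarLongCoder, KV and CoderUtils helpers; the exact wire format
would still need to be pinned down in the spec, and KvEncodingExample is just an
illustrative class name), KvCoder composes the key and value coders, so another
language only has to implement those component encodings to split a KV apart:

  import org.apache.beam.sdk.coders.KvCoder;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.coders.VarLongCoder;
  import org.apache.beam.sdk.util.CoderUtils;
  import org.apache.beam.sdk.values.KV;

  public class KvEncodingExample {
    public static void main(String[] args) throws Exception {
      // Key and value encodings are composed back to back; no opaque framing is added.
      KvCoder<String, Long> coder = KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of());
      byte[] encoded = CoderUtils.encodeToByteArray(coder, KV.of("user-1", 42L));
      KV<String, Long> decoded = CoderUtils.decodeFromByteArray(coder, encoded);
      System.out.println(decoded.getKey() + " -> " + decoded.getValue());
    }
  }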






On Fri, Jun 17, 2016 at 3:20 AM, Amit Sela  wrote:

> +1 on Aljoscha's comment, not sure where the benefit is in having a
> "schematic" serialization.
>
> I know that Spark, and I think Flink as well, uses Kryo
> for serialization (to be accurate, it's Chill for Spark), and I
> found it very impressive even compared to "manual" serializations;
> i.e., it seems to outperform Spark's "native" Encoders (1.6+) for
> primitives.
> In addition it clearly supports Java and Scala, and there are 3rd party
> libraries for Clojure and Objective-C.
>
> I guess my bottom-line here agrees with Kenneth - performance and
> interoperability - but I'm just not sure if schema based serializers are
> *always* the fastest.
>
> As for pipeline serialization, since performance is not the main issue, and
> I think usability would be very important, I say +1 for JSON.
>
> For anyone who has spent some time benchmarking serialization libraries, now
> is the time to speak up ;)
>
> Thanks,
> Amit
>
> On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek 
> wrote:
>
> > Hi,
> > am I correct in assuming that the transmitted envelopes would mostly
> > contain coder-serialized values? If so, wouldn't the header of an
> envelope
> > just be the number of contained bytes and number of values? I'm probably
> > missing something but with these assumptions I don't see the benefit of
> > using something like Avro/Thrift/Protobuf for serializing the main-input
> > value envelopes. We would just need a system that can send byte data
> really
> > fast between languages/VMs.
> >
> > By the way, another interesting question (at least for me) is how other
> > data, such as side-inputs, is going to arrive at the DoFn if we want to
> > support a general interface for different languages.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles 
> > wrote:
> >
> > > (Apologies for the formatting)
> > >
> > > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles 
> > wrote:
> > >
> > > > Hello everyone!
> > > >
> > > > We are busily working on a Runner API (for building and transmitting
> > > > pipelines)
> > > > and a Fn API (for invoking user-defined functions found within
> > pipelines)
> > > > as
> > > > outlined in the Beam technical vision [1]. Both of these require a
> > > > language-independent serialization technology for interoperability
> > > between
> > > > SDKs
> > > > and runners.
> > > >
> > > > The Fn API includes a high-bandwidth data plane where bundles are
> > > > transmitted
> > > > via some serialization/RPC envelope (inside the envelope, the stream
> of
> > > > elements is encoded with a coder) to transfer bundles between the
> > runner
> > > > and
> > > > the SDK, so performance is extremely important. There are many
> choices
> > > for
> > > > high
> > > > performance serialization, and we would like to start the
> conversation
> > > > about
> > > > what serialization technology is best for Beam.
> > > >
> > > > The goal of this discussion is to arrive at consensus on the
> question:
> > > > What
> > > > serialization technology should we use for the data plane envelope of
> > the
> > > > Fn
> > > > API?
> > > >
> > > > To faci

Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Groh
Generally, the above code snippet will work, producing (after trigger
firing) an output Iterable containing all of the input elements. It may
be worth noting that timers (and TimerInternals) are also per-key, so that
interface must also be updated per element.

By specifying the ReduceFn of the ReduceFnRunner, you can change how the
ReduceFnRunner adds and merges state. The combining ReduceFn is suitable
for use with upstream CombineFns, while buffering is suitable for general
use.
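
For the counting case specifically, a rough sketch at the level of the
user-facing SDK (rather than the runner internals) would lean on a CombineFn so
that no intermediate collection needs to be materialized. Assuming input is a
PCollection<KV<String, String>> (that name and the one-minute window are just
illustrative), something like:

  PCollection<KV<String, Long>> counts = input
      .apply(Window.<KV<String, String>>into(
          FixedWindows.of(Duration.standardMinutes(1))))
      .apply(Count.<String, String>perKey());

Count.perKey is backed by a CombineFn, so a runner using the combining ReduceFn
can fold elements into the accumulator as they arrive.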

On Fri, Jun 17, 2016 at 9:52 AM, Thomas Weise 
wrote:

> The source for my windowed groupByKey experiment is here:
>
>
> https://github.com/tweise/incubator-beam/blob/master/runners/apex/src/main/java/org/apache/beam/runners/apex/translators/functions/ApexGroupByKeyOperator.java
>
> The result is Iterable. In cases such as counting, what is the
> recommended way to perform the incremental aggregation, without building an
> intermediate collection?
>
> Thomas
>
> On Fri, Jun 17, 2016 at 8:27 AM, Thomas Weise 
> wrote:
>
> > Hi,
> >
> > I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to
> accumulate
> > the windowed state with the elements arriving one by one (stream).
> >
> > Once the window is complete, I would like to emit an Iterable or
> > another form of aggregation of the elements. Is the following supposed to
> > lead to merging of current element with previously received elements for
> > the same window:
> >
> > KeyedWorkItem kwi = KeyedWorkItems.elementsWorkItem(
> > kv.getKey(),
> > Collections.singletonList(updatedWindowedValue));
> >
> > context.setElement(kwi, getStateInternalsForKey(kwi.key()));
> > fn.processElement(context);
> >
> > The input here are always single elements.
> >
> > Thanks,
> > Thomas
> >
> >
> >
>


Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Weise
The source for my windowed groupByKey experiment is here:

https://github.com/tweise/incubator-beam/blob/master/runners/apex/src/main/java/org/apache/beam/runners/apex/translators/functions/ApexGroupByKeyOperator.java

The result is Iterable. In cases such as counting, what is the
recommended way to perform the incremental aggregation, without building an
intermediate collection?

Thomas

On Fri, Jun 17, 2016 at 8:27 AM, Thomas Weise 
wrote:

> Hi,
>
> I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to accumulate
> the windowed state with the elements arriving one by one (stream).
>
> Once the window is complete, I would like to emit an Iterable or
> another form of aggregation of the elements. Is the following supposed to
> lead to merging of current element with previously received elements for
> the same window:
>
> KeyedWorkItem kwi = KeyedWorkItems.elementsWorkItem(
> kv.getKey(),
> Collections.singletonList(updatedWindowedValue));
>
> context.setElement(kwi, getStateInternalsForKey(kwi.key()));
> fn.processElement(context);
>
> The input here are always single elements.
>
> Thanks,
> Thomas
>
>
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Frances Perry
Release notes for each release are being tracked in JIRA. For example:
https://issues.apache.org/jira/browse/BEAM/fixforversion/12335764/
Davor is planning to send a follow-up email about how we use this process.
And as we redo the website layout, we should figure out how to surface this
information cleanly there too.

As for breaking changes from the Dataflow Java SDK 1.x, we'll make sure to
publish a very clear guide when we start encouraging people to transition.
There will be a few small things in this space (package refactorings,
renamings, how often DoFn is deserialized) -- this is our chance to fix
some things we wish we'd done differently ;-)


On Fri, Jun 17, 2016 at 2:18 AM, Jean-Baptiste Onofré 
wrote:

>
>
> So it will go in the RELEASE NOTES for the next release: fair enough. I was
> just a bit surprised as I missed this JIRA.
> Thanks!
> Regards
> JB
>
>
> Sent from my Samsung device
>
>  Original message 
> From: Aljoscha Krettek 
> Date: 17/06/2016  10:58  (GMT+01:00)
> To: dev@beam.incubator.apache.org
> Subject: Re: [NOTICE] Change on Filter
>
> There has been an issue about this for a while now:
> https://issues.apache.org/jira/browse/BEAM-234
>
> On Fri, 17 Jun 2016 at 09:55 Jean-Baptiste Onofré  wrote:
>
> > Hi Ismaël,
> >
> > I didn't talk a change between Dataflow SDK and Beam, I'm talking about
> > a change between two Beam SNAPSHOTs ;)
> >
> > For the naming of the DirectRunner, I saw it also, and we should align
> > the runners naming (we have FlinkPipelineRunner and
> > SparkPipelineRunner). I sent an e-mail to Davor and Frances to discuss
> > about that.
> >
> > Regards
> > JB
> >
> > On 06/17/2016 09:42 AM, Ismaël Mejía wrote:
> > > Do we have a list of breaking changes (from the Google Dataflow SDK to
> > > Beam), because this is going to be important considering other recent
> > > breaking changes, for example these two that I found yesterday too:
> > >
> > > DirectPipelineRunner -> DirectRunner
> > > DoFnTester.processBatch -> DoFnTester.processBundle (?)
> > >
> > > Ismael.
> > >
> > >
> > >
> > >
> > > On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré
> > > wrote:
> > >
> > >> Hi guys,
> > >>
> > >> I tested the latest Beam SNAPSHOT this morning and code which was
> > >> working yesterday is now broken with the latest changes.
> > >>
> > >> I'm using a filter by predicate:
> > >>
> > >>  .apply("Filtering", Filter.byPredicate(new
> > >> SerializableFunction<String, Boolean>() {
> > >>  public Boolean apply(String input) {
> > >>  ...
> > >>  })).
> > >>
> > >> The filter method has been renamed from byPredicate() to by().
> > >>
> > >> Just to let others know that it can impact their pipelines.
> > >>
> > >> Regards
> > >> JB
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Weise
Hi,

I'm looking into using the GroupAlsoByWindowViaWindowSetDoFn to accumulate
the windowed state with the elements arriving one by one (stream).

Once the window is complete, I would like to emit an Iterable or another
form of aggregation of the elements. Is the following supposed to lead to
merging of the current element with previously received elements for the same
window:

KeyedWorkItem kwi = KeyedWorkItems.elementsWorkItem(
kv.getKey(),
Collections.singletonList(updatedWindowedValue));

context.setElement(kwi, getStateInternalsForKey(kwi.key()));
fn.processElement(context);

The input here are always single elements.

Thanks,
Thomas


Re: [DISCUSS] Beam data plane serialization tech

2016-06-17 Thread Amit Sela
+1 on Aljoscha's comment, not sure where the benefit is in having a
"schematic" serialization.

I know that Spark, and I think Flink as well, uses Kryo
for serialization (to be accurate, it's Chill for Spark), and I
found it very impressive even compared to "manual" serializations;
i.e., it seems to outperform Spark's "native" Encoders (1.6+) for
primitives.
In addition it clearly supports Java and Scala, and there are 3rd party
libraries for Clojure and Objective-C.
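
For anyone who hasn't used it, a minimal Kryo round trip (a toy example of my
own, not Beam code) looks roughly like this:

  // Uses com.esotericsoftware.kryo.Kryo and kryo.io.{Input, Output}
  Kryo kryo = new Kryo();
  ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  Output output = new Output(bytes);
  kryo.writeObject(output, new ArrayList<String>()); // no schema or generated code needed
  output.close();
  Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()));
  List<?> roundTripped = kryo.readObject(input, ArrayList.class);
  input.close();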

I guess my bottom-line here agrees with Kenneth - performance and
interoperability - but I'm just not sure if schema based serializers are
*always* the fastest.

As for pipeline serialization, since performance is not the main issue, and
I think usability would be very important, I say +1 for JSON.

For anyone who has spent some time benchmarking serialization libraries, now
is the time to speak up ;)

Thanks,
Amit

On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek 
wrote:

> Hi,
> am I correct in assuming that the transmitted envelopes would mostly
> contain coder-serialized values? If so, wouldn't the header of an envelope
> just be the number of contained bytes and number of values? I'm probably
> missing something but with these assumptions I don't see the benefit of
> using something like Avro/Thrift/Protobuf for serializing the main-input
> value envelopes. We would just need a system that can send byte data really
> fast between languages/VMs.
>
> By the way, another interesting question (at least for me) is how other
> data, such as side-inputs, is going to arrive at the DoFn if we want to
> support a general interface for different languages.
>
> Cheers,
> Aljoscha
>
> On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles 
> wrote:
>
> > (Apologies for the formatting)
> >
> > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles 
> wrote:
> >
> > > Hello everyone!
> > >
> > > We are busily working on a Runner API (for building and transmitting
> > > pipelines)
> > > and a Fn API (for invoking user-defined functions found within
> pipelines)
> > > as
> > > outlined in the Beam technical vision [1]. Both of these require a
> > > language-independent serialization technology for interoperability
> > between
> > > SDKs
> > > and runners.
> > >
> > > The Fn API includes a high-bandwidth data plane where bundles are
> > > transmitted
> > > via some serialization/RPC envelope (inside the envelope, the stream of
> > > elements is encoded with a coder) to transfer bundles between the
> runner
> > > and
> > > the SDK, so performance is extremely important. There are many choices
> > for
> > > high
> > > performance serialization, and we would like to start the conversation
> > > about
> > > what serialization technology is best for Beam.
> > >
> > > The goal of this discussion is to arrive at consensus on the question:
> > > What
> > > serialization technology should we use for the data plane envelope of
> the
> > > Fn
> > > API?
> > >
> > > To facilitate community discussion, we looked at the available
> > > technologies and
> > > tried to narrow the choices based on three criteria:
> > >
> > >  - Performance: What is the size of serialized data? How do we expect
> the
> > >technology to affect pipeline speed and cost? etc
> > >
> > >  - Language support: Does the technology support the most widespread
> > >    languages for data processing? Does it have a vibrant ecosystem of
> > >    contributed language bindings? etc
> > >
> > >  - Community: What is the adoption of the technology? How mature is it?
> > > How
> > >active is development? How is the documentation? etc
> > >
> > > Given these criteria, we came up with four technologies that are good
> > > contenders. All have similar & adequate schema capabilities.
> > >
> > >  - Apache Avro: Does not require code gen, but embedding the schema in
> > the
> > > data
> > >could be an issue. Very popular.
> > >
> > >  - Apache Thrift: Probably a bit faster and more compact than Avro. A
> > >    huge number of languages supported.
> > >
> > >  - Protocol Buffers 3: Incorporates the lessons that Google has learned
> > > through
> > >long-term use of Protocol Buffers.
> > >
> > >  - FlatBuffers: Some benchmarks imply great performance from the
> > zero-copy
> > > mmap
> > >idea. We would need to run representative experiments.
> > >
> > > I want to emphasize that this is a community decision, and this thread
> is
> > > just
> > > the conversation starter for us all to weigh in. We just wanted to do
> > some
> > > legwork to focus the discussion if we could.
> > >
> > > And there's a minor follow-up question: Once we settle here, is that
> > > technology
> > > also suitable for the low-bandwidth Runner API for defining pipelines,
> or
> > > does
> > > anyone think we need to consider a second technology (like JSON) for
> > > usability
> > > reasons?
> > >
> > > [1]
> > >
> >
> https://docs.google.com/presentation/d/

Re: [DISCUSS] Beam data plane serialization tech

2016-06-17 Thread Aljoscha Krettek
Hi,
am I correct in assuming that the transmitted envelopes would mostly
contain coder-serialized values? If so, wouldn't the header of an envelope
just be the number of contained bytes and number of values? I'm probably
missing something but with these assumptions I don't see the benefit of
using something like Avro/Thrift/Protobuf for serializing the main-input
value envelopes. We would just need a system that can send byte data really
fast between languages/VMs.

By the way, another interesting question (at least for me) is how other
data, such as side-inputs, is going to arrive at the DoFn if we want to
support a general interface for different languages.

Cheers,
Aljoscha

On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles  wrote:

> (Apologies for the formatting)
>
> On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles  wrote:
>
> > Hello everyone!
> >
> > We are busily working on a Runner API (for building and transmitting
> > pipelines)
> > and a Fn API (for invoking user-defined functions found within pipelines)
> > as
> > outlined in the Beam technical vision [1]. Both of these require a
> > language-independent serialization technology for interoperability
> between
> > SDKs
> > and runners.
> >
> > The Fn API includes a high-bandwidth data plane where bundles are
> > transmitted
> > via some serialization/RPC envelope (inside the envelope, the stream of
> > elements is encoded with a coder) to transfer bundles between the runner
> > and
> > the SDK, so performance is extremely important. There are many choices
> for
> > high
> > performance serialization, and we would like to start the conversation
> > about
> > what serialization technology is best for Beam.
> >
> > The goal of this discussion is to arrive at consensus on the question:
> > What
> > serialization technology should we use for the data plane envelope of the
> > Fn
> > API?
> >
> > To facilitate community discussion, we looked at the available
> > technologies and
> > tried to narrow the choices based on three criteria:
> >
> >  - Performance: What is the size of serialized data? How do we expect the
> >technology to affect pipeline speed and cost? etc
> >
> >  - Language support: Does the technology support the most widespread
> >    languages for data processing? Does it have a vibrant ecosystem of
> >    contributed language bindings? etc
> >
> >  - Community: What is the adoption of the technology? How mature is it?
> > How
> >active is development? How is the documentation? etc
> >
> > Given these criteria, we came up with four technologies that are good
> > contenders. All have similar & adequate schema capabilities.
> >
> >  - Apache Avro: Does not require code gen, but embedding the schema in
> the
> > data
> >could be an issue. Very popular.
> >
> >  - Apache Thrift: Probably a bit faster and more compact than Avro. A
> >    huge number of languages supported.
> >
> >  - Protocol Buffers 3: Incorporates the lessons that Google has learned
> > through
> >long-term use of Protocol Buffers.
> >
> >  - FlatBuffers: Some benchmarks imply great performance from the
> zero-copy
> > mmap
> >idea. We would need to run representative experiments.
> >
> > I want to emphasize that this is a community decision, and this thread is
> > just
> > the conversation starter for us all to weigh in. We just wanted to do
> some
> > legwork to focus the discussion if we could.
> >
> > And there's a minor follow-up question: Once we settle here, is that
> > technology
> > also suitable for the low-bandwidth Runner API for defining pipelines, or
> > does
> > anyone think we need to consider a second technology (like JSON) for
> > usability
> > reasons?
> >
> > [1]
> >
> https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
> >
> >
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Jean-Baptiste Onofré


So it will go in the RELEASE NOTES for the next release: fair enough. I was
just a bit surprised as I missed this JIRA.
Thanks!
Regards
JB


Sent from my Samsung device

 Original message 
From: Aljoscha Krettek  
Date: 17/06/2016  10:58  (GMT+01:00) 
To: dev@beam.incubator.apache.org 
Subject: Re: [NOTICE] Change on Filter 

There has been an issue about this for a while now:
https://issues.apache.org/jira/browse/BEAM-234

On Fri, 17 Jun 2016 at 09:55 Jean-Baptiste Onofré  wrote:

> Hi Ismaël,
>
> I didn't talk a change between Dataflow SDK and Beam, I'm talking about
> a change between two Beam SNAPSHOTs ;)
>
> For the naming of the DirectRunner, I saw it also, and we should align
> the runners naming (we have FlinkPipelineRunner and
> SparkPipelineRunner). I sent an e-mail to Davor and Frances to discuss
> about that.
>
> Regards
> JB
>
> On 06/17/2016 09:42 AM, Ismaël Mejía wrote:
> > Do we have a list of breaking changes (from the Google Dataflow SDK to
> > Beam), because this is going to be important considering other recent
> > breaking changes, for example these two that I found yesterday too:
> >
> > DirectPipelineRunner -> DirectRunner
> > DoFnTester.processBatch -> DoFnTester.processBundle (?)
> >
> > Ismael.
> >
> >
> >
> >
> > On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi guys,
> >>
> >> I tested the latest Beam SNAPSHOT this morning and code which was
> >> working yesterday is now broken with the latest changes.
> >>
> >> I'm using a filter by predicate:
> >>
> >>  .apply("Filtering", Filter.byPredicate(new
> >> SerializableFunction<String, Boolean>() {
> >>  public Boolean apply(String input) {
> >>  ...
> >>  })).
> >>
> >> The filter method has been renamed from byPredicate() to by().
> >>
> >> Just to let others know that it can impact their pipelines.
> >>
> >> Regards
> >> JB
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Aljoscha Krettek
There has been an issue about this for a while now:
https://issues.apache.org/jira/browse/BEAM-234

On Fri, 17 Jun 2016 at 09:55 Jean-Baptiste Onofré  wrote:

> Hi Ismaël,
>
> I didn't talk a change between Dataflow SDK and Beam, I'm talking about
> a change between two Beam SNAPSHOTs ;)
>
> For the naming of the DirectRunner, I saw it also, and we should align
> the runners naming (we have FlinkPipelineRunner and
> SparkPipelineRunner). I sent an e-mail to Davor and Frances to discuss
> about that.
>
> Regards
> JB
>
> On 06/17/2016 09:42 AM, Ismaël Mejía wrote:
> > Do we have a list of breaking changes (from the Google Dataflow SDK to
> > Beam), because this is going to be important considering other recent
> > breaking changes, for example these two that I found yesterday too:
> >
> > DirectPipelineRunner -> DirectRunner
> > DoFnTester.processBatch -> DoFnTester.processBundle (?)
> >
> > Ismael.
> >
> >
> >
> >
> > On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi guys,
> >>
> >> I tested the latest Beam SNAPSHOT this morning and code which was
> >> working yesterday is now broken with the latest changes.
> >>
> >> I'm using a filter by predicate:
> >>
> >>  .apply("Filtering", Filter.byPredicate(new
> >> SerializableFunction<String, Boolean>() {
> >>  public Boolean apply(String input) {
> >>  ...
> >>  })).
> >>
> >> The filter method has been renamed from byPredicate() to by().
> >>
> >> Just to let others know that it can impact their pipelines.
> >>
> >> Regards
> >> JB
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Jean-Baptiste Onofré

Hi Ismaël,

I didn't talk a change between Dataflow SDK and Beam, I'm talking about 
a change between two Beam SNAPSHOTs ;)


For the naming of the DirectRunner, I saw it also, and we should align 
the runners naming (we have FlinkPipelineRunner and 
SparkPipelineRunner). I sent an e-mail to Davor and Frances to discuss 
about that.


Regards
JB

On 06/17/2016 09:42 AM, Ismaël Mejía wrote:

Do we have a list of breaking changes (from the Google Dataflow SDK to
Beam), because this is going to be important considering other recent
breaking changes, for example these two that I found yesterday too:

DirectPipelineRunner -> DirectRunner
DoFnTester.processBatch -> DoFnTester.processBundle (?)

Ismael.




On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré 
wrote:


Hi guys,

I tested the latest Beam SNAPSHOT this morning and code which was
working yesterday is now broken with the latest changes.

I'm using a filter by predicate:

 .apply("Filtering", Filter.byPredicate(new
SerializableFunction<String, Boolean>() {
 public Boolean apply(String input) {
 ...
 })).

The filter method has been renamed from byPredicate() to by().

Just to let others know that it can impact their pipelines.

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [thread fork] Apache Beam & Google Cloud Dataflow

2016-06-17 Thread Ismaël Mejía
Hello Frances,

Thanks for clearing this up. I hope you (Google) can somehow make this
official (maybe in the FAQ too), to the effect that users can 'experiment'
with moving their code bases to Beam (without support until the official
release). Anyway, it is great to know that this works (at least from an
unsupported but technically feasible point of view).

Ismaël

On Fri, Jun 17, 2016 at 9:14 AM, Jean-Baptiste Onofré 
wrote:

> Hi Frances,
>
> thanks for the details (and I like your Google hat ;)). I was speaking more
> from a technical point of view ;)
>
> Regards
> JB
>
>
> On 06/17/2016 07:21 AM, Frances Perry wrote:
>
>> With my Google employee hat on, I'd like to soften that claim a little ;-)
>>
>> Currently, the Beam SDK runs against Google Cloud Dataflow. But since Beam
>> isn't itself ready for prime time yet, Google doesn't officially provide
>> support for running Beam on Cloud Dataflow right now, and Google Cloud
>> Dataflow customers should still use the original Dataflow Java SDK.
>>
>> But I, for one, am looking forward to this evolving over the next few
>> months as Beam stabilizes ;-D
>>
>>
>> On Thu, Jun 16, 2016 at 9:50 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi,
>>>
>>> as soon as you use the Beam dataflow runner, it should work smoothly.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 06/16/2016 10:05 PM, Ismaël Mejía wrote:
>>>
>>> Hello,

 One additional comment / question. I just noticed that Beam users
 already
 can write their Beam Pipelines and execute them in the google dataflow
 runner.

 I just did the test today and I was thrilled to confirm that it worked
 (as
 JB told me).

 You can look at the SDK version in the image:
 https://imgur.com/k9HnLnv

 The question is, is this some kind of beta, or is this going to be
 supported during the transition (before the formal release 1.0) ? I ask
 this because I suppose many current google users hesitate to move to
 Beam
 for the moment because they don't know that they can already run their
 pipelines in the Google Cloud Dataflow service. I think this would be a good
 way to encourage users to move their data processing pipelines to the Beam
 version.

 Regards,
 Ismaël




 On Wed, Jun 15, 2016 at 11:21 PM, James Malone <
 jamesmal...@google.com.invalid> wrote:

 Hi everyone,

>
> This is a thread fork from the email thread titled '[dev] Announcing
> 0.1.0-incubating release'.
>
> In that thread, Amir posed a good question:
>
>  Why is "Google Cloud Dataflow" still included in the Beam release if
>  Beam is indeed an evolution (super-set?) of "Google Cloud Dataflow"?
>  Thanks + regards, Amir
>
> Many parts of Apache Beam are based on work from Google Cloud Dataflow,
> including the Dataflow (now Beam) model, SDKs (Java and Python), and
> some
> of the runners. This work was combined with awesome contributions from
> other groups (data Artisans/Apache Flink, Cloudera & PayPal/Apache
> Spark,
> etc.) to form the basis for Apache Beam[1]. Originally, the Cloud
> Dataflow
> SDK included machinery so Dataflow pipelines could be executed on
> Google
> Cloud Dataflow.
>
> An important part of Apache Beam is the ability to execute Beam
> pipelines
> on many runners (see the compatibility matrix[2] for full details and
> support.) The Beam project includes a runner for Google Cloud Dataflow,
> along with others, such as runners for Apache Flink and Apache Spark.
> We're
> also focused (and excited!) to support and grow new runners. As a
> separate
> runner, the work for supporting execution on Cloud Dataflow can be
> separated into the runner from the larger Apache Beam effort.
>
> So, to summarize:
>
> Beam is based on work from Google Cloud Dataflow so it's definitely an
> evolution. Additionally, Beam includes a runner (one of many) for
> Google's
> Cloud Dataflow service.
>
> Hope that helps!
>
> James
>
> [1]: http://wiki.apache.org/incubator/BeamProposal
> [2]: http://beam.incubator.apache.org/capability-matrix
>
>
>
 --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Ismaël Mejía
Do we have a list of breaking changes (from the Google Dataflow SDK to
Beam), because this is going to be important considering other recent
breaking changes, for example these two that I found yesterday too:

DirectPipelineRunner -> DirectRunner
DoFnTester.processBatch -> DoFnTester.processBundle (?)

Ismael.




On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> I tested the latest Beam SNAPSHOT this morning and code which was
> working yesterday is now broken with the latest changes.
>
> I'm using a filter by predicate:
>
> .apply("Filtering", Filter.byPredicate(new
> SerializableFunction<String, Boolean>() {
> public Boolean apply(String input) {
> ...
> })).
>
> The filter method has been renamed from byPredicate() to by().
>
> Just to let others know that it can impact their pipelines.
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


[NOTICE] Change on Filter

2016-06-17 Thread Jean-Baptiste Onofré

Hi guys,

I tested the latest Beam SNAPSHOT this morning and code which was
working yesterday is now broken with the latest changes.


I'm using a filter by predicate:

.apply("Filtering", Filter.byPredicate(new 
SerializableFunction<String, Boolean>() {

public Boolean apply(String input) {
...
})).

The filter method has been renamed from byPredicate() to by().
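
For reference, a minimal sketch of the same filter with the renamed method
(same predicate shape; the example body is my own, not from the original code)
would be:

 .apply("Filtering", Filter.by(new
 SerializableFunction<String, Boolean>() {
     public Boolean apply(String input) {
         return input.startsWith("beam"); // example predicate
     }
 }))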

Just to let others know that it can impact their pipelines.

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [thread fork] Apache Beam & Google Cloud Dataflow

2016-06-17 Thread Jean-Baptiste Onofré

Hi Frances,

thanks for the details (and I like your Google hat ;)). I was speaking more
from a technical point of view ;)


Regards
JB

On 06/17/2016 07:21 AM, Frances Perry wrote:

With my Google employee hat on, I'd like to soften that claim a little ;-)

Currently, the Beam SDK runs against Google Cloud Dataflow. But since Beam
isn't itself ready for prime time yet, Google doesn't officially provide
support for running Beam on Cloud Dataflow right now, and Google Cloud
Dataflow customers should still use the original Dataflow Java SDK.

But I, for one, am looking forward to this evolving over the next few
months as Beam stabilizes ;-D


On Thu, Jun 16, 2016 at 9:50 PM, Jean-Baptiste Onofré 
wrote:


Hi,

as soon as you use the Beam dataflow runner, it should work smoothly.

Regards
JB


On 06/16/2016 10:05 PM, Ismaël Mejía wrote:


Hello,

One additional comment / question. I just noticed that Beam users already
can write their Beam Pipelines and execute them in the google dataflow
runner.

I just did the test today and I was thrilled to confirm that it worked (as
JB told me).

You can look at the SDK version in the image:
https://imgur.com/k9HnLnv

The question is, is this some kind of beta, or is this going to be
supported during the transition (before the formal release 1.0) ? I ask
this because I suppose many current google users hesitate to move to Beam
for the moment because they don't know that they can already run their
pipelines in the Google Cloud Dataflow service. I think this would be a good
way to encourage users to move their data processing pipelines to the Beam
version.

Regards,
Ismaël




On Wed, Jun 15, 2016 at 11:21 PM, James Malone <
jamesmal...@google.com.invalid> wrote:

Hi everyone,


This is a thread fork from the email thread titled '[dev] Announcing
0.1.0-incubating release'.

In that thread, Amir posed a good question:

 Why is "Google Cloud Dataflow" still included in the Beam release if
 Beam is indeed an evolution (super-set?) of "Google Cloud Dataflow"?
 Thanks + regards, Amir

Many parts of Apache Beam are based on work from Google Cloud Dataflow,
including the Dataflow (now Beam) model, SDKs (Java and Python), and some
of the runners. This work was combined with awesome contributions from
other groups (data Artisans/Apache Flink, Cloudera & PayPal/Apache Spark,
etc.) to form the basis for Apache Beam[1]. Originally, the Cloud
Dataflow
SDK included machinery so Dataflow pipelines could be executed on Google
Cloud Dataflow.

An important part of Apache Beam is the ability to execute Beam pipelines
on many runners (see the compatibility matrix[2] for full details and
support.) The Beam project includes a runner for Google Cloud Dataflow,
along with others, such as runners for Apache Flink and Apache Spark.
We're
also focused (and excited!) to support and grow new runners. As a
separate
runner, the work for supporting execution on Cloud Dataflow can be
separated into the runner from the larger Apache Beam effort.

So, to summarize:

Beam is based on work from Google Cloud Dataflow so it's definitely an
evolution. Additionally, Beam includes a runner (one of many) for
Google's
Cloud Dataflow service.

Hope that helps!

James

[1]: http://wiki.apache.org/incubator/BeamProposal
[2]: http://beam.incubator.apache.org/capability-matrix





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com