Re: [PROPOSAL] State and Timers for DoFn (aka per-key workflows)

2016-10-14 Thread Kenneth Knowles
Hi all,

I thought I would loop back on this proposal and email thread with an FYI
that coding has begun for this design. Here are some recent PRs for your
perusal, if you are interested.

https://github.com/apache/incubator-beam/pull/1044 "Refactor StateSpec
out of StateTag"
https://github.com/apache/incubator-beam/pull/1086 "Add DoFn.StateId
annotation and validation on fields"
https://github.com/apache/incubator-beam/pull/1102 "Add initial bits for
DoFn Timer API"
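For a rough feel of what a field-level state annotation plus validation involves (the subject of PR #1086), here is a toy sketch using a custom annotation and reflection. The names and the final-field rule here are illustrative stand-ins, not Beam's actual DoFn.StateId contract:

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;

// Toy illustration of the idea in "Add DoFn.StateId annotation and validation
// on fields": a field-level annotation naming a state cell, plus a reflective
// validation pass. This is a sketch of the concept, not Beam's actual API.
public class StateIdDemo {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface StateId { String value(); }

    static class MyFn {
        @StateId("buffer")
        private final String bufferSpec = "spec";  // stands in for a state spec field
    }

    // Validation pass: every @StateId field must be final (an assumed rule,
    // used here only to show what "validation on fields" can look like).
    static void validate(Class<?> fnClass) {
        for (Field f : fnClass.getDeclaredFields()) {
            StateId id = f.getAnnotation(StateId.class);
            if (id != null && !java.lang.reflect.Modifier.isFinal(f.getModifiers())) {
                throw new IllegalArgumentException(
                    "@StateId(\"" + id.value() + "\") field must be final");
            }
        }
    }

    public static void main(String[] args) {
        validate(MyFn.class);
        System.out.println("validated");
    }
}
```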

Kenn

On Fri, Jul 29, 2016 at 12:26 PM Jean-Baptiste Onofré 
wrote:

> +1
>
> It sounds very good.
>
> Regards
> JB
>
> On 07/27/2016 05:20 AM, Kenneth Knowles wrote:
> > Hi everyone,
> >
> >
> > I would like to offer a proposal for a much-requested feature in Beam:
> > Stateful processing in a DoFn. Please check out and comment on the
> proposal
> > at this URL:
> >
> >
> >   https://s.apache.org/beam-state
> >
> >
> > This proposal includes user-facing APIs for persistent state and timers.
> > Together, these provide rich capabilities that have been called "per-key
> > workflows", the subject of [BEAM-23].
> >
> >
> > Note that this proposal has an important prerequisite: a new design for
> > DoFn. The new DoFn is strongly motivated by this design for state and
> > timers, but we should discuss it separately. I will start a separate
> thread
> > for that.
> >
> >
> > On this email thread, I'd like to try to focus the discussion on state &
> > timers. And of course, please do comment on the particulars in the
> document.
> >
> >
> > Kenn
> >
> >
> > [BEAM-23] https://issues.apache.org/jira/browse/BEAM-23
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Adding display data to Python SDK

2016-10-14 Thread Pablo Estrada
Hello there,
I started working on adding display data to the Python SDK. This feature is
under Jira issue BEAM-722.
I have a small commit that adds the infrastructure for this (linked in the
JIRA issue).
Feedback is welcome. If everyone is okay with my changes, I'll continue on
my path.
Best
-P.


Re: Specifying type arguments for generic PTransform builders

2016-10-14 Thread Robert Bradshaw
On Thu, Oct 13, 2016 at 10:36 PM, Eugene Kirpichov
 wrote:
> I think the choice between #1 or #3 is a red herring - the cases where #3
> is a better choice than #1 are few and far between, and probably not at all
> controversial (e.g. ParDo). So I suggest we drop this part of the
> discussion.

I decided to take a peek at the transforms we currently have, and
actually it seems that most of them fall into the category of having
zero or one "primary, required" arguments intrinsic to what the
transform is, and then possibly some optional ones. I'm becoming even
more of a fan of #3--it makes it harder for the user to even construct an
invalid transform (and it is more self-documenting too, both in docs and in
IDE autocompletion, making clear what's essential vs. the slew of optional
things).

We do lose the "database reader ready-to-go" bit, but I'm not sure
that's such a loss. One can instead have

class CompanyDefaults {
  public static DatabaseIO.Read setup(DatabaseIO.Read reader) { ... }
}

which is actually superior if DatabaseIO.Read is a base class (or
interface) that may have several implementations.
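A self-contained sketch of this "configure, then bind the specifics later" pattern, using a toy immutable Read builder as a stand-in (the class and method names here are illustrative, not the actual DatabaseIO API):

```java
// Toy stand-in for a DatabaseIO.Read-style builder, illustrating the
// CompanyDefaults helper from the thread: it configures a reader without
// binding it to a particular query. Names are illustrative, not Beam's API.
public class CompanyDefaultsDemo {
    static class Read {
        final String settings;
        final String credentials;
        final String query;
        Read(String settings, String credentials, String query) {
            this.settings = settings;
            this.credentials = credentials;
            this.query = query;
        }
        static Read create() { return new Read(null, null, null); }
        Read withSettings(String s) { return new Read(s, credentials, query); }
        Read withCredentials(String c) { return new Read(settings, c, query); }
        Read withQuery(String q) { return new Read(settings, credentials, q); }
    }

    // The "company defaults" helper: takes any Read (possibly a subclass)
    // and returns it with organization-wide settings applied.
    static class CompanyDefaults {
        static Read setup(Read reader) {
            return reader.withSettings("host=db.internal")
                         .withCredentials("svc-account");
        }
    }

    public static void main(String[] args) {
        Read source = CompanyDefaults.setup(Read.create()).withQuery("SELECT 1");
        System.out.println(source.settings + "|" + source.credentials + "|" + source.query);
    }
}
```

Because setup takes the reader as an argument rather than constructing it, the same helper works for any implementation of the reader type, which is the advantage Robert notes for base classes and interfaces.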

> Looks like the main contenders for the complex case are #1 (Foo.blah())
> vs. #4 (Foo.Unbound and Foo.Bound).
>
> Dan, can you clarify again what you mean by this:
> "a utility function that gives you a database reader ready-to-go ... but
> independent of the type you want the result to end up as. You can't do
> that if you must bind the type first."
>
> I think this is quite doable with #1:
>
> class CompanyDefaults {
> public static  DatabaseIO.Read defaultDatabaseIO() { return
> DatabaseIO.create().withSettings(blah).withCredentials(blah); }
> }
>
> DatabaseIO.Read source =
> CompanyDefaults.defaultDatabaseIO().withQuery(blah);
>
> All in all, it seems to me that #1 (combined with potentially encapsulating
> parts of the configuration into separate objects, such as
> JdbcConnectionParameters in JdbcIO) is the only option that can do
> everything fairly well, its only downside is having to specify the type,

Having to repeat the type is a significant downside, especially when
your types get long. (Yes, I've faced types that get so verbose you
have to figure out where to put the line breaks.) This is why
inference of template arguments was added to the language, and the
whole reason for the existence of many of Guava's "constructors" like
Lists.newArrayList(), etc. (now obsolete due to constructors allowing
inference).
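As an illustration of why those factory "constructors" existed: before Java 7's diamond operator, constructors could not infer type arguments but static methods could. A minimal sketch (newArrayList here is a local re-creation, not an import of Guava):

```java
import java.util.ArrayList;
import java.util.List;

public class InferenceDemo {
    // Guava-style factory: the compiler infers T from the assignment target.
    static <T> List<T> newArrayList() {
        return new ArrayList<T>();
    }

    public static void main(String[] args) {
        // Pre-Java-7, the alternative was repeating a possibly long type:
        //   List<Map<String, List<Integer>>> xs =
        //       new ArrayList<Map<String, List<Integer>>>();
        List<String> xs = newArrayList();    // T inferred as String
        xs.add("inference");
        List<String> ys = new ArrayList<>(); // Java 7+ diamond makes the factory obsolete
        ys.add("diamond");
        System.out.println(xs.get(0) + " " + ys.get(0));
    }
}
```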

> and it is very easy to implement when you're implementing your own
> transform - which, I agree with Kenn, matters a lot too.
>
> I think that coming up with an easy-to-implement, universally applicable
> pattern matters a lot also because the Beam ecosystem is open, and the set
> of connectors/transforms available to users will not always be as carefully
> curated and reviewed as it is currently - the argument "the implementation
> complexity doesn't matter because the user doesn't see it" will not apply.
> So, ultimately, "are there a lot of good-quality connectors available to
> Beam users" will be equivalent to "is it easy to develop a good-quality
> connector". And the consistency between APIs provided by different
> connectors will matter for the user experience, too.

+1

> On Thu, Oct 13, 2016 at 7:09 PM Kenneth Knowles 
> wrote:
>
>> On Thu, Oct 13, 2016 at 4:55 PM Dan Halperin 
>> wrote:
>> > These
>> > suggestions are motivated by making things easier on transform writers,
>> but
>> > IMO we need to be optimizing for transform users.
>>
>> To be fair to Eugene, he was actually analyzing real code patterns that
>> exist in Beam today, not suggesting new ones.
>>
>> Along those lines, your addition of the BigTableIO pattern is well-taken
>> and my own analysis of that code is #5: "when you don't have a type
>> variable to bind, leave every field blank and validate later. Also, have an
>> XYZOptions object". I believe in the presence of type parameters this
>> reduces to #4 Bound/Unbound classes but it is more palatable in the
>> no-type-variable case. It is also a good choice when varying subsets of
>> parameters might be optional - the Window transform matches this pattern
>> for good reason.
>>
>> The other major drawback of #3 is the inability to provide generic
>> > configuration. For example, a utility function that gives you a database
>> > reader ready-to-go with all the default credentials and options you need
>> --
>> > but independent of the type you want the result to end up as. You can't
>> do
>> > that if you must bind the type first.
>> >
>>
>> This is a compelling use case. It is valuable for configuration to be a
>> first-class object that can be passed around. BigTableOptions is a good
>> example. It isn't in contradiction with #3, but actually fits very nicely.
>>
>> By definition for this default configuration to be first-class it has to be
>> more than an invalid intermediate state of a 

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-14 Thread Daniel Kulp

> On Oct 14, 2016, at 7:46 AM, Jean-Baptiste Onofré  wrote:
> I think we agreed on most of the points. We also agreed that points 4 & 5 
> should be a best effort and not “enforced”.

4 and 5 are really just needed when any “significant change” is part of the 
discussion.   Things like whitespace and typos and such don’t need to be pushed 
back to the list.  However, if you run into something that “just doesn’t work”, 
that should be reflected back.   As an example: on my first GridFS PR, Eugene 
basically stated that the entire structure needed to change (from doing 
everything in the Source to splitting into a simpler Source and a ParDo), and 
that’s something that should have been brought back to the list in a summary.   

BTW:  this ALSO goes for anything in Google Docs.   Some of the proposals are 
in Google Doc form, so any changes/discussions there are also not “at 
Apache”.   In looking at the proposed workflow, I also think that when the 
proposal is accepted/complete/whatever, the Google Doc needs to be exported to 
something and attached to the JIRA so it is properly archived at Apache.   
Personally, I’d prefer limiting the use of Google Docs, but I do understand 
that having the diagrams can be important.   Not sure how to deal with that in 
email.  One thing the incubator does with proposals from their wiki is require 
the full TEXT of the proposal to be copied into the email (so it is searchable 
and it is clear what version is being looked at) with a link to the full doc.   

Dan


> 
> If there's no objection, I will create the review mailing list and update the 
> github integration configuration.
> 
> Thanks all for your comments and feedback!
> Regards
> JB
> 
> On 10/06/2016 01:53 PM, Jean-Baptiste Onofré wrote:
>> Hi team,
>> 
>> following the discussion we had about technical discussion that should
>> happen on the mailing list, I would like to propose the following:
>> 
>> 1. We create a new mailing list: rev...@beam.incubator.apache.org.
>> 2. We configure the github integration to send all pull request comments to
>> the review mailing list. It would allow us to track and simplify the way we
>> read the comments and keep up to date.
>> 3. A technical discussion should be sent to the dev mailing list with the
>> [DISCUSS] keyword in the subject.
>> 4. Once a discussion is open, the author should periodically send an
>> update on the discussion (once a week) containing a quick, direct summary
>> of the latest exchanges on the Jira or github.
>> 5. Once we consider the discussion closed (no update in the last two
>> weeks), the author sends a [CLOSE] e-mail on the thread.
>> 
>> WDYT ?
>> 
>> Regards
>> JB
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com

-- 
Daniel Kulp
dk...@apache.org - http://dankulp.com/blog
Talend Community Coder - http://coders.talend.com



Re: Documentation for IDE setup

2016-10-14 Thread Jean-Baptiste Onofré

I'm gonna merge.

Thanks.

On 10/14/2016 05:37 PM, Daniel Kulp wrote:



On Oct 14, 2016, at 10:06 AM, Jesse Anderson  wrote:

Last week I imported Beam with IntelliJ and everything worked.

That said, I tried to import the Eclipse project and that doesn't compile
anymore. I didn't have time to figure out what happened though.



I have a pull request https://github.com/apache/incubator-beam/pull/1094 that 
fixes the compile issues.  It has two LGTMs; it just needs someone to merge it.

With eclipse, you need to have all the needed m2e connectors.   Some of them 
(find bugs, check style) can be auto-detected and installed when beam is first 
imported.   The apt one doesn’t.   You need to go to the eclipse marketplace, 
install it, then configure it in the Eclipse properties to turn on the 
“experimental” m2e-apt processing.   Once you do that, a refresh of the maven 
projects should result in it building/compiling.

Running tests is another matter.   Since eclipse compiles everything in a 
module in one pass (instead of two like maven), one of the apt processors 
doesn’t know where to output files and always dumps the files in /classes 
instead of /test-classes.   Thus, any test that relies on a runner will likely 
fail as it results in the “test” versions of various services from core being 
picked up.  A simple:

rm sdks/java/core/target/classes/META-INF/services/*

from the command line will fix that.   That should also be documented on the 
IDE page until someone can figure out how to work around it.

Dan




On Fri, Oct 14, 2016 at 1:21 AM Jean-Baptiste Onofré 
wrote:


Hi Christian,

IntelliJ doesn't need any special config (maybe the code style can be
documented or imported).

Anyway, I agree to add such docs to the website in the contribute directory. I
think it could be part of the contribution-guide as it's the first setup step.

Regards
JB

On 10/14/2016 10:17 AM, Christian Schneider wrote:

Hello all,

I am new to the Beam community and am currently making myself
familiar with the code.  I quickly found the contribution guide and was
able to clone the code and build Beam using Maven.

The first obstacle I faced was getting the code to build in Eclipse. I
naively imported it as existing Maven projects but got lots of compile
errors. After talking to Dan Kulp we found that this is due to the apt
annotation processing for auto value types. Dan explained to me how I need
to set up Eclipse to make it work.

I still got 5 compile errors (some bound mismatch at Read.bounded, and
one ambiguous method empty). These errors seem to be present for
everyone using Eclipse, and Dan is working on it. So I think this is not a
permanent problem.

To make it easier for new people I would like to write documentation
about the IDE setup. I can cover the Eclipse part, but I think IntelliJ
should also be described.

I already started on it and placed it in /contribute/ide-setup. Does
that make sense?

I have not linked to it from anywhere yet. I think it should be
linked in contribute/index and in the Contribute menu.

Christian



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Documentation for IDE setup

2016-10-14 Thread Daniel Kulp

> On Oct 14, 2016, at 10:06 AM, Jesse Anderson  wrote:
> 
> Last week I imported Beam with IntelliJ and everything worked.
> 
> That said, I tried to import the Eclipse project and that doesn't compile
> anymore. I didn't have time to figure out what happened though.
> 

I have a pull request https://github.com/apache/incubator-beam/pull/1094 that 
fixes the compile issues.  It has two LGTMs; it just needs someone to merge it. 

With eclipse, you need to have all the needed m2e connectors.   Some of them 
(find bugs, check style) can be auto-detected and installed when beam is first 
imported.   The apt one doesn’t.   You need to go to the eclipse marketplace, 
install it, then configure it in the Eclipse properties to turn on the 
“experimental” m2e-apt processing.   Once you do that, a refresh of the maven 
projects should result in it building/compiling.

Running tests is another matter.   Since eclipse compiles everything in a 
module in one pass (instead of two like maven), one of the apt processors 
doesn’t know where to output files and always dumps the files in /classes 
instead of /test-classes.   Thus, any test that relies on a runner will likely 
fail as it results in the “test” versions of various services from core being 
picked up.  A simple:

rm sdks/java/core/target/classes/META-INF/services/*

from the command line will fix that.   That should also be documented on the 
IDE page until someone can figure out how to work around it.

Dan



> On Fri, Oct 14, 2016 at 1:21 AM Jean-Baptiste Onofré 
> wrote:
> 
>> Hi Christian,
>> 
>> IntelliJ doesn't need any special config (maybe the code style can be
>> documented or imported).
>> 
>> Anyway, agree to add such on website in the contribute directory. I
>> think it could be part of the contribution-guide as it's first setup step.
>> 
>> Regards
>> JB
>> 
>> On 10/14/2016 10:17 AM, Christian Schneider wrote:
>>> Hello all,
>>> 
>>> I am new to the Beam community and am currently making myself
>>> familiar with the code.  I quickly found the contribution guide and was
>>> able to clone the code and build Beam using Maven.
>>> 
>>> The first obstacle I faced was getting the code to build in Eclipse. I
>>> naively imported it as existing Maven projects but got lots of compile
>>> errors. After talking to Dan Kulp we found that this is due to the apt
>>> annotation processing for auto value types. Dan explained to me how I need
>>> to set up Eclipse to make it work.
>>> 
>>> I still got 5 compile errors (some bound mismatch at Read.bounded, and
>>> one ambiguous method empty). These errors seem to be present for
>>> everyone using Eclipse, and Dan is working on it. So I think this is not a
>>> permanent problem.
>>> 
>>> To make it easier for new people I would like to write documentation
>>> about the IDE setup. I can cover the Eclipse part, but I think IntelliJ
>>> should also be described.
>>> 
>>> I already started on it and placed it in /contribute/ide-setup. Does
>>> that make sense?
>>> 
>>> I have not linked to it from anywhere yet. I think it should be
>>> linked in contribute/index and in the Contribute menu.
>>> 
>>> Christian
>>> 
>> 
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>> 

-- 
Daniel Kulp
dk...@apache.org - http://dankulp.com/blog
Talend Community Coder - http://coders.talend.com



Re: Documentation for IDE setup

2016-10-14 Thread Christian Schneider

Btw. I finished the documentation now and created a PR:
https://github.com/apache/incubator-beam-site/pull/44

While testing the documentation I also found this issue:
https://github.com/apache/incubator-beam-site/pull/45

Christian

On 14.10.2016 10:17, Christian Schneider wrote:

Hello all,

I am new to the Beam community and am currently making myself 
familiar with the code.  I quickly found the contribution guide and 
was able to clone the code and build Beam using Maven.

The first obstacle I faced was getting the code to build in Eclipse. I 
naively imported it as existing Maven projects but got lots of compile 
errors. After talking to Dan Kulp we found that this is due to the apt 
annotation processing for auto value types. Dan explained to me how I 
need to set up Eclipse to make it work.

I still got 5 compile errors (some bound mismatch at Read.bounded, and 
one ambiguous method empty). These errors seem to be present for 
everyone using Eclipse, and Dan is working on it. So I think this is not 
a permanent problem.

To make it easier for new people I would like to write documentation 
about the IDE setup. I can cover the Eclipse part, but I think IntelliJ 
should also be described.

I already started on it and placed it in /contribute/ide-setup. Does 
that make sense?

I have not linked to it from anywhere yet. I think it should be 
linked in contribute/index and in the Contribute menu.

Christian




--
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com



Re: Documentation for IDE setup

2016-10-14 Thread Lukasz Cwik
I rely on having the Maven Eclipse integration and m2e-apt and do a maven
import of a project.

On Fri, Oct 14, 2016 at 8:10 AM, Jesse Anderson 
wrote:

> I did a "mvn eclipse:eclipse" to generate the Eclipse projects and imported
> them. That didn't compile either.
>
> On Fri, Oct 14, 2016 at 8:06 AM Lukasz Cwik 
> wrote:
>
> > I use Eclipse for development but always defer to maven since it's the
> > source of truth in the end.
> > I also have issues with getting it to compile on import and it has to do
> > with annotation processing and generally requires m2e-apt to be installed
> > and configured correctly.
> >
> > On Fri, Oct 14, 2016 at 7:25 AM, Neelesh Salian 
> > wrote:
> >
> > > I was looking for the same couple of days ago.
> > > But IntelliJ is less worrisome than Eclipse.
> > >
> > > Straight Import. No Hassle.
> > > +1 to docs, though.
> > >
> > > On Fri, Oct 14, 2016 at 7:19 AM, Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > [Troll] Who's using Eclipse anymore ? [/Troll]
> > > >
> > > > ;)
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 10/14/2016 04:06 PM, Jesse Anderson wrote:
> > > >
> > > >> Last week I imported Beam with IntelliJ and everything worked.
> > > >>
> > > >> That said, I tried to import the Eclipse project and that doesn't
> > > compile
> > > >> anymore. I didn't have time to figure out what happened though.
> > > >>
> > > >> On Fri, Oct 14, 2016 at 1:21 AM Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > >> wrote:
> > > >>
> > > >> Hi Christian,
> > > >>>
> > > >>> IntelliJ doesn't need any special config (maybe the code style can
> be
> > > >>> documented or imported).
> > > >>>
> > > >>> Anyway, agree to add such on website in the contribute directory. I
> > > >>> think it could be part of the contribution-guide as it's first
> setup
> > > >>> step.
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 10/14/2016 10:17 AM, Christian Schneider wrote:
> > > >>>
> > >  Hello all,
> > > 
> > >  I am new to the Beam community and am currently making myself
> > >  familiar with the code.  I quickly found the contribution guide and
> > >  was able to clone the code and build Beam using Maven.
> > > 
> > >  The first obstacle I faced was getting the code to build in Eclipse.
> > >  I naively imported it as existing Maven projects but got lots of
> > >  compile errors. After talking to Dan Kulp we found that this is due
> > >  to the apt annotation processing for auto value types. Dan explained
> > >  to me how I need to set up Eclipse to make it work.
> > > 
> > >  I still got 5 compile errors (some bound mismatch at Read.bounded,
> > >  and one ambiguous method empty). These errors seem to be present for
> > >  everyone using Eclipse, and Dan is working on it. So I think this is
> > >  not a permanent problem.
> > > 
> > >  To make it easier for new people I would like to write documentation
> > >  about the IDE setup. I can cover the Eclipse part, but I think
> > >  IntelliJ should also be described.
> > > 
> > >  I already started on it and placed it in /contribute/ide-setup.
> > >  Does that make sense?
> > > 
> > >  I have not linked to it from anywhere yet. I think it should be
> > >  linked in contribute/index and in the Contribute menu.
> > > 
> > >  Christian
> > > 
> > > 
> > > >>> --
> > > >>> Jean-Baptiste Onofré
> > > >>> jbono...@apache.org
> > > >>> http://blog.nanthrax.net
> > > >>> Talend - http://www.talend.com
> > > >>>
> > > >>>
> > > >>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> > >
> > >
> > > --
> > > Neelesh Srinivas Salian
> > > Customer Operations Engineer
> > >
> >
>


Re: [DISCUSS] Introduce DoFnWithStore

2016-10-14 Thread Lukasz Cwik
The only way we have today is to use BoundedReadFromUnboundedSource or use
a side input to bridge an unbounded portion of the pipeline with a bounded
portion of the pipeline.
The model allows the side input bridge between these two portions of the
pipeline to happen but I can't comment as to how well it will work with the
runners we have today.
The bounded portion of the pipeline would need to know some set of windows
it wanted to wait for upfront from the unbounded portion so that the side
input trigger would fire correctly and allow the bounded portion of the
pipeline to be scheduled to execute.

On Fri, Oct 14, 2016 at 7:59 AM, Jean-Baptiste Onofré 
wrote:

> Thanks for the update Lukasz.
>
> How would you implement a "transform" from unbounded PCollection to
> bounded PCollection ?
>
> Even if I use a GroupByKey with something like KV, it
> doesn't change the type of the PCollection.
>
> You are right with State API. My proposal is more a way to implicitly use
> State in DoFn.
>
> Regards
> JB
>
>
> On 10/14/2016 04:51 PM, Lukasz Cwik wrote:
>
>> SplittableDoFn is about taking a single element and turning it into
>> potentially many in a parallel way by allowing an element to be split
>> across bundles.
>>
>> I believe a user could do what you describe by using a GBK to group their
>> data how they want. In your example it would be a single key, then they
>> would have KV for all the values when reading from that
>> GBK. The proposed State API seems to also overlap with what you're trying to
>> achieve.
>>
>>
>>
>> On Fri, Oct 14, 2016 at 5:12 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi guys,
>>>
>>> When testing the different IOs, we want to have the best possible
>>> coverage
>>> and be able to test with different use cases.
>>>
>>> We create integration test pipelines, and, one "classic" use case is to
>>> implement a pipeline starting from an unbounded source (providing an
>>> unbounded PCollection like Kafka, JMS, MQTT, ...) and sending data to a
>>> bounded sink (TextIO for instance) expecting a bounded PCollection.
>>>
>>> This use case is not currently possible. Even when using a Window, it
>>> will
>>> create a chunk of the unbounded PCollection, but the PCollection is still
>>> unbounded.
>>>
>>> That's why I created: https://issues.apache.org/jira/browse/BEAM-638.
>>>
>>> However, I don't think a Window Fn/Trigger is the best approach.
>>>
>>> A possible solution would be to create a specific IO
>>> (BoundedWriteFromUnboundedSourceIO similar to the one we have for Read
>>> ;)) to do that, but I think we should provide a more global way, as this
>>> use case is not specific to IO. For instance, a sorting PTransform will
>>> work only on a bounded PCollection (not an unbounded).
>>>
>>> I wonder if we could not provide a DoFnWithStore. The purpose is to store
>>> unbounded PCollection elements (squared by a Window for instance) into a
>>> pluggable store and read from the store to provide a bounded PCollection.
>>> The store/read trigger could be on the finish bundle.
>>> We could provide "store service", for instance based on GS, HDFS, or any
>>> other storage (Elasticsearch, Cassandra, ...).
>>>
>>> Spark users might be "confused", as in Spark, this behavior is "native"
>>> thanks to the micro-batches. In spark-streaming, basically a DStream is a
>>> bounded collection of RDDs.
>>>
>>> Basically, the DoFnWithStore will look like a DoFn with implicit
>>> store/read from the store. Something like:
>>>
>>> public abstract class DoFnWithStore extends DoFn {
>>>
>>>   @ProcessElement
>>>   @Store(Window)
>>>   
>>>
>>> }
>>>
>>> Generally, SDF sounds like a native way to let users implement this
>>> behavior explicitly.
>>>
>>> My proposal is to do it implicitly and transparently for the end users
>>> (they just have to provide the Window definition and the store service to
>>> use).
>>>
>>> Thoughts ?
>>>
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Introduce DoFnWithStore

2016-10-14 Thread Jean-Baptiste Onofré

Thanks for the update Lukasz.

How would you implement a "transform" from unbounded PCollection to 
bounded PCollection ?


Even if I use a GroupByKey with something like KV, it 
doesn't change the type of the PCollection.


You are right with State API. My proposal is more a way to implicitly 
use State in DoFn.


Regards
JB

On 10/14/2016 04:51 PM, Lukasz Cwik wrote:

SplittableDoFn is about taking a single element and turning it into
potentially many in a parallel way by allowing an element to be split
across bundles.

I believe a user could do what you describe by using a GBK to group their
data how they want. In your example it would be a single key, then they
would have KV for all the values when reading from that
GBK. The proposed State API seems to also overlap with what you're trying to
achieve.



On Fri, Oct 14, 2016 at 5:12 AM, Jean-Baptiste Onofré 
wrote:


Hi guys,

When testing the different IOs, we want to have the best possible coverage
and be able to test with different use cases.

We create integration test pipelines, and, one "classic" use case is to
implement a pipeline starting from an unbounded source (providing an
unbounded PCollection like Kafka, JMS, MQTT, ...) and sending data to a
bounded sink (TextIO for instance) expecting a bounded PCollection.

This use case is not currently possible. Even when using a Window, it will
create a chunk of the unbounded PCollection, but the PCollection is still
unbounded.

That's why I created: https://issues.apache.org/jira/browse/BEAM-638.

However, I don't think a Window Fn/Trigger is the best approach.

A possible solution would be to create a specific IO
(BoundedWriteFromUnboundedSourceIO similar to the one we have for Read
;)) to do that, but I think we should provide a more global way, as this
use case is not specific to IO. For instance, a sorting PTransform will
work only on a bounded PCollection (not an unbounded).

I wonder if we could not provide a DoFnWithStore. The purpose is to store
unbounded PCollection elements (squared by a Window for instance) into a
pluggable store and read from the store to provide a bounded PCollection.
The store/read trigger could be on the finish bundle.
We could provide "store service", for instance based on GS, HDFS, or any
other storage (Elasticsearch, Cassandra, ...).

Spark users might be "confused", as in Spark, this behavior is "native"
thanks to the micro-batches. In spark-streaming, basically a DStream is a
bounded collection of RDDs.

Basically, the DoFnWithStore will look like a DoFn with implicit
store/read from the store. Something like:

public abstract class DoFnWithStore extends DoFn {

  @ProcessElement
  @Store(Window)
  

}

Generally, SDF sounds like a native way to let users implement this
behavior explicitly.

My proposal is to do it implicitly and transparently for the end users
(they just have to provide the Window definition and the store service to
use).

Thoughts ?

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Introduce DoFnWithStore

2016-10-14 Thread Lukasz Cwik
SplittableDoFn is about taking a single element and turning it into
potentially many in a parallel way by allowing an element to be split
across bundles.

I believe a user could do what you describe by using a GBK to group their
data how they want. In your example it would be a single key, then they
would have KV for all the values when reading from that
GBK. The proposed State API seems to also overlap with what you're trying to
achieve.
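As a plain-Java illustration of the single-key GBK idea (ordinary collections standing in for PCollections; this is a conceptual sketch, not the Beam API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy illustration of "group everything under one key": once all values sit
// behind a single key, the downstream step sees them as one finite iterable
// (the analogue of reading KV values from a single-key GBK).
public class SingleKeyGroupDemo {
    public static void main(String[] args) {
        List<String> elements = List.of("a", "b", "c", "d");

        // "GBK" with one constant key: every element maps to the same group.
        Map<String, List<String>> grouped = elements.stream()
            .collect(Collectors.groupingBy(e -> "singleton"));

        // The downstream consumer gets all values for the key at once.
        System.out.println(grouped.get("singleton"));
    }
}
```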



On Fri, Oct 14, 2016 at 5:12 AM, Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> When testing the different IOs, we want to have the best possible coverage
> and be able to test with different use cases.
>
> We create integration test pipelines, and, one "classic" use case is to
> implement a pipeline starting from an unbounded source (providing an
> unbounded PCollection like Kafka, JMS, MQTT, ...) and sending data to a
> bounded sink (TextIO for instance) expecting a bounded PCollection.
>
> This use case is not currently possible. Even when using a Window, it will
> create a chunk of the unbounded PCollection, but the PCollection is still
> unbounded.
>
> That's why I created: https://issues.apache.org/jira/browse/BEAM-638.
>
> However, I don't think a Window Fn/Trigger is the best approach.
>
> A possible solution would be to create a specific IO
> (BoundedWriteFromUnboundedSourceIO similar to the one we have for Read
> ;)) to do that, but I think we should provide a more global way, as this
> use case is not specific to IO. For instance, a sorting PTransform will
> work only on a bounded PCollection (not an unbounded).
>
> I wonder if we could not provide a DoFnWithStore. The purpose is to store
> unbounded PCollection elements (squared by a Window for instance) into a
> pluggable store and read from the store to provide a bounded PCollection.
> The store/read trigger could be on the finish bundle.
> We could provide "store service", for instance based on GS, HDFS, or any
> other storage (Elasticsearch, Cassandra, ...).
>
> Spark users might be "confused", as in Spark this behavior is "native"
> thanks to the micro-batches: in Spark Streaming, a DStream is basically
> an unbounded sequence of bounded RDDs.
>
> Basically, the DoFnWithStore will look like a DoFn with implicit
> store/read from the store. Something like:
>
> public abstract class DoFnWithStore<InputT, OutputT> extends DoFn<InputT, OutputT> {
>
>   @ProcessElement
>   @Store(Window)
>   
>
> }
>
> Generally, SDF sounds like a native way to let users implement this
> behavior explicitly.
>
> My proposal is to do it implicitly and transparently for the end users
> (they just have to provide the Window definition and the store service to
> use).
>
> Thoughts ?
>
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
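To illustrate the "bounded chunk of an unbounded source" semantics discussed in the message above: the sketch below is plain Java, not Beam code (all names are illustrative), showing a consumer that reads from an endless stream but stops after a fixed record budget. This is conceptually what the existing bounded-read-from-unbounded-source path for Read (mentioned above) does when capping by a maximum number of records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Plain-Java sketch (not Beam code) of bounding an unbounded source:
// keep consuming an endless stream of elements, but stop after a fixed
// budget, yielding a bounded collection.
public class BoundedFromUnbounded {
    static <T> List<T> boundByCount(Supplier<T> unbounded, int maxNumRecords) {
        List<T> bounded = new ArrayList<>();
        while (bounded.size() < maxNumRecords) {
            bounded.add(unbounded.get()); // blocks for the next element
        }
        return bounded; // the bounded-PCollection analogue
    }

    public static void main(String[] args) {
        // An "unbounded source": an endless counter.
        int[] next = {0};
        Supplier<Integer> source = () -> next[0]++;
        List<Integer> chunk = boundByCount(source, 3);
        System.out.println(chunk); // prints [0, 1, 2]
    }
}
```

A time budget (stop after N seconds rather than N records) works the same way; either cut-off is what turns the unbounded input into something a bounded sink can consume.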


Re: Documentation for IDE setup

2016-10-14 Thread Jean-Baptiste Onofré

[Troll] Who's using Eclipse anymore? [/Troll]

;)

Regards
JB

On 10/14/2016 04:06 PM, Jesse Anderson wrote:

Last week I imported Beam with IntelliJ and everything worked.

That said, I tried to import the Eclipse project and that doesn't compile
anymore. I didn't have time to figure out what happened though.

On Fri, Oct 14, 2016 at 1:21 AM Jean-Baptiste Onofré 
wrote:


Hi Christian,

IntelliJ doesn't need any special config (maybe the code style can be
documented or imported).

Anyway, I agree to add such a page on the website in the contribute
directory. I think it could be part of the contribution guide, as it's the
first setup step.

Regards
JB

On 10/14/2016 10:17 AM, Christian Schneider wrote:

Hello all,

I am new to the Beam community and am currently making myself
familiar with the code. I quickly found the contribution guide and was
able to clone the code and build Beam using Maven.

The first obstacle I faced was getting the code to build in Eclipse. I
naively imported it as existing Maven projects but got lots of compile
errors. After talking to Dan Kulp we found that this is due to the APT
annotation processing for AutoValue types. Dan explained to me how I need
to set up Eclipse to make it work.

I still got 5 compile errors (some bound mismatches at Read.bounded, and
one ambiguous method empty). These errors seem to be present for
everyone using Eclipse, and Dan is working on it, so I think this is not
a permanent problem.

To make it easier for new people I would like to write documentation
about the IDE setup. I can cover the Eclipse part, but I think IntelliJ
should also be described.

I already started with it and placed it in /contribute/ide-setup. Does
that make sense?

I have not yet linked to it from anywhere. I think it should be
linked in the contribute/index and in the Contribute menu.

Christian



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Documentation for IDE setup

2016-10-14 Thread Jesse Anderson
Last week I imported Beam with IntelliJ and everything worked.

That said, I tried to import the Eclipse project and that doesn't compile
anymore. I didn't have time to figure out what happened though.

On Fri, Oct 14, 2016 at 1:21 AM Jean-Baptiste Onofré 
wrote:

> Hi Christian,
>
> IntelliJ doesn't need any special config (maybe the code style can be
> documented or imported).
>
> Anyway, I agree to add such a page on the website in the contribute
> directory. I think it could be part of the contribution guide, as it's the
> first setup step.
>
> Regards
> JB
>
> On 10/14/2016 10:17 AM, Christian Schneider wrote:
> > Hello all,
> >
> > I am new to the Beam community and am currently making myself
> > familiar with the code. I quickly found the contribution guide and was
> > able to clone the code and build Beam using Maven.
> >
> > The first obstacle I faced was getting the code to build in Eclipse. I
> > naively imported it as existing Maven projects but got lots of compile
> > errors. After talking to Dan Kulp we found that this is due to the APT
> > annotation processing for AutoValue types. Dan explained to me how I need
> > to set up Eclipse to make it work.
> >
> > I still got 5 compile errors (some bound mismatches at Read.bounded, and
> > one ambiguous method empty). These errors seem to be present for
> > everyone using Eclipse, and Dan is working on it, so I think this is not
> > a permanent problem.
> >
> > To make it easier for new people I would like to write documentation
> > about the IDE setup. I can cover the Eclipse part, but I think IntelliJ
> > should also be described.
> >
> > I already started with it and placed it in /contribute/ide-setup. Does
> > that make sense?
> >
> > I have not yet linked to it from anywhere. I think it should be
> > linked in the contribute/index and in the Contribute menu.
> >
> > Christian
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] New Beam website design?

2016-10-14 Thread Jean-Baptiste Onofré

Hi James,

just to let you know that I made good progress on the website mockup.

I should be able to propose a PR very soon.

Thanks for your patience ;)

Regards
JB

On 06/06/2016 05:29 PM, James Malone wrote:

Hello everyone!

The current design of the Apache Beam website[1] is based on a basic
Bootstrap/Jekyll theme. While this made getting an initial site out quickly
pretty easy, the site itself is a little bland (in my opinion :). I propose
we create a new design (layout templates, color schemes, visual design) for
the Beam website.

Since the website is currently using Bootstrap and Jekyll, this should be a
relatively easy process. Getting this done will require a new design and
some CSS/HTML work. Additionally, before a design is put in place, I think
it makes sense to discuss any ideas about a future design first.

So, I think there are two open questions behind this proposal:

1. Is there anyone within the community who would be interested in creating
a design proposal or two and sharing them with the community?
2. Are there any ideas, opinions, and thoughts around what the design of
the site *should* be?

What does everyone think?

Cheers!

James

[1]: http://beam.incubator.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-14 Thread Jean-Baptiste Onofré

Hi guys,

I think we agreed on most of the points. We also agreed that points 4 & 
5 should be a best effort and not "enforced".


If there's no objection, I will create the review mailing list and 
update the github integration configuration.


Thanks all for your comments and feedback!
Regards
JB

On 10/06/2016 01:53 PM, Jean-Baptiste Onofré wrote:

Hi team,

following the discussion we had about technical discussions that should
happen on the mailing list, I would like to propose the following:

1. We create a new mailing list: rev...@beam.incubator.apache.org.
2. We configure the GitHub integration to send all pull request comments
to the review mailing list. This would make it easier to track and read
the comments and to stay up to date.
3. A technical discussion should be sent on the dev mailing list with the
[DISCUSS] keyword in the subject.
4. Once a discussion is open, the author should periodically send an
update on the discussion (once a week) containing a summary of the latest
exchanges that happened on the Jira or GitHub (a quick and direct summary).
5. Once we consider the discussion closed (no update in the last two
weeks), the author sends a [CLOSE] e-mail on the thread.

WDYT ?

Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Documentation for IDE setup

2016-10-14 Thread Christian Schneider

Hello all,

I am new to the Beam community and am currently making myself
familiar with the code. I quickly found the contribution guide and was
able to clone the code and build Beam using Maven.


The first obstacle I faced was getting the code to build in Eclipse. I
naively imported it as existing Maven projects but got lots of compile
errors. After talking to Dan Kulp we found that this is due to the APT
annotation processing for AutoValue types. Dan explained to me how I need
to set up Eclipse to make it work.


I still got 5 compile errors (some bound mismatches at Read.bounded, and
one ambiguous method empty). These errors seem to be present for
everyone using Eclipse, and Dan is working on it, so I think this is not
a permanent problem.


To make it easier for new people I would like to write documentation
about the IDE setup. I can cover the Eclipse part, but I think IntelliJ
should also be described.


I already started with it and placed it in /contribute/ide-setup. Does 
that make sense?


I have not yet linked to it from anywhere. I think it should be
linked in the contribute/index and in the Contribute menu.


Christian

--
Christian Schneider
http://www.liquid-reality.de

Open Source Architect
http://www.talend.com