Re: Move mock classes out of test directory in BeamSQL

2018-10-03 Thread Rui Wang
In #6566, I moved the mock classes to sdk/extensions/sql/meta/provider/test,
which is in BeamSQL's src/main and already has this test directory to hold
the test table provider. I marked every mock class @Experimental as well.

Besides the reason that Andrew mentioned, another reason to move the mock
classes into BeamSQL's src/main is that, usually, if a module needs to use
these mock table classes, it will also need to depend on BeamSQL core directly.

-Rui



On Wed, Oct 3, 2018 at 10:37 AM Rui Wang  wrote:

> Thanks. Looks like at least moving the mocks is an accepted idea. I will come
> up with a moving plan later (either to a separate module or to src/main,
> whichever makes sense) and share it with you.
>
>
> -Rui
>
> On Wed, Oct 3, 2018 at 8:15 AM Andrew Pilloud  wrote:
>
>> The sql module's tests depend on mocks and mocks depend on sql module, so
>> moving this to a separate module creates a weird dependency graph. I don't
>> think it is strictly circular but it comes close. Can we just move the
>> folder from 'src/test' to 'src/main' and mark everything @Experimental?
>>
>> Andrew
>>
>> On Wed, Oct 3, 2018 at 2:28 AM Kai Jiang  wrote:
>>
>>> Big +1.
>>>
>>> Best,
>>> Kai
>>>
>>> On Mon, Oct 1, 2018 at 10:42 PM Jean-Baptiste Onofré 
>>> wrote:
>>>
 +1

 it makes sense.

 Regards
 JB

 On 02/10/2018 01:32, Rui Wang wrote:
 > Hi Community,
 >
 > BeamSQL defines some mock classes (see
 > https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/mock)
 > in a test directory. As there is more than one module under sql now,
 > there is a need to share these mock classes among modules.
 >
 > So I want to move these mock classes to a separate module under sql,
 > so other modules' tests can depend on this mock module.
 >
 >
 > What do you think of this idea?
 >
 >
 > -Rui
 >
 >

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>>


Re: Python 3: final step

2018-10-03 Thread Manu Zhang
Thanks Valentyn. Note that some test-failure issues are covered by “Finish Python 3
porting for *** module”, e.g. https://issues.apache.org/jira/browse/BEAM-5315.

Manu
On 2018-10-03 at 4:18 PM +0800, Valentyn Tymofieiev wrote:
> Hi Rakesh and Manu,
>
> Thanks to both of you for offering help (in different threads). It's great to 
> see that more and more people get involved with helping to make Beam Python 3 
> compatible!
>
> There are a few PRs in flight, and several people in the community are actively
> working on Python 3 support now. I would be happy to coordinate the work so that
> we don't step on each other's toes and can avoid duplication of effort.
>
> I recently looked at the unit tests that are still failing in the Python 3
> environment and filed a few issues (within the range BEAM-5615 - BEAM-5629) to
> track similar classes of errors. You can also find them on the Kanban board [1].
> In particular, BEAM-5620 and BEAM-5627 should be easy issues to get started with.
>
> There are multiple ways you can help:
> - Helping to root-cause errors. Even a comment on why a test is failing and a
> suggestion for how to fix it will be helpful for others when you don't have time
> to do the fix.
> - Helping with code reviews.
> - Reporting new issues (as subtasks of BEAM-1251), and deduplicating or splitting
> the existing issues. We probably don't want to file a Jira for each of the 250+
> currently failing tests at this point, but it may make sense to track the
> errors that occur repeatedly and share a root cause.
> - Fixing the issues. Feel free to assign an issue to yourself if you have a
> fix in mind and plan to actively work on it. Due to the nature of the problem,
> it may occasionally happen that two issues share a root cause, or that fixing one
> issue is a prerequisite for fixing another, so sync to master often to
> make sure the issue you are working on is not already fixed.
>
> I'll also keep an eye on the PRs and will try to keep the list of open issues 
> up to date.
>
> Thanks,
> Valentyn
>
> [1]: 
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail
>
>
> > On Tue, Oct 2, 2018 at 9:38 AM Pablo Estrada  wrote:
> > > Very cool : ) I'm also available to review / merge if you need help from 
> > > my side.
> > > Best
> > > -P.
> > >
> > > > On Tue, Oct 2, 2018 at 7:45 AM Rakesh Kumar  
> > > > wrote:
> > > > > Hi Rob,
> > > > >
> > > > > I am Rakesh Kumar, using the Beam SDK for one of my projects at Lyft. I
> > > > > have been working closely with Thomas Weise, and I have already met a
> > > > > couple of Python SDK developers in person.
> > > > > I am interested in helping with the migration to Python 3. You can assign
> > > > > me PRs to review. I am also more than happy to take a simple ticket to
> > > > > begin development work on Beam.
> > > > >
> > > > > Thank you,
> > > > > Rakesh
> > > > >
> > > > > > On Wed, Sep 5, 2018 at 9:12 AM Robbe Sneyders 
> > > > > >  wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > With the merging of [1], we now have Python 3 tests running on 
> > > > > > > Jenkins, which allows us to move forward with the last step of 
> > > > > > > the Python 3 porting.
> > > > > > >
> > > > > > > You can follow the progress on the Jira Kanban Board [2]. If 
> > > > > > > you're interested in helping by porting a module, you can assign 
> > > > > > > one of the issues to yourself and start coding. You can find the 
> > > > > > > different steps outlined in the design document [3].
> > > > > > >
> > > > > > > We could also use some extra reviewers. If you're interested, let 
> > > > > > > us know, and we'll tag you in our PRs.
> > > > > > >
> > > > > > > [1] https://github.com/apache/beam/pull/6266
> > > > > > > [2] 
> > > > > > > https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245
> > > > > > > [3] https://s.apache.org/beam-python-3
> > > > > > >
> > > > > > > kind regards,
> > > > > > > Robbe
> > > > > > > --
> > > > > > >
> > > > > > >
> > > > > > > Robbe Sneyders
> > > > > > > ML6 Gent
> > > > > > > M: +32 474 71 31 08
> > > > > --
> > > > > Rakesh Kumar
> > > > > Software Engineer
> > > > > 510-761-1364 |
> > > > >


Re: Does anyone have a strong intelliJ setup?

2018-10-03 Thread Thomas Weise
Current content on CWiki is outdated and needs to be replaced.

+1 for moving the instructions there (delete from website)

On Wed, Oct 3, 2018 at 8:04 PM Mikhail Gryzykhin <
gryzykhin.mikh...@gmail.com> wrote:

> @Scott Wegner 
>
> It would be really great if we can get good hints. However, I would suggest
> updating the corresponding page on cwiki, not the website. It will be easier
> to keep that one up-to-date. Some of the tips are already present there.
>
> https://cwiki.apache.org/confluence/display/BEAM/IntelliJ+Tips
>
> Regards,
> --Mikhail
>
> On Wed, Oct 3, 2018 at 4:30 PM Scott Wegner  wrote:
>
>> At ApacheCon I heard from a number of people that the IntelliJ setup
>> isn't as good as it used to be with Maven. Bad tooling makes me sad and I
>> want to make it better  :(
>>
>> It seems everyone has their own magic to get things working. If we got
>> these tips added to the website [1], do you think we'd be in good shape? If
>> not, I'd love to help out. Perhaps we could have a mini-hackathon on
>> improving IntelliJ configuration? Let me know what you think and if you're
>> interested in helping.
>>
>> [1]
>> https://github.com/apache/beam-site/blob/asf-site/src/contribute/intellij.md
>>
>> On Mon, Oct 1, 2018 at 2:32 PM Romain Manni-Bucau 
>> wrote:
>>
>>> Personally I drop all caches - IDEA + Ivy + Maven Beam folders - and build in
>>> the console skipping test execution - important because IDEA is not able to
>>> import the project without a correctly run Gradle setup and a failure can
>>> corrupt later imports - then I kill the Gradle daemon and finally import Beam in
>>> IDEA using the wrapper.
>>>
>>> As has been mentioned, you will have to run tests using the Gradle
>>> wrapper due to the current Gradle setup, which slows down execution a lot
>>> compared to the native IDEA runner, but at least it will run and you can
>>> debug normally.
>>>
>>> Le lun. 1 oct. 2018 22:38, Kenneth Knowles  a écrit :
>>>
 We have some hints in the gradle files that used to allow a smooth
 import with no extra steps*. Have the hints gotten out of date or are there
 new hints we can put in that might help?

 Kenn

 *anyhow at least for a week or two for a couple of people :-)

 On Mon, Oct 1, 2018 at 1:26 PM Ismaël Mejía  wrote:

> Hello Alex,
>
> I understand your pain, and thanks for bringing up this subject. I have also
> found many issues in the process, to the point of recently believing
> that it is nondeterministic.
> The last time I followed the process was ~3 weeks ago. I had to clean up all
> caches (both the IntelliJ temp files and the Gradle cache
> files), and I also had to refresh the project in IntelliJ's Gradle tool
> window at least 2 times after the initial import until it
> finally worked. Also remember that 2018.2 is not supported, as reported
> by Ryan some weeks ago (not sure if already fixed).
>
> Probably there was something corrupted in my setup, but I have heard
> similar stories from at least 2 more people.
> I really don't know how we can improve the current status quo apart from
> contacting the IntelliJ guys, but I am concerned about how this can be an
> issue for new contributors.
>
> On Mon, Oct 1, 2018 at 8:47 PM Rui Wang  wrote:
> >
> > Hi Alex,
> >
> > I had trouble when importing the Java SDK into IntelliJ at the beginning.
> >
> > Besides what the instructions say, some extra steps that might help:
> > 1. Preferences/Settings > Build, Execution, Deployment > Build Tools >
> > Gradle > Runner: choose Gradle Test Runner in the dropdown menu.
> > 2. Enable annotation processing.
> >
> > -Rui
> >
> > On Mon, Oct 1, 2018 at 11:33 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >>
> >> Hi Alex,
> >>
> >> After a git clean -fdx (removing all IDEA resources), I just open
> the
> >> folder in IntelliJ and it imports the project.
> >>
> >> It works fine so far (NB: I don't build using IntelliJ, it's mostly
> an
> >> editor for me, I use the command line for any other stuff like git,
> >> gradle, ...).
> >>
> >> Regards
> >> JB
> >>
> >> On 01/10/2018 20:05, Alex Amato wrote:
> >> > Hello,
> >> >
> >> > I'm looking to get a good intellij setup working and then update
> the
> >> > documentation how to build and test the java SDK with intelliJ.
> >> >
> >> > Does anyone have a good setup working, with some tips? I followed
> our
> >> > instructions here, but I found that after following these steps I
> could
> >> > not build or test the project. It seemed like the build button did
> >> > nothing and the test buttons did not appear.
> >> > https://beam.apache.org/contribute/intellij/
> >> >
> >> > I'm also curious about the gradle support for generating intelliJ
> >> > projects. Has anyone tried this as well?
> >> >
> >> > Any tips would 

Re: Does anyone have a strong intelliJ setup?

2018-10-03 Thread Mikhail Gryzykhin
@Scott Wegner 

It would be really great if we can get good hints. However, I would suggest
updating the corresponding page on cwiki, not the website. It will be easier
to keep that one up-to-date. Some of the tips are already present there.

https://cwiki.apache.org/confluence/display/BEAM/IntelliJ+Tips

Regards,
--Mikhail

On Wed, Oct 3, 2018 at 4:30 PM Scott Wegner  wrote:

> At ApacheCon I heard from a number of people that the IntelliJ setup isn't
> as good as it used to be with Maven. Bad tooling makes me sad and I want to
> make it better  :(
>
> It seems everyone has their own magic to get things working. If we got
> these tips added to the website [1], do you think we'd be in good shape? If
> not, I'd love to help out. Perhaps we could have a mini-hackathon on
> improving IntelliJ configuration? Let me know what you think and if you're
> interested in helping.
>
> [1]
> https://github.com/apache/beam-site/blob/asf-site/src/contribute/intellij.md
>
> On Mon, Oct 1, 2018 at 2:32 PM Romain Manni-Bucau 
> wrote:
>
>> Personally I drop all caches - IDEA + Ivy + Maven Beam folders - and build in
>> the console skipping test execution - important because IDEA is not able to
>> import the project without a correctly run Gradle setup and a failure can
>> corrupt later imports - then I kill the Gradle daemon and finally import Beam in
>> IDEA using the wrapper.
>>
>> As has been mentioned, you will have to run tests using the Gradle wrapper
>> due to the current Gradle setup, which slows down execution a lot compared
>> to the native IDEA runner, but at least it will run and you can debug normally.
>>
>> Le lun. 1 oct. 2018 22:38, Kenneth Knowles  a écrit :
>>
>>> We have some hints in the gradle files that used to allow a smooth
>>> import with no extra steps*. Have the hints gotten out of date or are there
>>> new hints we can put in that might help?
>>>
>>> Kenn
>>>
>>> *anyhow at least for a week or two for a couple of people :-)
>>>
>>> On Mon, Oct 1, 2018 at 1:26 PM Ismaël Mejía  wrote:
>>>
 Hello Alex,

 I understand your pain, and thanks for bringing up this subject. I have also
 found many issues in the process, to the point of recently believing
 that it is nondeterministic.
 The last time I followed the process was ~3 weeks ago. I had to clean up all
 caches (both the IntelliJ temp files and the Gradle cache
 files), and I also had to refresh the project in IntelliJ's Gradle tool
 window at least 2 times after the initial import until it
 finally worked. Also remember that 2018.2 is not supported, as reported
 by Ryan some weeks ago (not sure if already fixed).

 Probably there was something corrupted in my setup, but I have heard
 similar stories from at least 2 more people.
 I really don't know how we can improve the current status quo apart from
 contacting the IntelliJ guys, but I am concerned about how this can be an
 issue for new contributors.

 On Mon, Oct 1, 2018 at 8:47 PM Rui Wang  wrote:
 >
 > Hi Alex,
 >
 > I had trouble when importing the Java SDK into IntelliJ at the beginning.
 >
 > Besides what the instructions say, some extra steps that might help:
 > 1. Preferences/Settings > Build, Execution, Deployment > Build Tools >
 > Gradle > Runner: choose Gradle Test Runner in the dropdown menu.
 > 2. Enable annotation processing.
 >
 > -Rui
 >
 > On Mon, Oct 1, 2018 at 11:33 AM Jean-Baptiste Onofré 
 wrote:
 >>
 >> Hi Alex,
 >>
 >> After a git clean -fdx (removing all IDEA resources), I just open the
 >> folder in IntelliJ and it imports the project.
 >>
 >> It works fine so far (NB: I don't build using IntelliJ, it's mostly
 an
 >> editor for me, I use the command line for any other stuff like git,
 >> gradle, ...).
 >>
 >> Regards
 >> JB
 >>
 >> On 01/10/2018 20:05, Alex Amato wrote:
 >> > Hello,
 >> >
 >> > I'm looking to get a good intellij setup working and then update
 the
 >> > documentation how to build and test the java SDK with intelliJ.
 >> >
 >> > Does anyone have a good setup working, with some tips? I followed
 our
 >> > instructions here, but I found that after following these steps I
 could
 >> > not build or test the project. It seemed like the build button did
 >> > nothing and the test buttons did not appear.
 >> > https://beam.apache.org/contribute/intellij/
 >> >
 >> > I'm also curious about the gradle support for generating intelliJ
 >> > projects. Has anyone tried this as well?
 >> >
 >> > Any tips would be appreciated.
 >> > Thank you,
 >> > Alex
 >>
 >> --
 >> Jean-Baptiste Onofré
 >> jbono...@apache.org
 >> http://blog.nanthrax.net
 >> Talend - http://www.talend.com

>>>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: Is Splittable DoFn suitable for fetch data from a socket server?

2018-10-03 Thread flyisland
Hi Raghu,

> Assuming you need to ack on the same connection that served the records, the
finalize() functionality in the UnboundedSource API is important in your case.
You can use the UnboundedSource API for now.

I have a new question now: where should I keep the connection for the later
ack action?

MqttIO/JmsIO both ack messages in
the UnboundedSource.CheckpointMark.finalizeCheckpoint() method, but I found
that the javadoc says:

>  It is NOT safe to assume the UnboundedSource.UnboundedReader from which
this checkpoint was created still exists at the time this method is called.

I did encounter this situation in my testing with the Direct Runner: the
"msg.ack()" call failed when the finalizeCheckpoint() method was called,
since the related reader had already been closed!

Is there any way to ask the runner to call the finalizeCheckpoint() method
before it closes the Reader?
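
For reference, a minimal sketch of the pattern MqttIO follows (the
CheckpointMark carries the un-acked messages rather than the reader), with
the caveat the javadoc warns about: if Message.ack() depends on the
now-closed reader's connection, it can still fail. Message here is a
stand-in for the actual client type, assumed to own its ack handle:

// A CheckpointMark that holds the messages to ack, not the reader itself.
static class AckingCheckpointMark implements UnboundedSource.CheckpointMark {
  private final List<Message> messagesToAck;

  AckingCheckpointMark(List<Message> messagesToAck) {
    this.messagesToAck = messagesToAck;
  }

  @Override
  public void finalizeCheckpoint() throws IOException {
    for (Message message : messagesToAck) {
      // Only works if the ack handle is not tied to the reader's
      // (possibly already closed) connection.
      message.ack();
    }
  }
}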


On Sat, Sep 22, 2018 at 7:01 AM Raghu Angadi  wrote:

> > This in-house built socket server could accept multiple clients, but
> only send messages to the first-connected client, and will send messages to
> the second client if the first one disconnected.
>
> The server sending messages only to the first client connection is quite critical.
> Even if you use the Source API, which honors the 'Setup()' JavaDoc, it is not enough
> in your case. Note that it says it reuses, but that does not guarantee a
> single DoFn instance or when it actually calls TearDown(). It is on a
> best-effort basis. The work could move to a different worker and the DoFn
> instance on the earlier worker can live for a long time. So if you held the
> connection to the server until TearDown() is called, you could be inadvertently
> blocking reads from the DoFn on the new worker. If you want to keep the
> connection open across bundles, you need some way to close an idle
> connection asynchronously (alternately, your service might have a timeout to
> close an idle client connection, which is much better). Since you can't
> afford to wait till TearDown(), you might as well have a singleton
> connection that gets closed after some idle time.
>
> Assuming you need to ack on the same connection that served the records, the
> finalize() functionality in the UnboundedSource API is important in your case.
> You can use the UnboundedSource API for now.
>
> On Thu, Sep 20, 2018 at 8:25 PM flyisland  wrote:
>
>> Hi Reuven,
>>
>> There is no explicit ID in the message itself, and whether there is
>> information that can be used as an ID depends on the use case.
>>
>> On Fri, Sep 21, 2018 at 11:05 AM Reuven Lax  wrote:
>>
>>> Is there information in the message that can be used as an id, that can
>>> be used for deduplication?
>>>
>>> On Thu, Sep 20, 2018 at 6:36 PM flyisland  wrote:
>>>
 Hi Lukasz,

 With the current API we provided, messages cannot be acked from a
 different client.

 The server will re-send messages to the reconnected client if those
 messages were not acked. So there'll be duplicate messages, but with a
 "redeliver times" property in the header.

 Thanks for your helpful information, I'll check the UnboundedSources,
 thanks!



 On Fri, Sep 21, 2018 at 2:09 AM Lukasz Cwik  wrote:

> Are duplicate messages ok?
>
> Can you ack messages from a different client or are messages sticky to
> a single client (e.g. if one client loses connection, when it reconnects
> can it ack messages it received or are those messages automatically
> replayed)?
>
> UnboundedSources are the only current "source" type that supports
> finalization callbacks[1] that you will need to ack messages and
> deduplication[2]. SplittableDoFn will support both of these features but
> are not there yet.
>
> 1:
> https://github.com/apache/beam/blob/256fcdfe3ab0f6827195a262932e1267c4d4bfba/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L129
> 2:
> https://github.com/apache/beam/blob/256fcdfe3ab0f6827195a262932e1267c4d4bfba/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L93
>
>
> On Wed, Sep 19, 2018 at 8:31 PM flyisland 
> wrote:
>
>> Hi Lukasz,
>>
>> This socket server is like an MQTT server: it has queues inside it
>> and the client should ack each message.
>>
>> > Is receiving and processing each message reliably important or is
>> it ok to drop messages when things fail?
>> A: Reliability is important; no messages should be lost.
>>
>> > Is there a message acknowledgement system or can you request a
>> position within the message stream (e.g. send all messages from position 
>> X
>> when connecting and if for whatever reason you need to reconnect you can
>> say send messages from position X to replay past messages)?
>> A: The client should ack each message it received, and the server will
>> delete the acked message. If the client broke and the server does not
>> receive an ack, the server will re-send the message to 

Can we allow SimpleFunction and SerializableFunction to throw Exception?

2018-10-03 Thread Jeff Klukas
I'm working on https://issues.apache.org/jira/browse/BEAM-5638 to add
exception handling options to single message transforms in the Java SDK.

MapElements' via() method is overloaded to accept either a SimpleFunction,
a SerializableFunction, or a Contextful, all of which are ultimately stored
as a Contextful whose mapping function is expected to have the signature:

OutputT apply(InputT element, Context c) throws Exception;

So Contextful.Fn allows throwing checked exceptions, but neither
SerializableFunction nor SimpleFunction does. The user-provided function has
to satisfy the more restrictive signature:

OutputT apply(InputT input);

Is there background about why we allow arbitrary checked exceptions to be
thrown in one case but not the other two? Could we consider expanding
SerializableFunction and SimpleFunction to the following?:

OutputT apply(InputT input) throws Exception;

This would, for example, simplify the implementation of ParseJsons and
AsJsons, where we have to catch an IOException in MapElements#via only to
rethrow it as a RuntimeException.
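
For illustration, here is roughly what the current workaround looks like
versus what the proposed signature would allow. This is a sketch, not the
actual ParseJsons/AsJsons source, and it assumes a serializable Jackson
ObjectMapper named objectMapper:

// Today: the checked IOException must be caught and wrapped, because
// SerializableFunction#apply declares no checked exceptions.
MapElements.into(TypeDescriptors.strings())
    .via(input -> {
      try {
        return objectMapper.writeValueAsString(input);
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

// With "throws Exception" on apply(), the lambda could shrink to:
MapElements.into(TypeDescriptors.strings())
    .via(input -> objectMapper.writeValueAsString(input));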


Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Jeff Klukas
Jira issues for adding exception handling in Java and Python SDKs:

https://issues.apache.org/jira/browse/BEAM-5638
https://issues.apache.org/jira/browse/BEAM-5639

I'll plan to have a complete PR for the Java SDK put together in the next
few days.
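
For context, here is a rough sketch of how the proposed API (quoted below)
might be consumed. The withSuccessTag/withFailureTag methods and
JsonParsingException are the example names from the proof-of-concept PR,
not existing Beam API:

PCollectionTuple results = input.apply(
    MapElements.into(TypeDescriptors.strings())
        .via(myFunctionThatThrows)
        .withSuccessTag(successTag)
        .withFailureTag(failureTag, JsonParsingException.class));
PCollection<String> successes = results.get(successTag);
PCollection<String> failedInputs = results.get(failureTag);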

On Wed, Oct 3, 2018 at 1:29 PM Jeff Klukas  wrote:

> I don't personally have experience with the Python SDK, so am not
> immediately in a position to comment on how feasible it would be to
> introduce a similar change there. I'll plan to write up two separate issues
> for adding exception handling in the Java and Python SDKs.
>
> On Wed, Oct 3, 2018 at 12:17 PM Thomas Weise  wrote:
>
>> +1 for the proposal as well as the suggestion to offer it in other SDKs,
>> where applicable
>>
>> On Wed, Oct 3, 2018 at 8:58 AM Chamikara Jayalath 
>> wrote:
>>
>>> Sounds like a very good addition. I'd say this can be a single PR since
>>> changes are related. Please open a JIRA for tracking.
>>>
>>> Have you thought about introducing a similar change to the Python SDK?
>>> (doesn't have to be the same PR).
>>>
>>> - Cham
>>>
>>> On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas  wrote:
>>>
 If this looks good for MapElements, I agree that it makes sense to
 extend to FlatMapElements and Filter and to keep the API consistent between
 them.

 Do you have suggestions on how to submit changes with that wider scope?
 Would one PR altering MapElements, FlatMapElements, Filter, ParseJsons, and
 AsJsons be too large to reasonably review? Should I open an overall JIRA
 ticket to track this and break it into smaller PRs?

 On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax  wrote:

> Sounds cool. Why not support this on other transforms as well?
> (FlatMapElements, Filter, etc.)
>
> Reuven
>
> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas 
> wrote:
>
>> I've seen a few Beam users mention the need to handle errors in their
>> transforms by using a try/catch and routing to different outputs based on
>> whether an exception was thrown. This was particularly nicely written up 
>> in
>> a post by Vallery Lancey:
>>
>>
>> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>>
>> I'd love to see this pattern better supported directly in the Beam
>> API, because it currently requires the user to implement a full DoFn even
>> for the simplest cases.
>>
>> I propose we support a MapElements-like transform that allows the
>> user to specify a set of exceptions to catch and route to a failure 
>> output.
>> Something like:
>>
>> MapElements
>> .via(myFunctionThatThrows)
>> .withSuccessTag(successTag)
>> .withFailureTag(failureTag, JsonParsingException.class)
>>
>> which would output a PCollectionTuple with both the successful
>> outcomes of the map operation and also a collection of the inputs that
>> threw JsonParsingException.
>>
>> To make this more concrete, I put together a proof of concept PR:
>> https://github.com/apache/beam/pull/6518  I'd appreciate feedback
>> about whether this seems like a worthwhile addition and a feasible 
>> approach.
>>
>


Re: Metrics Pusher support on Dataflow

2018-10-03 Thread Scott Wegner
Another point that we discussed at ApacheCon: one difference between
Dataflow and other runners is that Dataflow is service-based and doesn't need a
locally executing "driver" program. A local driver context is a good place
to implement MetricsPusher because it is a singleton process.

In fact, DataflowRunner supports PipelineResult.waitUntilFinish() [1],
where we do maintain the local JVM context. Currently in this mode the
runner polls the Dataflow service API for log messages [2]. It would be
very easy to also poll for metric updates and push them out via
MetricsPusher.

[1]
https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L169
[2]
https://github.com/apache/beam/blob/279a05604b83a54e8e5a79e13d8761f94841f326/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L291
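
For illustration, a minimal sketch of such a polling loop using only public
Beam APIs (PipelineResult#metrics, MetricResults#queryMetrics); the
metricsPusher sink is a hypothetical adapter and error handling is elided:

PipelineResult result = pipeline.run();
while (!result.getState().isTerminal()) {
  MetricQueryResults snapshot =
      result.metrics().queryMetrics(MetricsFilter.builder().build());
  metricsPusher.push(snapshot); // hypothetical: push to the configured sink
  Thread.sleep(5_000);          // poll interval; tune as needed
}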

On Wed, Oct 3, 2018 at 4:44 AM Etienne Chauchot 
wrote:

> Hi Scott,
> Thanks for the update.
> Both solutions look good to me, though they both have pluses and minuses. I'll
> let the Googlers choose which is more appropriate:
>
> - DAG modification: less intrusive in Dataflow, but the DAG executed and
> shown in the DAG UI in Dataflow will contain an extra step that the user
> might wonder about.
> - polling thread: it is exactly what I did for the other runners; it is
> more transparent to the user but requires more infra work (adds a container
> that needs to be resilient)
>
> Best
> Etienne
>
> Le vendredi 21 septembre 2018 à 12:46 -0700, Scott Wegner a écrit :
>
> Hi Etienne, sorry for the delay on this. I just got back from leave and
> found this discussion.
>
> We haven't started implementing MetricsPusher in the Dataflow runner,
> mostly because the Dataflow service has its own rich Metrics REST API and
> we haven't heard a need from Dataflow customers to push metrics to an
> external backend. However, it would be nice to have this implemented across
> all runners for feature parity.
>
> I read through the discussion in JIRA [1], and the simplest implementation
> for Dataflow may be to have a single thread periodically poll the Dataflow
> REST API [2] for latest metric values, and push them to a configured sink.
> This polling thread could be hosted in a separate docker container, within
> the worker process, or perhaps a ParDo with timers that gets injected into
> the pipeline during graph translation.
>
> At any rate, I'm not aware of anybody currently working on this. But with
> the Dataflow worker code being donated to Beam [3], soon it will be
> possible for anybody to contribute.
>
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2]
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3]
> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.org%3E
>
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik  wrote:
>
> I forwarded your request to a few people who work on the internal parts of
> Dataflow to see if they could help in some way.
>
> On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot 
> wrote:
>
> Hi all
>
> As we already discussed, it would be good to support Metrics Pusher [1] in
> Dataflow (in other runners also, of course). Today, only Spark and Flink
> support it. It requires a modification in C++ Dataflow code, so only Google
> friends can do it.
>
> Is someone interested in doing it ?
>
> Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
>
> Besides, I wonder if this feature should be added to the capability matrix.
>
> [1]
> https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
>
> Thanks
> Etienne
>
>
>
>

-- 




Got feedback? tinyurl.com/swegner-feedback


Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Ahmet Altay
Great. I will do the cut on 10/10.

Let's start by triaging the open issues targeted for 2.8.0 [1]. If you have
any issues in this list, please resolve them or move them to the next release.
If you are aware of any critical issues, please add them to this list.

Ahmet

[1]
https://issues.apache.org/jira/browse/BEAM-5456?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.8.0%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC

> +1 for the 2.7.0 release schedule. Thanks for volunteering. Do we want a
standing owner for the LTS branch (like the Linux kernel has) or will we
just take volunteers for each LTS release as they arise?

We have not thought about this before. IMO, it is better to keep things
simple and use the same process (i.e. "we just take volunteers for each LTS
release as they arise") for patch releases in the future if/when we happen
to need those.


On Wed, Oct 3, 2018 at 1:21 PM, Thomas Weise  wrote:

> +1
>
> On Wed, Oct 3, 2018 at 12:33 PM Ted Yu  wrote:
>
>> +1
>>
>> On Wed, Oct 3, 2018 at 9:52 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1
>>>
>>> but we have to be fast in the release process. 2.7.0 took more than 1 month
>>> to be cut!
>>>
>>> If no blocker, we have to just move forward.
>>>
>>
+1


>
>>> Regards
>>> JB
>>>
>>> On 03/10/2018 18:25, Ahmet Altay wrote:
>>> > Hi all,
>>> >
>>> > Release cut date for the next release is 10/10 according to Beam
>>> release
>>> > calendar [1]. Since the previous release is already mostly wrapped up
>>> > (modulo blog post), I would like to propose starting the next release
>>> on
>>> > time (10/10).
>>> >
>>> > Additionally I propose designating this release as the first
>>> > long-term-support (LTS) release [2]. This should have no impact on the
>>> > release process, however it would mean that we commit to patch this
>>> > release for the next 12 months for major issues.
>>> >
>>> > I volunteer to perform this release.
>>> >
>>> > What do you think?
>>> >
>>> > Ahmet
>>> >
>>> > [1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>>> > [2] https://beam.apache.org/community/policies/#releases
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: Does anyone have a strong intelliJ setup?

2018-10-03 Thread Scott Wegner
At ApacheCon I heard from a number of people that the IntelliJ setup isn't
as good as it used to be with Maven. Bad tooling makes me sad and I want to
make it better  :(

It seems everyone has their own magic to get things working. If we got
these tips added to the website [1], do you think we'd be in good shape? If
not, I'd love to help out. Perhaps we could have a mini-hackathon on
improving IntelliJ configuration? Let me know what you think and if you're
interested in helping.

[1]
https://github.com/apache/beam-site/blob/asf-site/src/contribute/intellij.md

On Mon, Oct 1, 2018 at 2:32 PM Romain Manni-Bucau 
wrote:

> Personally I drop all caches - IDEA + Ivy + Maven Beam folders - and build in
> the console skipping test execution - important because IDEA is not able to
> import the project without a correctly run Gradle setup and a failure can
> corrupt later imports - then I kill the Gradle daemon and finally import Beam in
> IDEA using the wrapper.
>
> As has been mentioned, you will have to run tests using the Gradle wrapper
> due to the current Gradle setup, which slows down execution a lot compared
> to the native IDEA runner, but at least it will run and you can debug normally.
>
> Le lun. 1 oct. 2018 22:38, Kenneth Knowles  a écrit :
>
>> We have some hints in the gradle files that used to allow a smooth import
>> with no extra steps*. Have the hints gotten out of date or are there new
>> hints we can put in that might help?
>>
>> Kenn
>>
>> *anyhow at least for a week or two for a couple of people :-)
>>
>> On Mon, Oct 1, 2018 at 1:26 PM Ismaël Mejía  wrote:
>>
>>> Hello Alex,
>>>
>>> I understand your pain, and thanks for bringing up this subject. I have also
>>> found many issues in the process, to the point of recently believing
>>> that it is nondeterministic.
>>> The last time I followed the process was ~3 weeks ago. I had to clean up all
>>> caches (both the IntelliJ temp files and the Gradle cache
>>> files), and I also had to refresh the project in IntelliJ's Gradle tool
>>> window at least 2 times after the initial import until it
>>> finally worked. Also remember that 2018.2 is not supported, as reported
>>> by Ryan some weeks ago (not sure if already fixed).
>>>
>>> Probably there was something corrupted in my setup, but I have heard
>>> similar stories from at least 2 more people.
>>> I really don't know how we can improve the current status quo apart from
>>> contacting the IntelliJ guys, but I am concerned about how this can be an
>>> issue for new contributors.
>>>
>>> On Mon, Oct 1, 2018 at 8:47 PM Rui Wang  wrote:
>>> >
>>> > Hi Alex,
>>> >
>>> > I had trouble when importing the Java SDK into IntelliJ at the beginning.
>>> >
>>> > Besides what the instructions say, some extra steps that might help:
>>> > 1. Preferences/Settings > Build, Execution, Deployment > Build Tools >
>>> Gradle > Runner: choose Gradle Test Runner in the dropdown menu.
>>> > 2. Enable annotation processing.
>>> >
>>> > -Rui
>>> >
>>> > On Mon, Oct 1, 2018 at 11:33 AM Jean-Baptiste Onofré 
>>> wrote:
>>> >>
>>> >> Hi Alex,
>>> >>
>>> >> After a git clean -fdx (removing all IDEA resources), I just open the
>>> >> folder in IntelliJ and it imports the project.
>>> >>
>>> >> It works fine so far (NB: I don't build using IntelliJ, it's mostly an
>>> >> editor for me, I use the command line for any other stuff like git,
>>> >> gradle, ...).
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On 01/10/2018 20:05, Alex Amato wrote:
>>> >> > Hello,
>>> >> >
>>> >> > I'm looking to get a good intellij setup working and then update the
>>> >> > documentation how to build and test the java SDK with intelliJ.
>>> >> >
>>> >> > Does anyone have a good setup working, with some tips? I followed
>>> our
>>> >> > instructions here, but I found that after following these steps I
>>> could
>>> >> > not build or test the project. It seemed like the build button did
>>> >> > nothing and the test buttons did not appear.
>>> >> > https://beam.apache.org/contribute/intellij/
>>> >> >
>>> >> > I'm also curious about the gradle support for generating intelliJ
>>> >> > projects. Has anyone tried this as well?
>>> >> >
>>> >> > Any tips would be appreciated.
>>> >> > Thank you,
>>> >> > Alex
>>> >>
>>> >> --
>>> >> Jean-Baptiste Onofré
>>> >> jbono...@apache.org
>>> >> http://blog.nanthrax.net
>>> >> Talend - http://www.talend.com
>>>
>>

-- 




Got feedback? tinyurl.com/swegner-feedback


Re: Add cleanup flag to DockerPayload

2018-10-03 Thread Ankur Goenka
Seems reasonable. A pipeline option should suffice for this purpose.

On Wed, Oct 3, 2018 at 3:02 PM Henning Rohde  wrote:

> IMO it's the runner's responsibility to do container garbage collection
> and disk space management. This flag seems like an implementation-specific
> option that would apply only to some runner/deployment combinations, so it
> doesn't seem to belong in that proto. Dataflow would not be able to honor
> such a flag, for example. Perhaps Flink could start all containers with
> --rm and have a Flink-specific debug option to not do that?
>
> Henning
>
> On Wed, Oct 3, 2018 at 2:19 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> In the portable Flink runner, SDK Harness docker containers are created
>> dynamically and are not garbage collected. SDK Harness containers pull the
>> staged artifacts and generate logs and tmp files, which are stored as an
>> additional layer on top of the image.
>> These dead container layers accumulate over time and make the machine go
>> OOM. To avoid this situation and keep the flexibility of letting containers
>> exist for debugging, I am planning to add a cleanup flag to DockerPayload.
>> When set, this flag will tell the runner to remove the container after
>> killing it.
>> Current DockerPayload:
>>
>> // The payload of a Docker image
>> message DockerPayload {
>>   string container_image = 1;  // implicitly linux_amd64.
>> }
>>
>>
>> Proposed DockerPayload:
>>
>> // The payload of a Docker image
>> message DockerPayload {
>>   string container_image = 1;  // implicitly linux_amd64.
>>   bool cleanup = 2; // Flag to signal container deletion after killing.
>> }
>>
>>
>> Let me know your thoughts and if there is a better way to do this.
>>
>> Thanks,
>> Ankur
>>
>


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-10-03 Thread Boyuan Zhang
Hey all,

We are tracking the Dataflow worker donation process here:
https://issues.apache.org/jira/browse/BEAM-5634

Boyuan Zhang

On Mon, Sep 17, 2018 at 5:05 PM Lukasz Cwik  wrote:

> Thanks all, closing the vote with 18 +1s, 5 of which are binding.
>
> I'll try to get this code out and hopefully won't have any legal issues
> within Google or with the ASF in performing the donation. I will keep the
> community up to date.
>
> On Mon, Sep 17, 2018 at 3:28 PM Ankur Chauhan  wrote:
>
>> +1
>>
>> Sent from my iPhone
>>
>> On Sep 17, 2018, at 15:26, Ankur Goenka  wrote:
>>
>> +1
>>
>> On Sun, Sep 16, 2018 at 3:20 AM Maximilian Michels 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> On 15.09.18 20:07, Reuven Lax wrote:
>>> > +1
>>> >
>>> > On Sat, Sep 15, 2018 at 9:40 AM Rui Wang wrote: +1
>>> > On Sat, Sep 15, 2018 at 12:32 AM Robert Bradshaw wrote: +1 (binding)
>>> > On Sat, Sep 15, 2018 at 6:44 AM Tim wrote: +1
>>> > On 15 Sep 2018, at 01:23, Yifan Zou wrote: +1
>>> > On Fri, Sep 14, 2018 at 4:20 PM David Morávek wrote: +1
>>> > On 15 Sep 2018, at 00:59, Anton Kedin wrote: +1
>>> > On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold wrote: +1
>>> > On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang wrote: +1
>>> > On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde wrote: +1
>>> > On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay wrote: +1 (binding)
>>> > On Fri, Sep 14, 2018 at 2:35 PM Lukasz Cwik wrote: +1 (binding)
>>> > On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada wrote: +1
>>> > On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud wrote: +1
>>> >
>>> > On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik wrote:
>>> > > There was generally positive support and good feedback[1] but it
>>> > > was not unanimous. I wanted to bring the donation of the Dataflow
>>> > > worker code base to Apache Beam master to a vote.
>>> > >
>>> > > +1: Support having the Dataflow worker code as part of Apache Beam
>>> > > master branch
>>> > > -1: Dataflow worker code should live

Re: Add cleanup flag to DockerPayload

2018-10-03 Thread Henning Rohde
IMO it's the runner's responsibility to do container garbage collection and
disk space management. This flag seems like an implementation-specific
option that would apply only to some runner/deployment combinations, so it
doesn't seem to belong in that proto. Dataflow would not be able to honor
such a flag, for example. Perhaps Flink could start all containers with
--rm and have a Flink-specific debug option to not do that?

Henning

On Wed, Oct 3, 2018 at 2:19 PM Ankur Goenka  wrote:

> Hi,
>
> In the portable Flink runner, SDK Harness docker containers are created
> dynamically and are not garbage collected. SDK Harness containers pull the
> staged artifacts and generate logs and tmp files, which are stored as an
> additional layer on top of the image.
> These dead container layers accumulate over time and make the machine go
> OOM. To avoid this situation and keep the flexibility of letting containers
> exist for debugging, I am planning to add a cleanup flag to DockerPayload.
> When set, this flag will tell the runner to remove the container after killing it.
> Current DockerPayload:
>
> // The payload of a Docker image
> message DockerPayload {
>   string container_image = 1;  // implicitly linux_amd64.
> }
>
>
> Proposed DockerPayload:
>
> // The payload of a Docker image
> message DockerPayload {
>   string container_image = 1;  // implicitly linux_amd64.
>   bool cleanup = 2; // Flag to signal container deletion after killing.
> }
>
>
> Let me know your thoughts and if there is a better way to do this.
>
> Thanks,
> Ankur
>


Add cleanup flag to DockerPayload

2018-10-03 Thread Ankur Goenka
Hi,

In the portable Flink runner, SDK Harness docker containers are created
dynamically and are not garbage collected. SDK Harness containers pull the
staged artifacts and generate logs and tmp files, which are stored as an
additional layer on top of the image.
These dead container layers accumulate over time and make the machine go
OOM. To avoid this situation and keep the flexibility of letting containers
exist for debugging, I am planning to add a cleanup flag to DockerPayload.
When set, this flag will tell the runner to remove the container after killing it.
Current DockerPayload:

// The payload of a Docker image
message DockerPayload {
  string container_image = 1;  // implicitly linux_amd64.
}


Proposed DockerPayload:

// The payload of a Docker image
message DockerPayload {
  string container_image = 1;  // implicitly linux_amd64.
  bool cleanup = 2; // Flag to signal container deletion after killing.
}
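
For illustration, a rough sketch of how a runner might honor the proposed
flag when tearing down a container. getCleanup() is the accessor protoc
would generate for the proposed field; runCommand is a hypothetical helper,
not existing Beam code:

if (dockerPayload.getCleanup()) {
  // "docker rm -f" kills the container (if still running) and removes
  // its writable layer, freeing the accumulated disk space.
  runCommand("docker", "rm", "-f", containerId);
}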


Let me know your thoughts and if there is a better way to do this.

Thanks,
Ankur


Re: Why not adding all coders into ModelCoderRegistrar?

2018-10-03 Thread Shen Li
Hi Lukasz,

Is there a way to get the SDK coders (LengthPrefixCoder over the original
component coders, etc.) instead of a LengthPrefixCoder over ByteArrayCoder on
the runner side from RunnerApi.Pipeline? Our runner needs to serialize the key
and use its hash value to keep some per-key state. Now I am getting a
ClassCastException, as the key seen by the runner (an Integer) is not a byte
array.

Thanks,
Shen
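
One possible workaround, sketched under the assumption that keyCoder is
whatever coder the runner materialized for the key (possibly a
LengthPrefixCoder over ByteArrayCoder): hash the key's encoded bytes rather
than the decoded object, which is consistent within the runner and avoids
the cast. CoderUtils is org.apache.beam.sdk.util.CoderUtils:

byte[] keyBytes = CoderUtils.encodeToByteArray(keyCoder, key);
int keyHash = Arrays.hashCode(keyBytes); // stable hash of the encoded key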

On Fri, Sep 28, 2018 at 2:20 PM Shen Li  wrote:

> Thank you, Lukasz!
>
> Best,
> Shen
>
> On Fri, Sep 28, 2018 at 2:11 PM Lukasz Cwik  wrote:
>
>> Runners can never know about every coder that a user may want to write,
>> which is why we need to have a mechanism for Runners to be able to convert
>> any unknown coder to one it can handle. This is done via
>> WireCoders.instantiateRunnerWireCoder, but this modifies the original coder,
>> which is why WireCoders.addSdkWireCoder creates the proto definition that
>> the SDK should be told to use. In your case, you're correct in that KV<K, T>
>> becomes KvCoder<LengthPrefixCoder<ByteArrayCoder>,
>> LengthPrefixCoder<ByteArrayCoder>> on the runner side, and on the SDK side
>> it should be KvCoder<LengthPrefixCoder<key coder>,
>> LengthPrefixCoder<value coder>>. More details in [1].
>>
>> 1:
>> http://doc/1IGduUqmhWDi_69l9nG8kw73HZ5WI5wOps9Tshl5wpQA#heading=h.sh4d5klmtfis
>>
>>
>>
>> On Fri, Sep 28, 2018 at 11:02 AM Shen Li  wrote:
>>
>>> Hi,
>>>
>>> I noticed that ModelCoderRegistrar only includes 9 out of ~40 coders.
>>> May I know the rationale behind this decision?
>>>
>>>
>>> https://github.com/apache/beam/blob/release-2.7.0/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ModelCoderRegistrar.java
>>>
>>> I think one consequence of the above configuration is
>>> that WireCoders.instantiateRunnerWireCoder cannot instantiate KV coders
>>> correctly, where VoidCoder (key coder) becomes
>>> LengthPrefixCoder(ByteArrayCoder). What is the appropriate way to get
>>> KvCoder from RunnerApi.Pipeline?
>>>
>>> Thanks,
>>> Shen
>>>
>>


Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Thomas Weise
+1

On Wed, Oct 3, 2018 at 12:33 PM Ted Yu  wrote:

> +1
>
> On Wed, Oct 3, 2018 at 9:52 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> but we have to be fast in the release process. 2.7.0 took more than 1 month
>> to be cut!
>>
>> If no blocker, we have to just move forward.
>>
>> Regards
>> JB
>>
>> On 03/10/2018 18:25, Ahmet Altay wrote:
>> > Hi all,
>> >
>> > Release cut date for the next release is 10/10 according to Beam release
>> > calendar [1]. Since the previous release is already mostly wrapped up
>> > (modulo blog post), I would like to propose starting the next release on
>> > time (10/10).
>> >
>> > Additionally I propose designating this release as the first
>> > long-term-support (LTS) release [2]. This should have no impact on the
>> > release process, however it would mean that we commit to patch this
>> > release for the next 12 months for major issues.
>> >
>> > I volunteer to perform this release.
>> >
>> > What do you think?
>> >
>> > Ahmet
>> >
>> > [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>> > [2] https://beam.apache.org/community/policies/#releases
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


[ANNOUNCE] Apache Beam 2.7.0 released!

2018-10-03 Thread Charles Chen
The Apache Beam team is pleased to announce the release of version 2.7.0!

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes the following major new features & improvements,
among others:
- New KuduIO, Amazon SNS sink, and Amazon SqsIO.
- Dependencies upgraded to new versions.
- Experimental support for Python on the local Flink runner for simple examples.
- Various bug fixes and minor improvements.

You can take a look at the Release Notes for more details:

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.7.0.
--
Charles Chen, on behalf of The Apache Beam team


Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Andrew Pilloud
+1 for the 2.7.0 release schedule. Thanks for volunteering. Do we want a
standing owner for the LTS branch (like the Linux kernel has) or will we
just take volunteers for each LTS release as they arise?

Andrew

On Wed, Oct 3, 2018 at 12:33 PM Ted Yu  wrote:

> +1
>
> On Wed, Oct 3, 2018 at 9:52 AM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> but we have to be fast in the release process. 2.7.0 took more than 1 month
>> to be cut!
>>
>> If no blocker, we have to just move forward.
>>
>> Regards
>> JB
>>
>> On 03/10/2018 18:25, Ahmet Altay wrote:
>> > Hi all,
>> >
>> > Release cut date for the next release is 10/10 according to Beam release
>> > calendar [1]. Since the previous release is already mostly wrapped up
>> > (modulo blog post), I would like to propose starting the next release on
>> > time (10/10).
>> >
>> > Additionally I propose designating this release as the first
>> > long-term-support (LTS) release [2]. This should have no impact on the
>> > release process, however it would mean that we commit to patch this
>> > release for the next 12 months for major issues.
>> >
>> > I volunteer to perform this release.
>> >
>> > What do you think?
>> >
>> > Ahmet
>> >
>> > [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
>> > [2] https://beam.apache.org/community/policies/#releases
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: Java SDK Extensions

2018-10-03 Thread Ben Chambers
On Wed, Oct 3, 2018 at 12:16 PM Jean-Baptiste Onofré 
wrote:

> Hi Anton,
>
> jackson is the JSON extension, as we have XML. Agreed that it should be
> documented.
>
> Agree about join-library.
>
> sketching provides some statistics extensions with ready-to-use stats
> CombineFns.
>
> Regards
> JB
>
> On 03/10/2018 20:25, Anton Kedin wrote:
> > Hi dev@,
> >
> > *TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
> > understand.
> >
> > *Current State:*
> > I was looking at `sdks/java/extensions`[1] and realized that I don't
> > know what half of those things are. Only `join library` and `sorter`
> > seem to be documented and discoverable on Beam website, under SDKs
> > section [2].
> >
> > Here's the list of all extensions with my questions/comments:
> >   - /google-cloud-platform-core/. What is this? Is this used in GCP IOs?
> > If so, is `extensions` the right place for it? If it is, then why is it
> > a `-core` extension? It feels like it's a utility package, not an
> extension;
> >   - /jackson/. I can guess what it is but we should document it
> somewhere;
> >   - /join-library/. It is documented, but I think we should add more
> > documentation to explain how it works, maybe some caveats, and link
> > to/from the `CoGBK` section of the doc;
>

We should also probably indicate that using the join-library twice to join 3
input collections is less efficient than a single CoGBK over those 3
input collections.
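
For example, a sketch of the single-CoGBK alternative; t1/t2/t3 are assumed
TupleTags and input1..input3 are keyed PCollections (one shuffle instead of
the two that chained pairwise joins would need):

PCollection<KV<K, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(t1, input1)
        .and(t2, input2)
        .and(t3, input3)
        .apply(CoGroupByKey.create());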


> >   - /protobuf/. I can probably guess what it is. Is 'extensions' the
> > right place for it though? We use this library in IOs
> > (`PubsubsIO.readProtos()`), should we move it to IO then? Same as with
> > GCP extension, feels like a utility library, not an extension;
> >   - /sketching/. No idea what to expect from this without reading the
> code;
> >   - /sorter/. Documented on the website;
> >   - /sql/. This looks familiar :) It is documented but not linked from
> > the extensions section, it's unclear whether it's the whole SQL or just
> > some related components;
> >
> > [1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
> > [2]: https://beam.apache.org/documentation/sdks/java-extensions/
> >
> > *Questions:*
> >
> >   - should we minimally document (at least describe) all extensions and
> > add at least short readme.md's with the links to the Beam website?
> >   - is it the right thing to depend on `extensions` in other components
> > like IOs?
> >   - would it make sense to move some things out of 'extensions'? E.g. IO
> > components to IO or utility package, SQL into new DSLs package;
> >
> > *Opinion:*
> > Maybe I am misunderstanding the intent and meaning of 'extensions', but
> > from my perspective:
> >   - I think that extensions should be more or less isolated from the
> > Beam SDK itself, so that if you delete or modify them, no Beam-internal
> > changes will be required (changes to something that's not an extension).
> > And my feeling is that they should provide value by themselves to users
> > other than SDK authors. They are called 'extensions', not 'critical
> > components' or 'sdk utilities';
> >
> >   - I don't think that IOs should depend on 'extensions'. Otherwise the
> > question is, is it ok for other components, like runners, to do the
> > same? Or even core?
> >
> >   - I think there are few distinguishable classes of things in
> > 'extensions' right now:
> >   - collections of `PTransforms` with some business logic (Sorter,
> > Join, Sketch);
> >   - collections of `PTransforms` with a focus on parsing (Jackson,
> Protobuf);
> >   - DSLs; SQL DSL with more than just a few `PTransforms`, it can be
> > used almost as a standalone SDK. Things like Euphoria will probably end
> > up in the same class;
> >   - utility libraries shared by some parts of the SDK and unclear if
> > they are valuable by themselves to external users (Protobuf, GCP core);
> > To me the business logic and parsing libraries do make sense to stay
> > in extensions, but probably under different subdirectories. I think it
> > will make sense to split others out of extensions into separate parts of
> > the SDK.
> >
> >   - I think we should add readme.md's with short descriptions and links
> > to Beam website;
> >
> > Thoughts, comments?
> >
> >
> > [1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
> > [2]: https://beam.apache.org/documentation/sdks/java-extensions/
>


Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Ted Yu
+1

On Wed, Oct 3, 2018 at 9:52 AM Jean-Baptiste Onofré  wrote:

> +1
>
> but we have to be fast in the release process. 2.7.0 took more than 1 month
> to be cut!
>
> If no blocker, we have to just move forward.
>
> Regards
> JB
>
> On 03/10/2018 18:25, Ahmet Altay wrote:
> > Hi all,
> >
> > Release cut date for the next release is 10/10 according to Beam release
> > calendar [1]. Since the previous release is already mostly wrapped up
> > (modulo blog post), I would like to propose starting the next release on
> > time (10/10).
> >
> > Additionally I propose designating this release as the first
> > long-term-support (LTS) release [2]. This should have no impact on the
> > release process, however it would mean that we commit to patch this
> > release for the next 12 months for major issues.
> >
> > I volunteer to perform this release.
> >
> > What do you think?
> >
> > Ahmet
> >
> > [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
> > [2] https://beam.apache.org/community/policies/#releases
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Java SDK Extensions

2018-10-03 Thread Jean-Baptiste Onofré

Hi Anton,

jackson is the JSON extension, as we have XML. Agreed that it should be
documented.


Agree about join-library.

sketching provides some statistics extensions with ready-to-use stats
CombineFns.


Regards
JB

On 03/10/2018 20:25, Anton Kedin wrote:

Hi dev@,

*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and 
understand.


*Current State:*
I was looking at `sdks/java/extensions`[1] and realized that I don't 
know what half of those things are. Only `join library` and `sorter` 
seem to be documented and discoverable on Beam website, under SDKs 
section [2].


Here's the list of all extensions with my questions/comments:
  - /google-cloud-platform-core/. What is this? Is this used in GCP IOs? 
If so, is `extensions` the right place for it? If it is, then why is it 
a `-core` extension? It feels like it's a utility package, not an extension;

  - /jackson/. I can guess what it is but we should document it somewhere;
  - /join-library/. It is documented, but I think we should add more 
documentation to explain how it works, maybe some caveats, and link 
to/from the `CoGBK` section of the doc;
  - /protobuf/. I can probably guess what it is. Is 'extensions' the
right place for it though? We use this library in IOs 
(`PubsubsIO.readProtos()`), should we move it to IO then? Same as with 
GCP extension, feels like a utility library, not an extension;

  - /sketching/. No idea what to expect from this without reading the code;
  - /sorter/. Documented on the website;
  - /sql/. This looks familiar :) It is documented but not linked from 
the extensions section, it's unclear whether it's the whole SQL or just 
some related components;


[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/

*Questions:*

  - should we minimally document (at least describe) all extensions and 
add at least short readme.md's with the links to the Beam website?
  - is it the right thing to depend on `extensions` in other components
like IOs?
  - would it make sense to move some things out of 'extensions'? E.g. IO 
components to IO or utility package, SQL into new DSLs package;


*Opinion:*
Maybe I am misunderstanding the intent and meaning of 'extensions', but 
from my perspective:

  - I think that extensions should be more or less isolated from the 
Beam SDK itself, so that if you delete or modify them, no Beam-internal 
changes will be required (changes to something that's not an extension). 
And my feeling is that they should provide value by themselves to users 
other than SDK authors. They are called 'extensions', not 'critical 
components' or 'sdk utilities';


  - I don't think that IOs should depend on 'extensions'. Otherwise the 
question is, is it ok for other components, like runners, to do the 
same? Or even core?


  - I think there are a few distinguishable classes of things in 
'extensions' right now:
      - collections of `PTransforms` with some business logic (Sorter, 
Join, Sketch);

      - collections of `PTransforms` with a focus on parsing (Jackson, Protobuf);
      - DSLs; SQL DSL with more than just a few `PTransforms`, it can be 
used almost as a standalone SDK. Things like Euphoria will probably end 
up in the same class;
      - utility libraries shared by some parts of the SDK where it's unclear 
whether they are valuable by themselves to external users (Protobuf, GCP core);
    To me the business logic and parsing libraries do make sense to stay 
in extensions, but probably under different subdirectories. I think it 
will make sense to split others out of extensions into separate parts of 
the SDK.


  - I think we should add readme.md's with short descriptions and links 
to Beam website;


Thoughts, comments?


[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/


Java SDK Extensions

2018-10-03 Thread Anton Kedin
Hi dev@,

*TL;DR:* `sdks/java/extensions` is hard to discover, navigate and
understand.

*Current State:*

I was looking at `sdks/java/extensions`[1] and realized that I don't know
what half of those things are. Only `join library` and `sorter` seem to be
documented and discoverable on Beam website, under SDKs section [2].

Here's the list of all extensions with my questions/comments:
 - *google-cloud-platform-core*. What is this? Is this used in GCP IOs? If
so, is `extensions` the right place for it? If it is, then why is it a
`-core` extension? It feels like it's a utility package, not an extension;
 - *jackson*. I can guess what it is but we should document it somewhere;
 - *join-library*. It is documented, but I think we should add more
documentation to explain how it works, maybe some caveats, and link to/from
the `CoGBK` section of the doc;
 - *protobuf*. I can probably guess what it is. Is 'extensions' the right
place for it though? We use this library in IOs (`PubsubIO.readProtos()`),
should we move it to IO then? Same as with GCP extension, feels like a
utility library, not an extension;
 - *sketching*. No idea what to expect from this without reading the code;
 - *sorter*. Documented on the website;
 - *sql*. This looks familiar :) It is documented but not linked from the
extensions section, it's unclear whether it's the whole SQL or just some
related components;

[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/

*Questions:*

 - should we minimally document (at least describe) all extensions and add
at least short readme.md's with the links to the Beam website?
 - is it the right thing to depend on `extensions` in other components like
IOs?
 - would it make sense to move some things out of 'extensions'? E.g. IO
components to IO or utility package, SQL into new DSLs package;

*Opinion:*

Maybe I am misunderstanding the intent and meaning of 'extensions', but
from my perspective:

 - I think that extensions should be more or less isolated from the Beam
SDK itself, so that if you delete or modify them, no Beam-internal changes
will be required (changes to something that's not an extension). And my
feeling is that they should provide value by themselves to users other than
SDK authors. They are called 'extensions', not 'critical components' or
'sdk utilities';

 - I don't think that IOs should depend on 'extensions'. Otherwise the
question is, is it ok for other components, like runners, to do the same?
Or even core?

 - I think there are a few distinguishable classes of things in 'extensions'
right now:
 - collections of `PTransforms` with some business logic (Sorter, Join,
Sketch);
 - collections of `PTransforms` with a focus on parsing (Jackson, Protobuf);
 - DSLs; SQL DSL with more than just a few `PTransforms`, it can be
used almost as a standalone SDK. Things like Euphoria will probably end up
in the same class;
 - utility libraries shared by some parts of the SDK where it's unclear
whether they are valuable by themselves to external users (Protobuf, GCP core);
   To me the business logic and parsing libraries do make sense to stay in
extensions, but probably under different subdirectories. I think it will
make sense to split others out of extensions into separate parts of the
SDK.

 - I think we should add readme.md's with short descriptions and links to
Beam website;

Thoughts, comments?


[1]: https://github.com/apache/beam/tree/master/sdks/java/extensions
[2]: https://beam.apache.org/documentation/sdks/java-extensions/
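
For instance, even a two-line snippet per extension would go a long way. A
hedged sketch of what the jackson extension appears to offer (ParseJsons and
AsJsons from org.apache.beam.sdk.extensions.jackson; `MyPojo` is a
hypothetical Jackson-compatible class, so treat this as illustrative):

  // Illustrative only: MyPojo is a hypothetical Jackson-deserializable POJO.
  PCollection<MyPojo> pojos =
      jsonStrings
          .apply(ParseJsons.of(MyPojo.class))
          .setCoder(SerializableCoder.of(MyPojo.class));
  // And back to JSON strings:
  PCollection<String> roundTripped = pojos.apply(AsJsons.of(MyPojo.class));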


Re: Move mock classes out of test directory in BeamSQL

2018-10-03 Thread Rui Wang
Thanks. Looks like at least moving the mocks is an accepted idea. I will come
up with a moving plan later (either to a separate module or to src/main,
whichever makes sense) and share it with you.


-Rui

On Wed, Oct 3, 2018 at 8:15 AM Andrew Pilloud  wrote:

> The sql module's tests depend on mocks and mocks depend on sql module, so
> moving this to a separate module creates a weird dependency graph. I don't
> think it is strictly circular but it comes close. Can we just move the
> folder from 'src/test' to 'src/main' and mark everything @Experimental?
>
> Andrew
>
> On Wed, Oct 3, 2018 at 2:28 AM Kai Jiang  wrote:
>
>> Big +1.
>>
>> Best,
>> Kai
>>
>> On Mon, Oct 1, 2018 at 10:42 PM Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1
>>>
>>> it makes sense.
>>>
>>> Regards
>>> JB
>>>
>>> On 02/10/2018 01:32, Rui Wang wrote:
>>> > Hi Community,
>>> >
>>> > BeamSQL defines some mock classes (see mock
>>> > <
>>> https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/mock
>>> >)
>>> > in a test directory. As there is more than one module under sql
>>> > >> > now,
>>> > there is a need to share these mock classes among modules.
>>> >
>>> > So I want to move these mock classes to a separate module under sql
>>> > ,
>>> > so other modules' tests can depend on this mock module.
>>> >
>>> >
>>> > What do you think of this idea?
>>> >
>>> >
>>> > -Rui
>>> >
>>> >
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Jeff Klukas
I don't personally have experience with the Python SDK, so am not
immediately in a position to comment on how feasible it would be to
introduce a similar change there. I'll plan to write up two separate issues
for adding exception handling in the Java and Python SDKs.

On Wed, Oct 3, 2018 at 12:17 PM Thomas Weise  wrote:

> +1 for the proposal as well as the suggestion to offer it in other SDKs,
> where applicable
>
> On Wed, Oct 3, 2018 at 8:58 AM Chamikara Jayalath 
> wrote:
>
>> Sounds like a very good addition. I'd say this can be a single PR since
>> changes are related. Please open a JIRA for tracking.
>>
>> Have you thought about introducing a similar change to the Python SDK?
>> (doesn't have to be the same PR).
>>
>> - Cham
>>
>> On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas  wrote:
>>
>>> If this looks good for MapElements, I agree that it makes sense to
>>> extend to FlatMapElements and Filter and to keep the API consistent between
>>> them.
>>>
>>> Do you have suggestions on how to submit changes with that wider scope?
>>> Would one PR altering MapElements, FlatMapElements, Filter, ParseJsons, and
>>> AsJsons be too large to reasonably review? Should I open an overall JIRA
>>> ticket to track and break this into smaller PRs?
>>>
>>> On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax  wrote:
>>>
 Sounds cool. Why not support this on other transforms as well?
 (FlatMapElements, Filter, etc.)

 Reuven

 On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas  wrote:

> I've seen a few Beam users mention the need to handle errors in their
> transforms by using a try/catch and routing to different outputs based on
> whether an exception was thrown. This was particularly nicely written up 
> in
> a post by Vallery Lancey:
>
>
> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>
> I'd love to see this pattern better supported directly in the Beam
> API, because it currently requires the user to implement a full DoFn even
> for the simplest cases.
>
> I propose we support a MapElements-like transform that allows the
> user to specify a set of exceptions to catch and route to a failure 
> output.
> Something like:
>
> MapElements
> .via(myFunctionThatThrows)
> .withSuccessTag(successTag)
> .withFailureTag(failureTag, JsonParsingException.class)
>
> which would output a PCollectionTuple with both the successful
> outcomes of the map operation and also a collection of the inputs that
> threw JsonParsingException.
>
> To make this more concrete, I put together a proof of concept PR:
> https://github.com/apache/beam/pull/6518  I'd appreciate feedback
> about whether this seems like a worthwhile addition and a feasible 
> approach.
>
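
For readers new to the pattern Vallery describes, this is the hand-written
DoFn the proposal aims to replace. A minimal sketch (the tags, the parse()
helper, and JsonParsingException are illustrative assumptions, not existing
Beam APIs):

  final TupleTag<MyPojo> successTag = new TupleTag<MyPojo>() {};
  final TupleTag<String> failureTag = new TupleTag<String>() {};

  PCollectionTuple results =
      input.apply(
          ParDo.of(
                  new DoFn<String, MyPojo>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                      try {
                        // parse() is a hypothetical function that may throw.
                        c.output(parse(c.element()));
                      } catch (JsonParsingException e) {
                        // Route the failing input to the failure output.
                        c.output(failureTag, c.element());
                      }
                    }
                  })
              .withOutputTags(successTag, TupleTagList.of(failureTag)));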



Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Jean-Baptiste Onofré
+1

but we have to be fast in the release process. 2.7.0 took more than 1 month
to be cut!

If no blocker, we have to just move forward.

Regards
JB

On 03/10/2018 18:25, Ahmet Altay wrote:
> Hi all,
> 
> Release cut date for the next release is 10/10 according to Beam release
> calendar [1]. Since the previous release is already mostly wrapped up
> (modulo blog post), I would like to propose starting the next release on
> time (10/10).
> 
> Additionally I propose designating this release as the first
> long-term-support (LTS) release [2]. This should have no impact on the
> release process, however it would mean that we commit to patch this
> release for the next 12 months for major issues.
> 
> I volunteer to perform this release.
> 
> What do you think?
> 
> Ahmet 
> 
> [1] 
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
> [2] https://beam.apache.org/community/policies/#releases

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[PROPOSAL] Prepare Beam 2.8.0 release

2018-10-03 Thread Ahmet Altay
Hi all,

Release cut date for the next release is 10/10 according to Beam release
calendar [1]. Since the previous release is already mostly wrapped up
(modulo blog post), I would like to propose starting the next release on
time (10/10).

Additionally I propose designating this release as the first
long-term-support (LTS) release [2]. This should have no impact on the
release process, however it would mean that we commit to patch this release
for the next 12 months for major issues.

I volunteer to perform this release.

What do you think?

Ahmet

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
[2] https://beam.apache.org/community/policies/#releases


Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Thomas Weise
+1 for the proposal as well as the suggestion to offer it in other SDKs,
where applicable

On Wed, Oct 3, 2018 at 8:58 AM Chamikara Jayalath 
wrote:

> Sounds like a very good addition. I'd say this can be a single PR since
> changes are related. Please open a JIRA for tracking.
>
> Have you thought about introducing a similar change to the Python SDK?
> (doesn't have to be the same PR).
>
> - Cham
>
> On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas  wrote:
>
>> If this looks good for MapElements, I agree that it makes sense to extend
>> to FlatMapElements and Filter and to keep the API consistent between them.
>>
>> Do you have suggestions on how to submit changes with that wider scope?
>> Would one PR altering MapElements, FlatMapElements, Filter, ParseJsons, and
>> AsJsons be too large to reasonably review? Should I open an overall JIRA
>> ticket to track and break this into smaller PRs?
>>
>> On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax  wrote:
>>
>>> Sounds cool. Why not support this on other transforms as well?
>>> (FlatMapElements, Filter, etc.)
>>>
>>> Reuven
>>>
>>> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas  wrote:
>>>
 I've seen a few Beam users mention the need to handle errors in their
 transforms by using a try/catch and routing to different outputs based on
 whether an exception was thrown. This was particularly nicely written up in
 a post by Vallery Lancey:


 https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a

 I'd love to see this pattern better supported directly in the Beam API,
 because it currently requires the user to implement a full DoFn even for
 the simplest cases.

 I propose we support a MapElements-like transform that allows the
 user to specify a set of exceptions to catch and route to a failure output.
 Something like:

 MapElements
 .via(myFunctionThatThrows)
 .withSuccessTag(successTag)
 .withFailureTag(failureTag, JsonParsingException.class)

 which would output a PCollectionTuple with both the successful outcomes
 of the map operation and also a collection of the inputs that threw
 JsonParsingException.

 To make this more concrete, I put together a proof of concept PR:
 https://github.com/apache/beam/pull/6518  I'd appreciate feedback
 about whether this seems like a worthwhile addition and a feasible 
 approach.

>>>


Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Chamikara Jayalath
Sounds like a very good addition. I'd say this can be a single PR since
changes are related. Please open a JIRA for tracking.

Have you thought about introducing a similar change to the Python SDK? (doesn't
have to be the same PR).

- Cham

On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas  wrote:

> If this looks good for MapElements, I agree that it makes sense to extend
> to FlatMapElements and Filter and to keep the API consistent between them.
>
> Do you have suggestions on how to submit changes with that wider scope?
> Would one PR altering MapElements, FlatMapElements, Filter, ParseJsons, and
> AsJsons be too large to reasonably review? Should I open an overall JIRA
> ticket to track and break this into smaller PRs?
>
> On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax  wrote:
>
>> Sounds cool. Why not support this on other transforms as well?
>> (FlatMapElements, Filter, etc.)
>>
>> Reuven
>>
>> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas  wrote:
>>
>>> I've seen a few Beam users mention the need to handle errors in their
>>> transforms by using a try/catch and routing to different outputs based on
>>> whether an exception was thrown. This was particularly nicely written up in
>>> a post by Vallery Lancey:
>>>
>>>
>>> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>>>
>>> I'd love to see this pattern better supported directly in the Beam API,
>>> because it currently requires the user to implement a full DoFn even for
>>> the simplest cases.
>>>
>>> I propose we support a MapElements-like transform that allows the
>>> user to specify a set of exceptions to catch and route to a failure output.
>>> Something like:
>>>
>>> MapElements
>>> .via(myFunctionThatThrows)
>>> .withSuccessTag(successTag)
>>> .withFailureTag(failureTag, JsonParsingException.class)
>>>
>>> which would output a PCollectionTuple with both the successful outcomes
>>> of the map operation and also a collection of the inputs that threw
>>> JsonParsingException.
>>>
>>> To make this more concrete, I put together a proof of concept PR:
>>> https://github.com/apache/beam/pull/6518  I'd appreciate feedback about
>>> whether this seems like a worthwhile addition and a feasible approach.
>>>
>>


Re: [apachecon 2018] Universal metrics with apache beam

2018-10-03 Thread Chamikara Jayalath
It was a very interesting talk indeed and was very well delivered. Thanks
Etienne.

- Cham

On Wed, Oct 3, 2018 at 8:16 AM Ted Yu  wrote:

> Very interesting talk, Etienne.
>
> Looking forward to the audio recording.
>
> Cheers
>


Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Jeff Klukas
If this looks good for MapElements, I agree that it makes sense to extend
to FlatMapElements and Filter and to keep the API consistent between them.

Do you have suggestions on how to submit changes with that wider scope?
Would one PR altering MapElements, FlatMapElements, Filter, ParseJsons, and
AsJsons be too large to reasonably review? Should I open an overall JIRA
ticket to track and break this into smaller PRs?

On Wed, Oct 3, 2018 at 10:31 AM Reuven Lax  wrote:

> Sounds cool. Why not support this on other transforms as well?
> (FlatMapElements, Filter, etc.)
>
> Reuven
>
> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas  wrote:
>
>> I've seen a few Beam users mention the need to handle errors in their
>> transforms by using a try/catch and routing to different outputs based on
>> whether an exception was thrown. This was particularly nicely written up in
>> a post by Vallery Lancey:
>>
>>
>> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>>
>> I'd love to see this pattern better supported directly in the Beam API,
>> because it currently requires the user to implement a full DoFn even for
>> the simplest cases.
>>
>> I propose we support a MapElements-like transform that allows the
>> user to specify a set of exceptions to catch and route to a failure output.
>> Something like:
>>
>> MapElements
>> .via(myFunctionThatThrows)
>> .withSuccessTag(successTag)
>> .withFailureTag(failureTag, JsonParsingException.class)
>>
>> which would output a PCollectionTuple with both the successful outcomes
>> of the map operation and also a collection of the inputs that threw
>> JsonParsingException.
>>
>> To make this more concrete, I put together a proof of concept PR:
>> https://github.com/apache/beam/pull/6518  I'd appreciate feedback about
>> whether this seems like a worthwhile addition and a feasible approach.
>>
>


Re: [apachecon 2018] Universal metrics with apache beam

2018-10-03 Thread Ted Yu
Very interesting talk, Etienne.

Looking forward to the audio recording.

Cheers


Re: Move mock classes out of test directory in BeamSQL

2018-10-03 Thread Andrew Pilloud
The sql module's tests depend on mocks and mocks depend on sql module, so
moving this to a separate module creates a weird dependency graph. I don't
think it is strictly circular but it comes close. Can we just move the
folder from 'src/test' to 'src/main' and mark everything @Experimental?

Andrew

On Wed, Oct 3, 2018 at 2:28 AM Kai Jiang  wrote:

> Big +1.
>
> Best,
> Kai
>
> On Mon, Oct 1, 2018 at 10:42 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> it makes sense.
>>
>> Regards
>> JB
>>
>> On 02/10/2018 01:32, Rui Wang wrote:
>> > Hi Community,
>> >
>> > BeamSQL defines some mock classes (see mock
>> > <
>> https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/mock
>> >)
>> > in a test directory. As there is more than one module under sql
>> > > > now,
>> > there is a need to share these mock classes among modules.
>> >
>> > So I want to move these mock classes to a separate module under sql
>> > ,
>> > so other modules' tests can depend on this mock module.
>> >
>> >
>> > What do you think of this idea?
>> >
>> >
>> > -Rui
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
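
For concreteness, a hedged sketch of what the suggestion could look like; the
class name is illustrative, not one of the actual BeamSQL mock classes:

  import org.apache.beam.sdk.annotations.Experimental;

  /** A mock table kept under src/main so other modules' tests can use it. */
  @Experimental
  public class MockedBoundedTable {
    // Test-only helpers would live here; @Experimental signals that this
    // API may change without notice even though it ships in src/main.
  }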


[apachecon 2018] Universal metrics with apache beam

2018-10-03 Thread Etienne Chauchot
Hi everyone,
At the Apachecon 2018 on Sept 26th, I did a talk on "universal metrics with 
apache beam". This talk describes the
metrics system in Beam, its integration with the runners and how the metrics 
can be everywhere: extracted to external
sinks independent from the chosen runner and also how metrics can be portable 
part of the WIP with the portability
framework.

If you are interested, here are the slides
 https://s.apache.org/universal-metrics

Unfortunately the session was not video recorded (only keynotes are) but it was 
supposed to be audio recorded. I cannot
find the link to the recording right now, I'll update you when it is uploaded.

Best
Etienne



Re: [Proposal] Add exception handling option to MapElements

2018-10-03 Thread Reuven Lax
Sounds cool. Why not support this on other transforms as well?
(FlatMapElements, Filter, etc.)

Reuven

On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas  wrote:

> I've seen a few Beam users mention the need to handle errors in their
> transforms by using a try/catch and routing to different outputs based on
> whether an exception was thrown. This was particularly nicely written up in
> a post by Vallery Lancey:
>
>
> https://medium.com/@vallerylancey/error-handling-elements-in-apache-beam-pipelines-fffdea91af2a
>
> I'd love to see this pattern better supported directly in the Beam API,
> because it currently requires the user to implement a full DoFn even for
> the simplest cases.
>
> I propose we support a MapElements-like transform that allows the user
> to specify a set of exceptions to catch and route to a failure output.
> Something like:
>
> MapElements
> .via(myFunctionThatThrows)
> .withSuccessTag(successTag)
> .withFailureTag(failureTag, JsonParsingException.class)
>
> which would output a PCollectionTuple with both the successful outcomes of
> the map operation and also a collection of the inputs that threw
> JsonParsingException.
>
> To make this more concrete, I put together a proof of concept PR:
> https://github.com/apache/beam/pull/6518  I'd appreciate feedback about
> whether this seems like a worthwhile addition and a feasible approach.
>
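
For completeness, a sketch of how a pipeline might consume both outputs under
the API shape proposed above (names follow the proof-of-concept PR but are
still up for discussion; downstreamTransform and deadLetterSink are
hypothetical):

  final TupleTag<MyPojo> successTag = new TupleTag<MyPojo>() {};
  final TupleTag<String> failureTag = new TupleTag<String>() {};

  PCollectionTuple results =
      input.apply(
          MapElements
              .via(myFunctionThatThrows) // may throw JsonParsingException
              .withSuccessTag(successTag)
              .withFailureTag(failureTag, JsonParsingException.class));

  results.get(successTag).apply("ContinuePipeline", downstreamTransform);
  results.get(failureTag).apply("WriteDeadLetters", deadLetterSink);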


Re: Metrics Pusher support on Dataflow

2018-10-03 Thread Etienne Chauchot
Hi Scott,

Thanks for the update. Both solutions look good to me, though they both have
pluses and minuses. I'll let the Googlers choose which is more appropriate:

- DAG modification: less intrusive in Dataflow, but the DAG executed and shown
in the Dataflow DAG UI will contain an extra step that the user might wonder
about.
- Polling thread: it is exactly what I did for the other runners and is more
transparent to the user, but it requires more infra work (adds a container
that needs to be resilient).

Best
Etienne
On Friday, September 21, 2018 at 12:46 -0700, Scott Wegner wrote:
> Hi Etienne, sorry for the delay on this. I just got back from leave and found 
> this discussion.
> We haven't started implementing MetricsPusher in the Dataflow runner, mostly 
> because the Dataflow service has its own
> rich Metrics REST API and we haven't heard a need from Dataflow customers to 
> push metrics to an external backend.
> However, it would be nice to have this implemented across all runners for 
> feature parity.
> 
> I read through the discussion in JIRA [1], and the simplest implementation 
> for Dataflow may be to have a single thread
> periodically poll the Dataflow REST API [2] for latest metric values, and 
> push them to a configured sink. This polling
> thread could be hosted in a separate docker container, within the worker 
> process, or perhaps a ParDo with timers that
> gets injected into the pipeline during graph translation.
> 
> At any rate, I'm not aware of anybody currently working on this. But with the 
> Dataflow worker code being donated to
> Beam [3], soon it will be possible for anybody to contribute.
> 
> [1] https://issues.apache.org/jira/browse/BEAM-3926
> [2] 
> https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs/getMetrics
> [3] 
> https://lists.apache.org/thread.html/2bdc645659e2fbd7e29f3a2758941faefedb01148a2a11558dfe60f8@%3Cdev.beam.apache.o
> rg%3E
> 
> On Fri, Aug 17, 2018 at 4:26 PM Lukasz Cwik  wrote:
> > I forwarded your request to a few people who work on the internal parts of 
> > Dataflow to see if they could help in
> > some way.
> > On Thu, Aug 16, 2018 at 6:22 AM Etienne Chauchot  
> > wrote:
> > > Hi all
> > > 
> > > As we already discussed, it would be good to support Metrics Pusher [1] 
> > > in Dataflow (in other runners also, of
> > > course). Today, only Spark and Flink support it. It requires a 
> > > modification in C++ Dataflow code, so only Google
> > > friends can do it. 
> > > 
> > > Is someone interested in doing it ? 
> > > 
> > > Here is the ticket https://issues.apache.org/jira/browse/BEAM-3926
> > > 
> > > Besides, I wonder if this feature should be added to the capability 
> > > matrix.
> > > 
> > > [1] 
> > > https://cwiki.apache.org/confluence/display/BEAM/Metrics+architecture+inside+the+runners
> > > 
> > > Thanks
> > > Etienne
> 
> 
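
A minimal sketch of the polling-thread option discussed above, assuming Beam's
MetricsSink interface (writeMetrics(MetricQueryResults)) and a hypothetical
fetchDataflowMetrics() wrapper around the getMetrics REST call:

  // Hedged sketch: fetchDataflowMetrics() is a hypothetical helper; the
  // scheduling below is plain java.util.concurrent.
  ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  scheduler.scheduleAtFixedRate(
      () -> {
        try {
          MetricQueryResults results = fetchDataflowMetrics(jobId);
          sink.writeMetrics(results);
        } catch (Exception e) {
          // Log and retry on the next tick rather than killing the thread.
        }
      },
      0, 5, TimeUnit.SECONDS);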

Re: Move mock classes out of test directory in BeamSQL

2018-10-03 Thread Kai Jiang
Big +1.

Best,
Kai

On Mon, Oct 1, 2018 at 10:42 PM Jean-Baptiste Onofré 
wrote:

> +1
>
> it makes sense.
>
> Regards
> JB
>
> On 02/10/2018 01:32, Rui Wang wrote:
> > Hi Community,
> >
> > BeamSQL defines some mock classes (see mock
> > <
> https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/mock
> >)
> > in a test directory. As there is more than one module under sql
> >  > now,
> > there is a need to share these mock classes among modules.
> >
> > So I want to move these mock classes to a separate module under sql
> > ,
> > so other modules' tests can depend on this mock module.
> >
> >
> > What do you think of this idea?
> >
> >
> > -Rui
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Python 3: final step

2018-10-03 Thread Valentyn Tymofieiev
Hi Rakesh and Manu,

Thanks to both of you for offering help (in different threads). It's great
to see that more and more people get involved with helping to make Beam
Python 3 compatible!

There are a few PRs in flight, and several people in the community actively
work on Python 3 support now. I would be happy to coordinate the work so
that we don't step on each other's toes and can avoid duplication of effort.

I recently looked at unit tests that are still failing in the Python 3
environment and filed a few issues (in the range BEAM-5615 - BEAM-5629),
to track similar classes of errors. You can also find them on Kanban board
[1].
In particular, BEAM-5620 and BEAM-5627 should be easy issues to get
started.

There are multiple ways you can help:
- Helping to root-cause errors. Even a comment on why a test is failing and
a suggestion for how to fix it will be helpful for others when you don't
have time to do the fix.
- Helping with code reviews.
- Reporting new issues (as subtasks to BEAM-1251), deduplicating or
splitting the existing issues. We probably don't want to file a Jira for
each of the 250+ currently failing tests at this point, but it may make sense
to track the errors that occur repeatedly or share the same root cause.
- Fixing the issues. Feel free to assign an issue to yourself if you have a
fix in mind and plan to actively work on it. Due to the nature of the
problem it may occasionally happen that two issues share the root cause, or
fixing one issue is a prerequisite for fixing another issue, so sync to
master often to make sure the issue you are working on is not already
fixed.

I'll also keep an eye on the PRs and will try to keep the list of open
issues up to date.

Thanks,
Valentyn

[1]:
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail


On Tue, Oct 2, 2018 at 9:38 AM Pablo Estrada  wrote:

> Very cool : ) I'm also available to review / merge if you need help from
> my side.
> Best
> -P.
>
> On Tue, Oct 2, 2018 at 7:45 AM Rakesh Kumar  wrote:
>
>> Hi Rob,
>>
>> I, Rakesh Kumar, am using the Beam SDK for one of my projects at Lyft. I have
>> been working closely with Thomas Weise. I have already met a couple of
>> Python SDK developers in person.
>> I am interested in helping migrate to Python 3. You can assign me PRs for
>> review. I am also more than happy to take a simple ticket to begin
>> development work on Beam.
>>
>> Thank you,
>> Rakesh
>>
>> On Wed, Sep 5, 2018 at 9:12 AM Robbe Sneyders 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> With the merging of [1], we now have Python 3 tests running on Jenkins,
>>> which allows us to move forward with the last step of the Python 3 porting.
>>>
>>> You can follow the progress on the Jira Kanban Board [2]. If you're
>>> interested in helping by porting a module, you can assign one of the issues
>>> to yourself and start coding. You can find the different steps outlined in
>>> the design document [3].
>>>
>>> We could also use some extra reviewers. If you're interested, let us
>>> know, and we'll tag you in our PRs.
>>>
>>> [1] https://github.com/apache/beam/pull/6266
>>> [2] https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245
>>> [3] https://s.apache.org/beam-python-3
>>>
>>> kind regards,
>>> Robbe
>>> --
>>> Robbe Sneyders
>>> ML6 Gent
>>> M: +32 474 71 31 08
>>>
>> --
>> Rakesh Kumar
>> Software Engineer
>> 510-761-1364
>>
>


Re: Purpose of GcpApiSurfaceTest in google-cloud-platform SDK

2018-10-03 Thread Ismaël Mejía
Just adding a bit to what Kenn said, this is the JIRA that covers the
classpath issue (which seems not to affect you, at least):
https://issues.apache.org/jira/browse/BEAM-3748
Note also that the classpath-based resolution is not future-compatible;
I saw it break when working on the migration to Java 11 (a long time ago),
so we probably need to replace/improve it.
On Wed, Oct 3, 2018 at 5:56 AM Kenneth Knowles  wrote:
>
> Worth noting that these API surface tests are not in a great state; they test 
> everything on the class path rather than just true dependencies. I don't know 
> that they still ensure the desired property; they certainly reject things 
> that they need not. From your description, it sounds like in your case they 
> have worked as intended.
>
> Kenn
>
> On Tue, Oct 2, 2018 at 1:37 PM Andrew Pilloud  wrote:
>>
>> Hi Ken,
>>
>> My understanding is that this test is intended to prevent other packages 
>> from appearing on the public API surface of the Beam package. For example, 
>> guava can't appear on the Beam public API. This is to enable users to depend 
>> on different versions of these packages than what Beam depends on.
>>
>> See 
>> https://issues.apache.org/jira/browse/BEAM-878
>>
>> We might be able to provide more guidance if you open a Beam PR with your 
>> changes so we can see the diff and test failures.
>>
>> Andrew
>>
>> On Tue, Oct 2, 2018 at 1:02 PM Kenneth Jung  wrote:
>>>
>>> Hi folks,
>>>
>>> I'm working on adding support for a new API to the google-cloud-platform 
>>> SDK, and some of my changes have caused the API surface test to start 
>>> failing. However, it's not totally clear to me what the purpose of this 
>>> test is -- it doesn't fail when new build-time dependencies are added, but 
>>> only when new types appear on the public API surface of the package. What 
>>> is the purpose of this test? Is it to prevent accidental additions of new 
>>> dependencies, or to make sure that the shadow configuration stays in sync 
>>> with the content of the package, or is there something else I'm not 
>>> thinking of? This will affect how I go about addressing the failures.
>>>
>>> Thanks
>>> Ken
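
For anyone hitting these failures for the first time, the rough shape of an
ApiSurface test is a whitelist assertion over every type reachable from the
package's public API. A sketch from memory (method names may be off, so treat
this as approximate rather than the exact GcpApiSurfaceTest code):

  @Test
  public void testApiSurface() throws Exception {
    ApiSurface apiSurface =
        ApiSurface.ofPackage(getClass().getPackage(), getClass().getClassLoader())
            .pruningPattern("org[.]apache[.]beam[.].*Test.*");
    // Fails when a type outside the whitelist leaks onto the public surface,
    // e.g. a Guava class exposed in a public method signature.
    assertThat(apiSurface, containsOnlyClassesMatching(classesInPackage("org.apache.beam")));
  }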


Re: SplittableDoFn

2018-10-03 Thread Alex Van Boxel
Yes, but we need at least Mongo 4.0 to make it production ready (I wouldn't
let anyone work with anything less because you can't checkpoint). I'm
waiting until our test cluster is on 4.0 to continue with this.

 _/
_/ Alex Van Boxel


On Wed, Oct 3, 2018 at 9:43 AM Ismaël Mejía  wrote:

> Hello Alex, thanks for sharing, a really interesting quest story; we
> really need more like this (kudos for the animations too).
> Are you planning to contribute the continuous SDF-based version of the
> Mongo connector to Beam upstream (once ready)?
>
>
> On Wed, Oct 3, 2018 at 7:07 AM Jean-Baptiste Onofré 
> wrote:
> >
> > Nice one Alex !
> >
> > Thanks
> > Regards
> > JB
> >
> > On 02/10/2018 23:19, Alex Van Boxel wrote:
> > > Don't want to crash the tech discussion here, but... I just gave a
> > > session at the Beam Summit about Splittable DoFn's from a user's
> > > perspective (from things I could gather from the documentation and
> > > experimentation). Here is the slide deck; maybe it could be
> > > useful:
> > > https://docs.google.com/presentation/d/1dSc6oKh5pZItQPB_QiUyEoLT2TebMnj-pmdGipkVFPk/edit?usp=sharing
> > > (quite proud of the animations though ;-)
> > >
> > >  _/
> > > _/ Alex Van Boxel
> > >
> > >
> > > On Thu, Sep 27, 2018 at 12:04 AM Lukasz Cwik wrote:
> > >
> > > Reuven, just inside the restriction tracker itself which is scoped
> > > per executing SplittableDoFn. A user could incorrectly write the
> > > synchronization since they are currently responsible for writing it
> > > though.
> > >
> > > On Wed, Sep 26, 2018 at 2:51 PM Reuven Lax wrote:
> > >
> > > is synchronization over an entire work item, or just inside
> > > restriction tracker? my concern is that some runners
> (especially
> > > streaming runners) might have hundreds or thousands of parallel
> > > work items being processed for the same SDF (for different
> > > keys), and I'm afraid of creating lock-contention bottlenecks.
> > >
> > > On Fri, Sep 21, 2018 at 3:42 PM Lukasz Cwik wrote:
> > >
> > > The synchronization is related to Java thread safety since
> > > there is likely to be concurrent access needed to a
> > > restriction tracker to properly handle accessing the
> backlog
> > > and splitting concurrently from when the users DoFn is
> > > executing and updating the restriction tracker. This is
> > > similar to the Java thread safety needed in BoundedSource
> > > and UnboundedSource for fraction consumed, backlog bytes,
> > > and splitting.
> > >
> > > On Fri, Sep 21, 2018 at 2:38 PM Reuven Lax <re...@google.com> wrote:
> > >
> > > Can you give details on what the synchronization is
> per?
> > > Is it per key, or global to each worker?
> > >
> > > On Fri, Sep 21, 2018 at 2:10 PM Lukasz Cwik <lc...@google.com> wrote:
> > >
> > > As I was looking at the SplittableDoFn API while
> > > working towards making a proposal for how the
> > > backlog/splitting API could look, I found some
> sharp
> > > edges that could be improved.
> > >
> > > I noticed that:
> > > 1) We require users to write thread safe code, this
> > > is something that we haven't asked of users when
> > > writing a DoFn.
> > > 2) We have "internal" methods within the
> > > RestrictionTracker that are not meant to be used by
> > > the runner.
> > >
> > > I can fix these issues by giving the user a
> > > forwarding restriction tracker[1] that provides an
> > > appropriate level of synchronization as needed and
> > > also provides the necessary observation hooks to
> see
> > > when a claim failed or succeeded.
> > >
> > > This requires a change to our experimental API
> since
> > > we need to pass a RestrictionTracker to the
> > > @ProcessElement method instead of a sub-type of
> > > RestrictionTracker.
> > > @ProcessElement
> > > processElement(ProcessContext context,
> > > OffsetRangeTracker tracker) { ... }
> > > becomes:
> > > @ProcessElement
> > > processElement(ProcessContext context,
> > > RestrictionTracker tracker) {
> ... }
> > >
> > > This provides an additional benefit that it prevents
> > > users from working around the RestrictionTracker APIs and potentially
> > > making underlying changes to the tracker outside of the tryClaim call.

Re: SplittableDoFn

2018-10-03 Thread Ismaël Mejía
Hello Alex, thanks for sharing, a really interesting quest story; we
really need more like this (kudos for the animations too).
Are you planning to contribute the continuous SDF-based version of the
Mongo connector to Beam upstream (once ready)?


On Wed, Oct 3, 2018 at 7:07 AM Jean-Baptiste Onofré  wrote:
>
> Nice one Alex !
>
> Thanks
> Regards
> JB
>
> On 02/10/2018 23:19, Alex Van Boxel wrote:
> > Don't want to crash the tech discussion here, but... I just gave a
> > session at the Beam Summit about Splittable DoFn's from a user's
> > perspective (from things I could gather from the documentation and
> > experimentation). Here is the slide deck; maybe it could be
> > useful: 
> > https://docs.google.com/presentation/d/1dSc6oKh5pZItQPB_QiUyEoLT2TebMnj-pmdGipkVFPk/edit?usp=sharing
> >  (quite
> > proud of the animations though ;-)
> >
> >  _/
> > _/ Alex Van Boxel
> >
> >
> > On Thu, Sep 27, 2018 at 12:04 AM Lukasz Cwik wrote:
> >
> > Reuven, just inside the restriction tracker itself which is scoped
> > per executing SplittableDoFn. A user could incorrectly write the
> > synchronization since they are currently responsible for writing it
> > though.
> >
> > On Wed, Sep 26, 2018 at 2:51 PM Reuven Lax wrote:
> >
> > is synchronization over an entire work item, or just inside
> > restriction tracker? my concern is that some runners (especially
> > streaming runners) might have hundreds or thousands of parallel
> > work items being processed for the same SDF (for different
> > keys), and I'm afraid of creating lock-contention bottlenecks.
> >
> > On Fri, Sep 21, 2018 at 3:42 PM Lukasz Cwik wrote:
> >
> > The synchronization is related to Java thread safety since
> > there is likely to be concurrent access needed to a
> > restriction tracker to properly handle accessing the backlog
> > and splitting concurrently from when the users DoFn is
> > executing and updating the restriction tracker. This is
> > similar to the Java thread safety needed in BoundedSource
> > and UnboundedSource for fraction consumed, backlog bytes,
> > and splitting.
> >
> > On Fri, Sep 21, 2018 at 2:38 PM Reuven Lax wrote:
> >
> > Can you give details on what the synchronization is per?
> > Is it per key, or global to each worker?
> >
> > On Fri, Sep 21, 2018 at 2:10 PM Lukasz Cwik <lc...@google.com> wrote:
> >
> > As I was looking at the SplittableDoFn API while
> > working towards making a proposal for how the
> > backlog/splitting API could look, I found some sharp
> > edges that could be improved.
> >
> > I noticed that:
> > 1) We require users to write thread safe code, this
> > is something that we haven't asked of users when
> > writing a DoFn.
> > 2) We have "internal" methods within the
> > RestrictionTracker that are not meant to be used by
> > the runner.
> >
> > I can fix these issues by giving the user a
> > forwarding restriction tracker[1] that provides an
> > appropriate level of synchronization as needed and
> > also provides the necessary observation hooks to see
> > when a claim failed or succeeded.
> >
> > This requires a change to our experimental API since
> > we need to pass a RestrictionTracker to the
> > @ProcessElement method instead of a sub-type of
> > RestrictionTracker.
> > @ProcessElement
> > processElement(ProcessContext context,
> > OffsetRangeTracker tracker) { ... }
> > becomes:
> > @ProcessElement
> > processElement(ProcessContext context,
> > RestrictionTracker tracker) { ... }
> >
> > This provides an additional benefit that it prevents
> > users from working around the RestrictionTracker
> > APIs and potentially making underlying changes to
> > the tracker outside of the tryClaim call.
> >
> > Full implementation is available within this PR[2]
> > and I was wondering what people thought.
> >
> > 1: 
> > https://github.com/apache/beam/pull/6467/files#diff-ed95abb6bc30a9ed07faef5c3fea93f0R72
> > 2: https://github.com/apache/beam/pull/64
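
To ground the signature change being discussed, a minimal sketch of an SDF
@ProcessElement written against the generic RestrictionTracker type, as the
PR proposes (offset-based restriction; illustrative only, not the final API):

  @BoundedPerElement
  static class ReadFn extends DoFn<String, Long> {
    @GetInitialRestriction
    public OffsetRange getInitialRestriction(String element) {
      return new OffsetRange(0, 100);
    }

    @ProcessElement
    public void processElement(
        ProcessContext context, RestrictionTracker<OffsetRange, Long> tracker) {
      // tryClaim is the only way this DoFn touches the tracker, which is what
      // lets the runner wrap it in a synchronized forwarding tracker.
      for (long i = tracker.currentRestriction().getFrom(); tracker.tryClaim(i); ++i) {
        context.output(i);
      }
    }
  }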