Re: ** Configurable FluentBackoff for IOs **

2020-04-21 Thread Tim Robertson
I can answer for the case of SolrIO and ElasticsearchIO, Luke.
Retrying in SolrIO was my first contribution to Beam and I see in the PR
[1] that I was just copying JdbcIO for styling. ElasticsearchIO then
followed suit.

Exposing FluentBackoff seems sensible to me.

[1] https://github.com/apache/beam/pull/4905
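
For context, a minimal sketch of the FluentBackoff utility under discussion, as the IOs use it internally today (builder method names are taken from org.apache.beam.sdk.util.FluentBackoff; treat this as illustrative, not a stable public contract):

```java
import java.io.IOException;
import org.apache.beam.sdk.util.BackOff;
import org.apache.beam.sdk.util.FluentBackoff;
import org.joda.time.Duration;

public class BackoffSketch {
  public static void main(String[] args) throws IOException {
    // The knobs the thread talks about exposing: initial backoff,
    // retry count, and a cap on cumulative time spent backing off.
    BackOff backoff =
        FluentBackoff.DEFAULT
            .withInitialBackoff(Duration.standardSeconds(1))
            .withMaxRetries(5)
            .withMaxCumulativeBackoff(Duration.standardMinutes(2))
            .backoff();

    long sleepMillis;
    while ((sleepMillis = backoff.nextBackOffMillis()) != BackOff.STOP) {
      // An IO would attempt its write here and break on success;
      // BackOff.STOP signals that the retries are exhausted.
      System.out.println("Would sleep " + sleepMillis + " ms before retrying");
    }
  }
}
```

Making this class public, or a thin wrapper over it, is essentially what the thread is weighing.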

On Tue, Apr 21, 2020 at 8:59 PM Akshay Iyangar  wrote:

> I’m ok either with having something generic that calls FluentBackoff
> internally or with making FluentBackoff available publicly. Whichever way the
> community decides, I can change all IOs to comply with the standard.
>
>
>
> *From: *Alexey Romanenko 
> *Reply-To: *"dev@beam.apache.org" 
> *Date: *Tuesday, April 21, 2020 at 9:42 AM
> *To: *"dev@beam.apache.org" 
> *Subject: *Re: ** Configurable FluentBackoff for IOs **
>
>
>
>
>
>
> My guess is it was done this way to avoid exposing a class from the
> “org.apache.beam.sdk.util” package.
>
>
>
> Can we just move FluentBackoff from the “org.apache.beam.sdk.util” package to
> another package that is available to users, or create a common wrapper for
> such cases, like IO retries?
>
>
>
> On 17 Apr 2020, at 21:14, Luke Cwik  wrote:
>
>
>
>
>
>
>
> On Fri, Apr 17, 2020 at 9:57 AM Alexey Romanenko 
> wrote:
>
> As we can see, backoff support in some form is a much-requested feature
> across different IOs. Of course, we don’t want to expose too many knobs,
> but it seems this “backoff knob” should be configurable by the user since it
> depends on different aspects of their environment.
>
>
>
> In the PR mentioned by Jonothan, we discussed that FluentBackoff was not
> exposed since it’s part of the “org.apache.beam.sdk.util” package, which is
> for internal use only.
>
>
>
> Since many IOs already use this by wrapping it in their own API classes, why
> not make FluentBackoff part of the public API?
>
>
>
>
>
> That is what we are trying to answer: why did those implementations decide
> to wrap it instead of exposing it?
>
>
>
> On 17 Apr 2020, at 17:16, Luke Cwik  wrote:
>
>
>
> Jonothan, you're still on point, because exposing and/or using the
> client-specific retry implementation is a valid strategy as it exposes all
> the knobs that a user may want to use.
>
> A downside I can see is that it may expose knobs that are irrelevant for
> the transform, or make it difficult to integrate other forms of retry that
> are specific to the transform beyond what the client library can do, such
> as deciding what to do with failed records (retry, go to a DLQ, or drop
> them).
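
To illustrate that last point with a hedged sketch (the element type, tag names, and the writeWithBackoff helper are hypothetical, not from any Beam IO): a transform-level retry wrapper can route records that exhaust their retries to a dead-letter output, which a client-level retry policy alone cannot express:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<String> successTag = new TupleTag<String>() {};
final TupleTag<String> deadLetterTag = new TupleTag<String>() {};

// 'documents' is an assumed PCollection<String> of records to write.
PCollectionTuple results =
    documents.apply(
        "WriteWithRetry",
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void process(ProcessContext c) {
                    try {
                      // Hypothetical helper that retries internally
                      // (e.g. via FluentBackoff) before giving up.
                      writeWithBackoff(c.element());
                      c.output(c.element());
                    } catch (Exception e) {
                      // Retries exhausted: emit to the dead-letter output
                      // instead of failing the bundle.
                      c.output(deadLetterTag, c.element());
                    }
                  }
                })
            .withOutputTags(successTag, TupleTagList.of(deadLetterTag)));

PCollection<String> deadLetters = results.get(deadLetterTag);
```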
>
>
>
> Looking through the code for more examples, I see everyone rolling their
> own instead of exposing FluentBackoff or exposing client-specific retry
> implementations:
>
> DynamoDBIO:
> https://github.com/apache/beam/blob/a1b79fdc995c869d1f32fab2e2004621b2d53988/sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/dynamodb/DynamoDBIO.java#L290
>
> ElasticsearchIO:
> https://github.com/apache/beam/blob/a1b79fdc995c869d1f32fab2e2004621b2d53988/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L937
>
> ClickHouseIO:
> https://github.com/apache/beam/blob/a1b79fdc995c869d1f32fab2e2004621b2d53988/sdks/java/io/clickhouse/src/main/java/org/apache/beam/sdk/io/clickhouse/ClickHouseIO.java#L258
>
>
>
> On Fri, Apr 17, 2020 at 8:14 AM Chamikara Jayalath 
> wrote:
>
> Another option might be to add explicitly defined retry policies to the
> API. For example, see the following for BigQueryIO.
>
>
>
>
> https://github.com/apache/beam/blob/a1b79fdc995c869d1f32fab2e2004621b2d53988/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.java
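
A hedged sketch of that style of API, paraphrasing the shape of BigQueryIO's InsertRetryPolicy (the Context class here is deliberately simplified for illustration):

```java
import java.io.Serializable;

// Sketch of an explicitly defined retry policy: instead of exposing raw
// backoff knobs, the IO exposes named policies plus a hook for custom ones.
public abstract class RetryPolicy implements Serializable {

  /** Simplified stand-in for the per-failure context the IO would pass in. */
  public static class Context {
    private final Exception error;

    public Context(Exception error) {
      this.error = error;
    }

    public Exception getError() {
      return error;
    }
  }

  /** Decides, per failure, whether the element should be retried. */
  public abstract boolean shouldRetry(Context context);

  public static RetryPolicy neverRetry() {
    return new RetryPolicy() {
      @Override
      public boolean shouldRetry(Context context) {
        return false;
      }
    };
  }

  public static RetryPolicy alwaysRetry() {
    return new RetryPolicy() {
      @Override
      public boolean shouldRetry(Context context) {
        return true;
      }
    };
  }
}
```

For reference, BigQueryIO wires this in through Write.withFailedInsertRetryPolicy(...); other IOs would pick their own setter.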
>
>
>
> On Thu, Apr 16, 2020 at 9:48 PM Akshay Iyangar 
> wrote:
>
> Luke
>
> I think for [2] and [3] it would be fair to say that maybe they
> wanted to add a custom retry configuration. But [2] looks very specific, in
> the sense that it doesn’t allow the client to be more flexible; [3] is
> something that I feel can be moved up and made generic enough.
>
>
>
> Jonothan
>
> Sorry for that, this was actually with regards to JdbcIO. My bad calling
> it S3.
>
>
>
>
>
>
>
> *From: *Jonothan Farr 
> *Reply-To: *"dev@beam.apache.org" 
> *Date: *Thursday, April 16, 2020 at 7:07 PM
> *To: *"dev@beam.apache.org" 
> *Subject: *Re: ** Configurable FluentBackoff for IOs **
>
>
>
>
>
>
> Maybe this is a separate conversation, but for AWS IOs specifically
> wouldn't it be better to use the AWS client's retry policy? Something
> similar to this:
> ```
>   @Override
>   public AmazonS3ClientBuilder createBuilder(S3Options s3Options) {
>     RetryPolicy retryPolicy = new RetryPolicy(
>         PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION,
>         PredefinedRetryPolicies.DEFAULT_BACKOFF_STRATEGY,
>         PredefinedRetryPolicies.DEFAULT_MAX_ERROR_RETRY,
>         false);
>     AmazonS3ClientBuilder builder =
>
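
The archived message is truncated at this point. For completeness, a hedged sketch of how such builder wiring is typically finished, assuming the AWS SDK for Java v1 (this completion is an editor's assumption, not the original author's code):

```java
// Hedged completion - the original email is cut off in the archive.
// In the AWS SDK for Java v1, the RetryPolicy is attached through a
// ClientConfiguration on the standard S3 builder.
AmazonS3ClientBuilder builder =
    AmazonS3ClientBuilder.standard()
        .withClientConfiguration(
            new ClientConfiguration().withRetryPolicy(retryPolicy));
return builder;
```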

Re: Beam 2.15.0 SparkRunner issues

2019-10-08 Thread Tim Robertson
I'm sorry for not replying. We are super busy trying to prepare data for
release.

An update:
- We were using G1GC and, through Slack, were advised against that. This
fixed the OOM errors we saw, and all our 2.15.0 jobs completed.
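
For anyone hitting the same OOM, a hedged example of the kind of change described (the flag names are standard Spark/JVM options; which collector to use instead of G1 is a judgment call, and the rest of the submit command is a placeholder):

```
# Example only: override the executor/driver GC away from G1GC at submit time.
spark-submit \
  --conf spark.executor.extraJavaOptions=-XX:+UseParallelGC \
  --conf spark.driver.extraJavaOptions=-XX:+UseParallelGC \
  my-pipeline.jar  # remaining application arguments unchanged
```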

When we have time (in about 3 weeks) I'll try to isolate a test case with the
reshuffle example and parallelism.

Thanks,
Tim


On Thu, Oct 3, 2019 at 1:21 PM Jan Lukavský  wrote:

> Hi Tim,
>
> can you please elaborate more about some parts?
>
> 1) What happens actually in your case? What is the specific settings you
> use?
>
> 3) Can you share stacktrace? Is it always the same, or does it change?
>
> The mentioned GroupCombineFunctions.java:202 comes from a Reshuffle,
> which seems to make little sense to me given the logic you
> described. Do you use the Reshuffle transform or does it expand from some
> other transform?
>
> Jan
>
> On 10/3/19 9:24 AM, Tim Robertson wrote:
> > Hi all,
> >
> > We haven't dug enough into this to know where to log issues, but I'll
> > start by sharing here.
> >
> > After upgrading from Beam 2.10.0 to 2.15.0 we see issues on
> > SparkRunner - we suspect all of these are related.
> >
> > 1. spark.default.parallelism is not respected
> >
> > 2. File writing (Avro) with dynamic destinations (grouped into folders
> > by a field name) consistently fail with
> > org.apache.beam.sdk.util.UserCodeException:
> > java.nio.file.FileAlreadyExistsException: Unable to rename resource
> >
> hdfs://ha-nn/pipelines/export-20190930-0854/.temp-beam-d4fd89ed-fc7a-4b1e-aceb-68f9d72d50f0/6e086f60-8bda-4d0e-b29d-1b47fdfc88c0
>
> > to
> >
> hdfs://ha-nn/pipelines/export-20190930-0854/7c9d2aec-f762-11e1-a439-00145eb45e9a/verbatimHBaseExport-0-of-1.avro
>
> > as destination already exists and couldn't be deleted.
> >
> > 3. GBK operations that run over 500M small records consistently fail
> > with OOM. We tried different configs with 48GB, 60GB, 80GB executor
> > memory
> >
> > Our pipelines are batch, simple transformations with either an
> > HBaseSnapshot to Avro files or a merge of records in Avro (the GBK
> > issue) pushed to Elasticsearch (it fails upstream of the
> > ElasticsearchIO in the GBK stage).
> >
> > We notice operations that were mapToPair in 2.10.0 become repartition
> > operations (mapToPair at GroupCombineFunctions.java:68 becomes
> > repartition at GroupCombineFunctions.java:202), which might be related
> > to this and looks surprising.
> >
> > I'll report more as we learn. If anyone has any immediate ideas based
> > on their commits or reviews, or if you would like tests run on other Beam
> > versions, please say.
> >
> > Thanks,
> > Tim
> >
> >
> >
>


Re: Live fixing of a Beam bug on July 25 at 3:30pm-4:30pm PST

2019-07-19 Thread Tim Sell
+1, I'd love to see this as a recording. Will you stick it up on YouTube
afterwards?

On Thu, Jul 18, 2019 at 4:00 AM sridhar inuog 
wrote:

> Thanks, Pablo! Looking forward to it! Hopefully it will be recorded
> as well.
>
> On Wed, Jul 17, 2019 at 2:50 PM Pablo Estrada  wrote:
>
>> Yes! So I will be working on a small feature request for Java's
>> BigQueryIO: https://issues.apache.org/jira/browse/BEAM-7607
>>
>> Maybe I'll do something for Python next month. : )
>> Best
>> -P.
>>
>> On Wed, Jul 17, 2019 at 12:32 PM Rakesh Kumar 
>> wrote:
>>
>>> +1, I really appreciate this initiative. It would be really helpful for
>>> newbies like me.
>>>
>>> Is it possible to list the things that you are planning to
>>> cover?
>>>
>>>
>>>
>>>
>>> On Tue, Jul 16, 2019 at 11:19 AM Yichi Zhang  wrote:
>>>
 Thanks for organizing this Pablo, it'll be very helpful!

 On Tue, Jul 16, 2019 at 10:57 AM Pablo Estrada 
 wrote:

> Hello all,
> I'll be having a session where I live-fix a Beam bug for 1 hour next
> week. Everyone is invited.
>
> It will be on July 25, between 3:30pm and 4:30pm PST. Hopefully I will
> finish a full change in that time frame, but we'll see.
>
> I have not yet decided if I will do this via hangouts, or via a
> youtube livestream. In any case, I will share the link here in the next 
> few
> days.
>
> I will most likely work on the Java SDK (I have a little feature
> request in mind).
>
> Thanks!
> -P.
>



Re: [ANNOUNCE] New committer: Robert Burke

2019-07-17 Thread Tim Robertson
Congratulations Robert!

On Wed, Jul 17, 2019 at 2:47 PM Gleb Kanterov  wrote:

> Congratulations, Robert!
>
> On Wed, Jul 17, 2019 at 1:50 PM Robert Bradshaw 
> wrote:
>
>> Congratulations!
>>
>> On Wed, Jul 17, 2019, 12:56 PM Katarzyna Kucharczyk <
>> ka.kucharc...@gmail.com> wrote:
>>
>>> Congratulations! :)
>>>
>>> On Wed, Jul 17, 2019 at 12:46 PM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 Congratulations, Robert! :)

 On Wed, Jul 17, 2019 at 12:45 PM Łukasz Gajowy 
 wrote:

> Congratulations! :)
>
> On Wed, 17 Jul 2019 at 04:30, Rakesh Kumar 
> wrote:
>
>> Congrats Rob!!!
>>
>> On Tue, Jul 16, 2019 at 10:24 AM Ahmet Altay 
>> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Robert Burke.
>>>
>>> Robert has been contributing to Beam and actively involved in the
>>> community for over a year. He has been actively working on Go SDK, 
>>> helping
>>> users, and making it easier for others to contribute [1].
>>>
>>> In consideration of Robert's contributions, the Beam PMC trusts him
>>> with the responsibilities of a Beam committer [2].
>>>
>>> Thank you, Robert, for your contributions and looking forward to
>>> many more!
>>>
>>> Ahmet, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
>>> [2] https://beam.apache.org/contribute/become-a-committer
>>> /#an-apache-beam-committer
>>>
>>

 --

 Michał Walenia
 Polidea  | Software Engineer

 M: +48 791 432 002 <+48791432002>
 E: michal.wale...@polidea.com

 Unique Tech
 Check out our projects! 

>>>
>
> --
> Cheers,
> Gleb
>


Re: jobs not started

2019-06-27 Thread Tim Robertson
Hi Chaim,

To help you we'd need a little more detail I think - what environment,
runner, how you launch your jobs etc.

My first impression is that is sounds more like an environment related
thing rather than a Beam codebase issue. If it is a DataFlow environment I
expect you might need to explore the helpdesk of dataflow. I notice for
example others report this on SO
https://stackoverflow.com/questions/30189691/dataflow-zombie-jobs-stuck-in-not-started-state

I hope this is somewhat useful,
Tim

On Thu, Jun 27, 2019 at 8:12 AM Chaim Turkel  wrote:

> Since last night, all the jobs I run are stuck in "not started" - any ideas
> why?
> chaim
>
>


Re: [ANNOUNCE] New committer: Mikhail Gryzykhin

2019-06-21 Thread Tim Robertson
Congratulations Mikhail!

On Fri, Jun 21, 2019 at 12:37 PM Robert Burke  wrote:

> Congrats
>
> On Fri, Jun 21, 2019, 12:29 PM Thomas Weise  wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Mikhail Gryzykhin.
>>
>> Mikhail has been contributing to Beam and actively involved in the
>> community for over a year. He developed the community build dashboard [1]
>> and added substantial improvements to our build infrastructure. Mikhail's
>> work also covers metrics, contributor documentation, development process
>> improvements and other areas.
>>
>> In consideration of Mikhail's contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer [2].
>>
>> Thank you, Mikhail, for your contributions and looking forward to many
>> more!
>>
>> Thomas, on behalf of the Apache Beam PMC
>>
>> [1] https://s.apache.org/beam-community-metrics
>> [2]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>


Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Tim Robertson
This is great. Thanks Pablo and all

I've seen several folks struggle with writing Avro to dynamic locations,
which I think might be a good addition. If you agree I'll offer a PR unless
someone gets there first - I have an example here:

https://github.com/gbif/pipelines/blob/master/pipelines/export-gbif-hbase/src/main/java/org/gbif/pipelines/hbase/beam/ExportHBase.java#L81
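
For reference, a minimal sketch of that pattern (hedged: written against Beam's FileIO.writeDynamic and AvroIO.sink as I understand them; the "datasetKey" field and output layout are assumptions for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.values.PCollection;

/** Writes Avro records into a sub-directory per value of an assumed field. */
class DynamicAvroWrite {
  static void write(PCollection<GenericRecord> records, Schema schema, String outputDir) {
    records.apply(
        "WriteDynamicAvro",
        FileIO.<String, GenericRecord>writeDynamic()
            // Destination = the field value that decides the output folder.
            .by(r -> String.valueOf(r.get("datasetKey")))
            .withDestinationCoder(StringUtf8Coder.of())
            .via(AvroIO.sink(schema))
            .to(outputDir)
            .withNaming(key -> FileIO.Write.defaultNaming(key + "/export", ".avro")));
  }
}
```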


On Fri, Jun 7, 2019 at 10:52 PM Pablo Estrada  wrote:

> Hello everyone,
> A group of community members has been working on gathering and providing
> common pipeline patterns for pipelines in Beam. These are examples of how
> to perform certain operations, and useful ways of using Beam in your
> pipelines. Some of them relate to processing of files, use of side inputs,
> state/timers, etc. Check them out [1].
>
> These initial patterns have been chosen based on evidence gathered from
> StackOverflow, and from talking to users of Beam.
>
> It would be great if this section could grow, and be useful to many Beam
> users. For that reason, we invite anyone to share patterns, and pipeline
> examples that they have used in the past. If you are interested in
> contributing, please submit a pull request, or get in touch with Cyrus
> Maden, Reza Rokni, Melissa Pashniak or myself.
>
> Thanks!
> Best
> -P.
>
> [1] https://beam.apache.org/documentation/patterns/overview/
>


Re: [ANNOUNCE] New committer announcement: Mark Liu

2019-05-09 Thread Tim Robertson
Congratulations Mark!

On Thu, May 9, 2019 at 10:16 PM David Morávek 
wrote:

> Congrats!
>
> D.
>
> Sent from my iPhone
>
> On 9 May 2019, at 10:07, Reuven Lax  wrote:
>
> Congratulations!
>
> On Thu, May 9, 2019 at 5:14 AM Etienne Chauchot 
> wrote:
>
>> Congrats !
>> On Monday, 25 March 2019 at 10:55 -0700, Chamikara Jayalath wrote:
>>
>> Congrats Mark!
>>
>> On Mon, Mar 25, 2019 at 10:50 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>> Congratulations, Mark!
>>
>> On 25 Mar 2019, at 18:36, Mark Liu  wrote:
>>
>> Thank you all! It's a great pleasure to work on Beam!
>>
>> Mark
>>
>> On Mon, Mar 25, 2019 at 10:18 AM Robin Qiu  wrote:
>>
>> Congratulations, Mark!
>>
>> On Mon, Mar 25, 2019 at 9:31 AM Udi Meiri  wrote:
>>
>> Congrats Mark!
>>
>> On Mon, Mar 25, 2019 at 9:24 AM Ahmet Altay  wrote:
>>
>> Congratulations, Mark! 
>>
>> On Mon, Mar 25, 2019 at 7:24 AM Tim Robertson 
>> wrote:
>>
>> Congratulations Mark!
>>
>>
>> On Mon, Mar 25, 2019 at 3:18 PM Michael Luckey 
>> wrote:
>>
>> Nice! Congratulations, Mark.
>>
>> On Mon, Mar 25, 2019 at 2:42 PM Katarzyna Kucharczyk <
>> ka.kucharc...@gmail.com> wrote:
>>
>> Congratulations, Mark! 
>>
>> On Mon, Mar 25, 2019 at 11:24 AM Gleb Kanterov  wrote:
>>
>> Congratulations!
>>
>> On Mon, Mar 25, 2019 at 10:23 AM Łukasz Gajowy 
>> wrote:
>>
>> Congrats! :)
>>
>>
>>
>> On Mon, 25 Mar 2019 at 08:11, Aizhamal Nurmamat kyzy 
>> wrote:
>>
>> Congratulations, Mark!
>>
>> On Sun, Mar 24, 2019 at 23:18 Pablo Estrada  wrote:
>>
>> Yeaah  Mark! : ) Congrats : D
>>
>> On Sun, Mar 24, 2019 at 10:32 PM Yifan Zou  wrote:
>>
>> Congratulations Mark!
>>
>> On Sun, Mar 24, 2019 at 10:25 PM Connell O'Callaghan 
>> wrote:
>>
>> Well done congratulations Mark!!!
>>
>> On Sun, Mar 24, 2019 at 10:17 PM Robert Burke  wrote:
>>
>> Congratulations Mark! 
>>
>> On Sun, Mar 24, 2019, 10:08 PM Valentyn Tymofieiev 
>> wrote:
>>
>> Congratulations, Mark!
>>
>> Thanks for your contributions, in particular for your efforts to
>> parallelize test execution for Python SDK and increase the speed of Python
>> precommit checks.
>>
>> On Sun, Mar 24, 2019 at 9:40 PM Kenneth Knowles  wrote:
>>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Mark Liu.
>>
>> Mark has been contributing to Beam since late 2016! He has proposed 100+
>> pull requests. Mark was instrumental in expanding test and infrastructure
>> coverage, especially for Python. In consideration of Mark's
>> contributions, the Beam PMC trusts Mark with the responsibilities of a Beam
>>  committer [1].
>>
>> Thank you, Mark, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam
>> -committer
>>
>> --
>>
>> *Aizhamal Nurmamat kyzy*
>> Open Source Program Manager
>> 646-355-9740 Mobile
>> 601 North 34th Street, Seattle, WA 98103
>>
>>
>>
>>


Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-25 Thread Tim Robertson
Thank you for running the release Andrew

On Thu, Apr 25, 2019 at 8:24 PM Andrew Pilloud  wrote:

> I reran the Nexmark tests, each runner passed. I compared the numbers
> on the direct runner to the dashboard and they are where they should
> be.
>
> With that, I'm happy to announce that we have unanimously approved this
> release.
>
> There are 8 approving votes, 4 of which are binding:
> * Jean-Baptiste Onofré
> * Lukasz Cwik
> * Maximilian Michels
> * Ahmet Altay
>
> There are no disapproving votes.
>
> Thanks everyone!
>


Re: [ANNOUNCE] New committer announcement: Yifan Zou

2019-04-22 Thread Tim Robertson
Congratulations Yifan!

On Mon, Apr 22, 2019 at 5:39 PM Cyrus Maden  wrote:

> Congratulations Yifan!!
>
> On Mon, Apr 22, 2019 at 11:26 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Yifan Zou.
>>
>> Yifan has been contributing to Beam since early 2018. He has proposed
>> 70+ pull requests, adding dependency checking and improving test
>> infrastructure. But something the numbers cannot show adequately is the
>> huge effort Yifan has put into working with infra and keeping our Jenkins
>> executors healthy.
>>
>> In consideration of Yifan's contributions, the Beam PMC trusts Yifan with
>> the responsibilities of a Beam committer [1].
>>
>> Thank you, Yifan, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam
>> -committer
>>
>


Re: [ANNOUNCE] New committer announcement: Boyuan Zhang

2019-04-11 Thread Tim Robertson
Many congratulations Boyuan!

On Thu, Apr 11, 2019 at 10:50 AM Łukasz Gajowy  wrote:

> Congrats Boyuan! :)
>
> On Wed, 10 Apr 2019 at 23:49, Chamikara Jayalath 
> wrote:
>
>> Congrats Boyuan!
>>
>> On Wed, Apr 10, 2019 at 11:14 AM Yifan Zou  wrote:
>>
>>> Congratulations Boyuan!
>>>
>>> On Wed, Apr 10, 2019 at 10:49 AM Daniel Oliveira 
>>> wrote:
>>>
 Congrats Boyuan!

 On Wed, Apr 10, 2019 at 10:20 AM Rui Wang  wrote:

> So well deserved!
>
> -Rui
>
> On Wed, Apr 10, 2019 at 10:12 AM Pablo Estrada 
> wrote:
>
>> Well deserved : ) congrats Boyuan!
>>
>> On Wed, Apr 10, 2019 at 10:08 AM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Congratulations Boyuan!
>>>
>>> On Wed, Apr 10, 2019 at 9:52 AM Ruoyun Huang 
>>> wrote:
>>>
 Thanks for your contributions and congratulations Boyuan!

 On Wed, Apr 10, 2019 at 9:00 AM Kenneth Knowles 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Boyuan Zhang.
>
> Boyuan has been contributing to Beam since early 2018. She has
> proposed 100+ pull requests across a wide range of topics: bug fixes, 
> to
> integration tests, build improvements, metrics features, release
> automation. Two big picture things to highlight are 
> building/releasing Beam
> Python wheels and managing the donation of the Beam Dataflow Java 
> Worker,
> including help with I.P. clearance.
>
> In consideration of Boyuan's contributions, the Beam PMC trusts
> Boyuan with the responsibilities of a Beam committer [1].
>
> Thank you, Boyuan, for your contributions.
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/
> #an-apache-beam-committer
>


 --
 
 Ruoyun  Huang




Re: [ANNOUNCE] New committer announcement: Mark Liu

2019-03-25 Thread Tim Robertson
Congratulations Mark!


On Mon, Mar 25, 2019 at 3:18 PM Michael Luckey  wrote:

> Nice! Congratulations, Mark.
>
> On Mon, Mar 25, 2019 at 2:42 PM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
>
>> Congratulations, Mark! 
>>
>> On Mon, Mar 25, 2019 at 11:24 AM Gleb Kanterov  wrote:
>>
>>> Congratulations!
>>>
>>> On Mon, Mar 25, 2019 at 10:23 AM Łukasz Gajowy 
>>> wrote:
>>>
 Congrats! :)



 On Mon, 25 Mar 2019 at 08:11, Aizhamal Nurmamat kyzy 
 wrote:

> Congratulations, Mark!
>
> On Sun, Mar 24, 2019 at 23:18 Pablo Estrada 
> wrote:
>
>> Yeaah  Mark! : ) Congrats : D
>>
>> On Sun, Mar 24, 2019 at 10:32 PM Yifan Zou 
>> wrote:
>>
>>> Congratulations Mark!
>>>
>>> On Sun, Mar 24, 2019 at 10:25 PM Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Well done congratulations Mark!!!

 On Sun, Mar 24, 2019 at 10:17 PM Robert Burke 
 wrote:

> Congratulations Mark! 
>
> On Sun, Mar 24, 2019, 10:08 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congratulations, Mark!
>>
>> Thanks for your contributions, in particular for your efforts to
>> parallelize test execution for Python SDK and increase the speed of 
>> Python
>> precommit checks.
>>
>> On Sun, Mar 24, 2019 at 9:40 PM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Mark Liu.
>>>
>>> Mark has been contributing to Beam since late 2016! He has
>>> proposed 100+ pull requests. Mark was instrumental in expanding 
>>> test and
>>> infrastructure coverage, especially for Python. In
>>> consideration of Mark's contributions, the Beam PMC trusts Mark 
>>> with the
>>> responsibilities of a Beam committer [1].
>>>
>>> Thank you, Mark, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/
>>> #an-apache-beam-committer
>>>
>> --
>
> *Aizhamal Nurmamat kyzy*
>
> Open Source Program Manager
>
> 646-355-9740 Mobile
>
> 601 North 34th Street, Seattle, WA 98103
>
>
>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>


Re: [ANNOUNCE] New committer announcement: Raghu Angadi

2019-03-07 Thread Tim Robertson
Congrats Raghu

On Thu, Mar 7, 2019 at 7:09 PM Ahmet Altay  wrote:

> Congratulations!
>
> On Thu, Mar 7, 2019 at 10:08 AM Ruoyun Huang  wrote:
>
>> Thank you Raghu for your contribution!
>>
>>
>>
>> On Thu, Mar 7, 2019 at 9:58 AM Connell O'Callaghan 
>> wrote:
>>
>>> Congratulation Raghu!!! Thank you for sharing Kenn!!!
>>>
>>> On Thu, Mar 7, 2019 at 9:55 AM Ismaël Mejía  wrote:
>>>
 Congrats !

 On Thu, 7 Mar 2019 at 17:09, Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Congratulations, Raghu!!!
> On Thu, Mar 7, 2019 at 08:07 Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Raghu Angadi
>>
>> Raghu has been contributing to Beam since early 2016! He has
>> continuously improved KafkaIO and supported on the user@ list but
>> his community contributions are even more extensive, including reviews, 
>> dev@
>> list discussions, improvements and ideas across SqsIO, FileIO, PubsubIO,
>> and the Dataflow and Samza runners. In consideration of Raghu's
>> contributions, the Beam PMC trusts Raghu with the responsibilities of a
>> Beam committer [1].
>>
>> Thank you, Raghu, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-
>> beam-committer
>>
>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>>


Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-27 Thread Tim Robertson
Congrats Michael and welcome.

On Thu, Feb 28, 2019 at 7:41 AM Gleb Kanterov  wrote:

> Congratulations and welcome!
>
> On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan 
> wrote:
>
>> Excellent thank you for sharing Kenn!!!
>>
>> Michael congratulations for this recognition of your contributions to
>> advancing BEAM
>>
>> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Michael Luckey
>>>
>>> Michael has been contributing to Beam since early 2017. He has fixed
>>> many build and developer environment issues, noted and root-caused
>>> breakages on master, generously reviewed many others' changes to the build. 
>>> In
>>> consideration of Michael's contributions, the Beam PMC trusts Michael with
>>> the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Michael, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-
>>> beam-committer
>>>
>>
>
> --
> Cheers,
> Gleb
>


Re: Signing off

2019-02-14 Thread Tim Robertson
What a shame for the project but best of luck for the future Scott.

Thanks for all your contributions - they have been significant!
Tim

On Thu, Feb 14, 2019 at 7:37 PM Scott Wegner  wrote:

> I wanted to let you all know that I've decided to pursue a new adventure
> in my career, which will take me away from Apache Beam development.
>
> It's been a fun and fulfilling journey. Apache Beam has been my first
> significant experience working in open source. I'm inspired observing how
> the community has come together to deliver something great.
>
> Thanks for everything. If you're curious what's next: I'll be working on
> Federated Learning at Google:
> https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
>
> Take care,
> Scott
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: [ANNOUNCE] New PMC member: Etienne Chauchot

2019-01-25 Thread Tim
Congratulations Etienne!

Tim

> On 25 Jan 2019, at 23:00, Kenneth Knowles  wrote:
> 
> Hi all,
> 
> Please join me and the rest of the Beam PMC in welcoming Etienne Chauchot to 
> join the PMC.
> 
> Etienne introduced himself to dev@ in September of 2017 and over the years 
> has contributed to Beam in many ways - connectors, performance, design 
> discussion, talks, code reviews, and I'm sure I cannot list them all. He 
> already has a major impact on the direction of Beam.
> 
> Thanks for being a part of Beam, Etienne!
> 
> Kenn


Re: BEAM-6324 / #7340: "I've pretty much given up on the PR being merged. I use my own fork for my projects"

2019-01-25 Thread Tim Robertson
Thanks Kenn

I tend to think that timing is the main contributing factor as you note on
the Jira - it slipped down with no reminders / bumps sent on any channels
that I can see.

Would something that alerts the dev@ list of PRs that have not received any
attention after N days be helpful perhaps?
Even if that only prompts one of us to comment on the PR to acknowledge it,
that would likely be enough to engage the contributor -
they would hopefully then ping the individual if it slips again for a long
time.

Next week will be my first chance to work on Beam in 2019, but I'll
comment on that PR now too as it's missing tests.





On Fri, Jan 25, 2019 at 7:27 AM Kenneth Knowles  wrote:

> The subject line is a quote from BEAM-6324*
>
> This makes me sad. I hope/expect it is a failure to route a pull request
> to the right reviewer. I am less sad about the functionality than the
> sentiment and how a contributor is being discouraged.
>
> Does anyone have ideas that could help?
>
> Kenn
>
> *https://issues.apache.org/jira/browse/BEAM-6324
>


Re: [ANNOUNCE] New committer announcement: Gleb Kanterov

2019-01-25 Thread Tim Robertson
Welcome Gleb and congratulations!

On Fri, Jan 25, 2019 at 8:06 AM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer: Gleb
> Kanterov
>
> Gleb started contributing to Beam and quickly dove deep, doing some
> sensitive fixes to schemas, also general build issues, Beam SQL, Avro, and
> more. In consideration of Gleb's technical and community contributions, the
> Beam PMC trusts Gleb with the responsibilities of a Beam committer [1].
>
> Thank you, Gleb, for your contributions.
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-
> committer
>


Re: [ANNOUNCE] Apache Beam 2.9.0 released!

2018-12-14 Thread Tim
Thank you for running the release Chamikara.

Tim,
Sent from my iPhone

> On 14 Dec 2018, at 10:30, Matt Casters  wrote:
> 
> Great news! Congratulations!
> My experience venturing into the world of Apache Beam couldn't possibly have 
> been nicer.  Thank you to all involved!
> ---
> Matt
> 
> 
> On Fri, 14 Dec 2018 at 04:42, Chamikara Jayalath wrote:
>> The Apache Beam team is pleased to announce the release of version 2.9.0!
>> 
>> Apache Beam is an open source unified programming model to define and
>> execute data processing pipelines, including ETL, batch and stream
>> (continuous) processing. See https://beam.apache.org
>> 
>> You can download the release here:
>> 
>> https://beam.apache.org/get-started/downloads/
>> 
>> This release includes the following major new features & improvements. 
>> Please see the blog post for more details: 
>> https://beam.apache.org/blog/2018/12/13/beam-2.9.0.html
>> 
>> Thanks to everyone who contributed to this release, and we hope you enjoy 
>> using Beam 2.9.0.
>> -- Chamikara Jayalath, on behalf of The Apache Beam team


Re: [PROPOSAL] Prepare Beam 2.9.0 release

2018-11-15 Thread Tim
Thanks Cham
+1

> On 16 Nov 2018, at 05:30, Thomas Weise  wrote:
> 
> +1
> 
> 
>> On Thu, Nov 15, 2018 at 4:34 PM Ahmet Altay  wrote:
>> +1 Thank you.
>> 
>>> On Thu, Nov 15, 2018 at 4:22 PM, Kenneth Knowles  wrote:
>>> SGTM. Thanks for keeping track of the schedule.
>>> 
>>> Kenn
>>> 
 On Thu, Nov 15, 2018 at 1:59 PM Chamikara Jayalath  
 wrote:
 Hi All,
 
 According to the release calendar [1] branch cut date for Beam 2.9.0 
 release is 11/21/2018. Since previous release branch was cut close to the 
 respective calendar date I'd like to propose cutting release branch for 
 2.9.0 on 11/21/2018. Next week is Thanksgiving holiday in US and possibly 
 some folks will be out so we can try to produce RC1 on Monday after 
 (11/26/2018). We can attend to current blocker JIRAs [2] in the meantime. 
 
 I'd like to volunteer to perform this release.
 
 WDYT ?
 
 Thanks,
 Cham
 
 [1] 
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com=America%2FLos_Angeles
 [2] https://s.apache.org/beam-2.9.0-burndown
 
>> 


Re: Stackoverflow Questions

2018-11-05 Thread Tim Robertson
Thanks for raising this Anton


>  It would be very easy to forward new SO questions to the user@ list, or
> a new list if we're worried about the noise.


+1 (preference on user@ until there are too many)



On Mon, Nov 5, 2018 at 7:18 PM Scott Wegner  wrote:

> I like the idea of working to improve the our presence on Q sites like
> StackOverflow. SO is a great resource and much more discoverable /
> searchable than a mail archive.
>
> One idea on how to improve our presence: StackOverflow supports setting up
> email subscriptions [1] for particular tags. It would be very easy to
> forward new SO questions to the user@ list, or a new list if we're
> worried about the noise.
>
> [1] https://stackexchange.com/filters/new
>
> On Mon, Nov 5, 2018 at 9:54 AM Jean-Baptiste Onofré 
> wrote:
>
>> That's "classic" in the Apache projects. And yes, most of the time, we
>> periodically send or ask the dev to check the questions on other
>> channels like stackoverflow.
>>
>> It makes sense to send a reminder or a list of open questions on the
>> user mailing list (users can help each other too).
>>
>> Regards
>> JB
>>
>> On 05/11/2018 18:25, Anton Kedin wrote:
>> > Hi dev@,
>> >
>> > I was looking at stackoverflow questions tagged with `apache-beam` [1]
>> > and wanted to ask your opinion. It feels like it's easier for some users
>> > to ask questions on stackoverflow than on user@. Overall frequency
>> > between the two channels seems comparable but a lot of stackoverflow
>> > questions are not answered while questions on user@ get some attention
>> > most of the time. Would it make sense to increase dev@ visibility into
>> > stackoverflow, e.g. by sending periodic digest or some other way?
>> >
>> > [1] https://stackoverflow.com/questions/tagged/apache-beam
>> >
>> > Regards,
>> > Anton
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: [ANNOUNCE] New committer announcement, Euphoria edition

2018-11-01 Thread Tim
Congratulations and welcome!

Tim

> On 1 Nov 2018, at 17:06, Matthias Baetens  wrote:
> 
> Congrats David!!!
> 
>> On Thu, Nov 1, 2018, 16:04 Kenneth Knowles  wrote:
>> Hi all,
>> 
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> 
>>  - David Morávek: of, but not limited to, the new Euphoria API
>> 
>> Through his work with us merging the Euphoria API, community outreach, and 
>> other contributions to Beam, the PMC trusts David with the responsibilities 
>> of a Beam committer [1].
>> 
>> Kenn
>> 
>> [1] 
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>> 
> -- 
>  


Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-26 Thread Tim Robertson
A colleague and I tested on 2.7.0 and 2.8.0RC1:

1. Quickstart on Spark/YARN/HDFS (CDH 5.12.0) (commented in spreadsheet)
2. Our Avro to Avro pipelines on Spark/YARN/HDFS (note we backport the
un-merged BEAM-5036 fix in our code)
3. Our Avro to Elasticsearch pipelines on Spark/YARN/HDFS

Everything worked, and performance was similar on both.
We built using maven pointing at
https://repository.apache.org/content/repositories/orgapachebeam-1049/

Based on this limited testing: +1

Thank you to the release managers,
Tim


On Thu, Oct 25, 2018 at 7:21 PM Tim  wrote:

> I can do some tests on Spark / YARN tomorrow (CEST timezone). Sorry I’ve
> just been too busy to assist.
>
> Tim
>
> On 25 Oct 2018, at 18:59, Kenneth Knowles  wrote:
>
> I tried to do a more thorough job on this.
>
>  - I could not reproduce the slowdown in Query 9. I believe the variance
> was simply high given the parameters and environment
>  - I saw the same slowdown in Query 8 when running as part of the suite,
> but it vanished when I ran repeatedly on its own, so again it is not good
> methodology probably
>
> We do have the dashboard at
> https://apache-beam-testing.appspot.com/dashboard-admin though no anomaly
> detection set up AFAIK.
>
>  - There is no issue easily visible in DirectRunner:
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
>  - There is a notable degradation in Spark runner on 10/5 for many
> queries.
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
>  - Something minor happened for Dataflow around 10/1:
> https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
>  - Flink runner seems to have had some fantastic improvements :-)
> https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384
>
> So if there is a blocker it would really be the Spark runner perf changes.
> Of course, all these except Dataflow are using local instances so may not
> be representative of larger scale AFAIK.
>
> Kenn
>
> On Wed, Oct 24, 2018 at 9:48 AM Maximilian Michels  wrote:
>
>> I've run WordCount using Quickstart with the FlinkRunner (locally and
>> against a Flink cluster).
>>
>> Would give a +1 but waiting what Kenn finds.
>>
>> -Max
>>
>> On 23.10.18 07:11, Ahmet Altay wrote:
>> >
>> >
>> > On Mon, Oct 22, 2018 at 10:06 PM, Kenneth Knowles > > <mailto:k...@apache.org>> wrote:
>> >
>> > You two did so much verification I had a hard time finding something
>> > where my help was meaningful! :-)
>> >
>> > I did run the Nexmark suite on the DirectRunner against 2.7.0 and
>> > 2.8.0 following
>> >
>> https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local
>> > <
>> https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local
>> >.
>> >
>> > It is admittedly a very silly test - the instructions leave
>> > immutability enforcement on, etc. But it does appear that there is a
>> > 30% degradation in query 8 and 15% in query 9. These are the pure
>> > Java tests, not the SQL variants. The rest of the queries are close
>> > enough that differences are not meaningful.
>> >
>> >
>> > (It would be a good improvement for us to have alerts on daily
>> > benchmarks if we do not have such a concept already.)
>> >
>> >
>> > I would ask a little more time to see what is going on here - is it
>> > a real performance issue or an artifact of how the tests are
>> > invoked, or ...?
>> >
>> >
>> > Thank you! Much appreciated. Please let us know when you are done with
>> > your investigation.
>> >
>> >
>> > Kenn
>> >
>> > On Mon, Oct 22, 2018 at 6:20 PM Ahmet Altay > > <mailto:al...@google.com>> wrote:
>> >
>> > Hi all,
>> >
>> > Did you have a chance to review this RC? Between me and Robert
>> > we ran a significant chunk of the validations. Let me know if
>> > you have any questions.
>> >
>> > Ahmet
>> >
>> > On Thu, Oct 18, 2018 at 5:26 PM, Ahmet Altay > > <mailto:al...@google.com>> wrote:
>> >
>> > Hi everyone,
>> >
>> > Please review and vote on the release candidate #1 for the
>> > version 2.8.0, as follows:
> [ ] +1, Approve the release

Re: [VOTE] Release 2.8.0, release candidate #1

2018-10-25 Thread Tim
I can do some tests on Spark / YARN tomorrow (CEST timezone). Sorry I’ve just 
been too busy to assist.

Tim

> On 25 Oct 2018, at 18:59, Kenneth Knowles  wrote:
> 
> I tried to do a more thorough job on this.
> 
>  - I could not reproduce the slowdown in Query 9. I believe the variance was 
> simply high given the parameters and environment
>  - I saw the same slowdown in Query 8 when running as part of the suite, but 
> it vanished when I ran repeatedly on its own, so again it is not good 
> methodology probably
> 
> We do have the dashboard at 
> https://apache-beam-testing.appspot.com/dashboard-admin though no anomaly 
> detection set up AFAIK.
> 
>  - There is no issue easily visible in DirectRunner: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5084698770407424
>  - There is a notable degradation in Spark runner on 10/5 for many queries. 
> https://apache-beam-testing.appspot.com/explore?dashboard=5138380291571712
>  - Something minor happened for Dataflow around 10/1: 
> https://apache-beam-testing.appspot.com/explore?dashboard=5670405876482048
>  - Flink runner seems to have had some fantastic improvements :-) 
> https://apache-beam-testing.appspot.com/explore?dashboard=5699257587728384
> 
> So if there is a blocker it would really be the Spark runner perf changes. Of 
> course, all these except Dataflow are using local instances so may not be 
> representative of larger scale AFAIK.
> 
> Kenn
> 
>> On Wed, Oct 24, 2018 at 9:48 AM Maximilian Michels  wrote:
>> I've run WordCount using Quickstart with the FlinkRunner (locally and 
>> against a Flink cluster).
>> 
>> Would give a +1 but waiting what Kenn finds.
>> 
>> -Max
>> 
>> On 23.10.18 07:11, Ahmet Altay wrote:
>> > 
>> > 
>> > On Mon, Oct 22, 2018 at 10:06 PM, Kenneth Knowles > > <mailto:k...@apache.org>> wrote:
>> > 
>> > You two did so much verification I had a hard time finding something
>> > where my help was meaningful! :-)
>> > 
>> > I did run the Nexmark suite on the DirectRunner against 2.7.0 and
>> > 2.8.0 following
>> > 
>> > https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local
>> > 
>> > <https://beam.apache.org/documentation/sdks/java/nexmark/#running-smoke-suite-on-the-directrunner-local>.
>> > 
>> > It is admittedly a very silly test - the instructions leave
>> > immutability enforcement on, etc. But it does appear that there is a
>> > 30% degradation in query 8 and 15% in query 9. These are the pure
>> > Java tests, not the SQL variants. The rest of the queries are close
>> > enough that differences are not meaningful.
>> > 
>> > 
>> > (It would be a good improvement for us to have alerts on daily 
>> > benchmarks if we do not have such a concept already.)
>> > 
>> > 
>> > I would ask a little more time to see what is going on here - is it
>> > a real performance issue or an artifact of how the tests are
>> > invoked, or ...?
>> > 
>> > 
>> > Thank you! Much appreciated. Please let us know when you are done with 
>> > your investigation.
>> > 
>> > 
>> > Kenn
>> > 
>> > On Mon, Oct 22, 2018 at 6:20 PM Ahmet Altay > > <mailto:al...@google.com>> wrote:
>> > 
>> > Hi all,
>> > 
>> > Did you have a chance to review this RC? Between me and Robert
>> > we ran a significant chunk of the validations. Let me know if
>> > you have any questions.
>> > 
>> > Ahmet
>> > 
>> > On Thu, Oct 18, 2018 at 5:26 PM, Ahmet Altay > > <mailto:al...@google.com>> wrote:
>> > 
>> > Hi everyone,
>> > 
>> > Please review and vote on the release candidate #1 for the
>> > version 2.8.0, as follows:
>> > [ ] +1, Approve the release
>> > [ ] -1, Do not approve the release (please provide specific
>> > comments)
>> > 
>> > The complete staging area is available for your review,
>> > which includes:
>> > * JIRA release notes [1],
>> > * the official Apache source release to be deployed to
>> > dist.apache.org <http://dist.apache.org> [2], which is
>> > signed with the key w

Re: [ANNOUNCE] New committers & PMC members, Summer 2018 edition

2018-10-17 Thread Tim Robertson
>
> Great to see the community growing!


Indeed - congratulations to everyone and from my side, thank you for the
recognition. I'm looking forward to working more closely with you all.



On Wed, Oct 17, 2018 at 11:39 AM Maximilian Michels  wrote:

> Great to see the community growing!
>
> On 16.10.18 18:20, Scott Wegner wrote:
> > Congrats all! And thanks Kenn and the PMC for recognizing these
> > contributions.
> >
> > On Mon, Oct 15, 2018 at 9:45 AM Kenneth Knowles  > <mailto:k...@apache.org>> wrote:
> >
> > Hi all,
> >
> > Since our last announcement in May, we have added many more
> > committers and a new PMC member. Some of these may have been in the
> > monthly newsletter or mentioned elsewhere, but I wanted to be sure
> > to have a loud announcement on the list directly.
> >
> > Please join me in belatedly welcoming...
> >
> > New PMC member: Thomas Weise
> >   - Author of the ApexRunner, the first additional runner after Beam
> > began incubation.
> >   - Recently heavily involved in Python-on-Flink efforts.
> >   - Outside his contributions to Beam, Thomas is PMC chair for
> > Apache Apex.
> >
> > New committers:
> >
> >   - Charles Chen, longtime contributor to Python SDK, Python direct
> > runner, state & timers
> >   - Łukasz  Gajowy, testing infrastructure, benchmarks, build system
> > improvements
> >   - Anton Kedin, contributor to SQL and schemas, helper on
> StackOverflow
> >   - Andrew Pilloud, contributor to SQL, very active on dev@, infra
> > and release help
> >   - Tim Robertson, contributor to many IOs, major code health work
> >   - Alexey Romanenko, contributor to many IOs, Nexmark benchmarks
> >   - Henning Rohde, contributor to Go SDK, incl. ip fun, and
> > portability protos and design
> >   - Scott Wegner, one of our longest contributors, major infra
> > improvements
> >
> > And while I've noted some areas of contribution for each, most
> > importantly everyone on this list is a valued member of the Beam
> > community that the PMC trusts with the responsibilities of a Beam
> > committer [1].
> >
> > A big thanks to all for their contributions.
> >
> > Kenn
> >
> > [1]
> >
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> >
> >
> >
> > --
> >
> >
> >
> >
> > Got feedback? tinyurl.com/swegner-feedback
> > <https://tinyurl.com/swegner-feedback>
>


Re: [DISCUSS] Beam public roadmap

2018-10-12 Thread Tim Robertson
Thanks Kenn,

I think this is a very good idea.

My preference would be part of the website and not on a wiki. Those who
need to contribute can do so easily and I find wikis often get
messy/stale/overwhelming. The website will also mean that we can use dev@
and Jira to track, discuss and help agree upon the roadmap content in a
more controlled manner than a wiki which can change without notification.

I find it difficult to provide input on style / format without mentioning
what might be on it I'm afraid.

- I'd favour a short concise read (7 mins?) with links out to Jiras for
more detail and to help show transparent progress

- Potential users currently observing the project are a very important
audience IMO (on-premise Hadoop users, enterprise users seeking Kerberos
support, AWS cloud users etc). Might it help for us to identify the
audiences the roadmap is intended for, to help steer the style?

Tim


On Fri, Oct 12, 2018 at 6:35 PM Kenneth Knowles  wrote:

> Personally, I think cwiki is best for dev community, while important stuff
> for users should go on the web site. But experimenting with the content on
> cwiki seems like a quick and easy thing to try out.
>
> On Fri, Oct 12, 2018 at 1:43 AM Maximilian Michels  wrote:
>
>> Great idea, Kenn!
>>
>> How about putting the roadmap in the Confluence wiki? We can link the
>> page from the web site.
>>
>> The timeline should not be too specific but should give users an idea of
>> what to expect.
>>
>> On 10.10.18 22:43, Romain Manni-Bucau wrote:
>> > What about a link in the menu? It should contain a list of features and
>> > estimated dates with a probable error (like "in 5 months +- 1 month"),
>> > otherwise it does not bring much IMHO.
>> >
>> > On Wed, 10 Oct 2018 at 23:32, Kenneth Knowles > > <mailto:k...@apache.org>> wrote:
>> >
>> > Hi all,
>> >
>> > We made an attempt at putting together a sort of roadmap [1] in the
>> > past and also some wide-ranging threads about what could be on it
>> > [2]. and I think we should pick it up again. The description I
>> > really liked was "strategic and user impacting initiatives (ongoing
>> > and future) in an easy to consume format" [3]. It seems that we had
>> > feedback asking for a Roadmap at the London summit [4].
>> >
>> > I would like to first focus on meta-questions rather than what would
>> > be on it:
>> >
>> >   - What style / format should it have to be most useful for users?
>> >   - Where should it be presented?
>> >
>> > I asked a couple people to try to find the roadmap on the web site,
>> > as a test, and they didn't really know which tab to click on first,
>> > so that's a starting problem. They didn't even find Works In
>> > Progress [5] after clicking Contribute. The level of detail of that
>> > list varies widely.
>> >
>> > I'd also love to see hypothetical formats for it, to see how to
>> > balance pithiness with crucial details.
>> >
>> > Kenn
>> >
>> > [1]
>> >
>> https://lists.apache.org/thread.html/4e1fffa2fde8e750c6d769bf4335853ad05b360b8bd248ad119cc185@%3Cdev.beam.apache.org%3E
>> > [2]
>> >
>> https://lists.apache.org/thread.html/f750f288af8dab3f468b869bf5a3f473094f4764db419567f33805d0@%3Cdev.beam.apache.org%3E
>> > [3]
>> >
>> https://lists.apache.org/thread.html/60d0333fd9e2c7be2f55e33b0d145f2908e3fe645c008636c86e1133@%3Cdev.beam.apache.org%3E
>> > [4]
>> >
>> https://lists.apache.org/thread.html/aa1306da25029dff12a49ba3ce63f2caf6a5f8ba73eda879c8403f3f@%3Cdev.beam.apache.org%3E
>> >
>> > [5] https://beam.apache.org/contribute/#works-in-progress
>> >
>>
>


Re: [DISCUSS] Gradle for the build ?

2018-10-10 Thread Tim Robertson
Thank you JB for starting this discussion.

Others comment on many of these points far better than I can, but my
experience is similar to JB's.

1. IDEA integration (and my laptop slowing like crazy) is the biggest
contributor to my feeling of being unproductive
2. Not knowing the correct way to modify the build scripts, which I put down
to my own limitations

It seems we also need to help build Gradle expertise in our community, so
> that those that are motivated are empowered to contribute.


Nicely phrased. +1



On Wed, Oct 10, 2018 at 7:15 PM Scott Wegner  wrote:

> > Perhaps we should go through and prioritize (and add missing items to)
> BEAM-4045
>
> +1. It's hard to know where to start when there's such a laundry list of
> tasks. If you're having build issues, will you make sure it is represented
> in BEAM-4045, and "Vote" for the issues that you believe are the highest
> priority?
>
> I agree that the Gradle build is far from perfect (my top gripes are IDE
> integration and parallel/incremental build support). I believe that we're
> capable of making our build great, and continuing our investment in Gradle
> would be a shorter path than changing course again. Remember that our Maven
> build also had it's share of issues, which is why we as a community voted
> to replace it [1][2].
>
> It seems we also need to help build Gradle expertise in our community, so
> that those that are motivated are empowered to contribute. Does anybody
> have a good "Getting Started with Gradle" guide they recommend? Perhaps we
> could also link to it from the website/wiki.
>
> [1]
> https://lists.apache.org/thread.html/225dddcfc78f39bbb296a0d2bbef1caf37e17677c7e5573f0b6fe253@%3Cdev.beam.apache.org%3E
> [2]
> https://lists.apache.org/thread.html/bd399ecb17cd211be7c6089b562c09ba9116649c9eabe3b609606a3b@%3Cdev.beam.apache.org%3E
>
> On Wed, Oct 10, 2018 at 2:40 AM Robert Bradshaw 
> wrote:
>
>> Some rough stats (because I was curious): The gradle files have been
>> edited by ~79 unique contributors over 696 distinct commits, whereas the
>> maven ones were edited (over a longer time period) by ~130 unique
>> contributors over 1389 commits [1]. This doesn't capture how much effort
>> was put into these edits, but neither is restricted to a small set of
>> experts.
>>
>> Regarding "friendly for other languages" I don't think either is
>> necessarily easy to learn, but my impression is that the maven learning
>> curve is shallower for those already firmly embedded in the Java ecosystem
>> (perhaps due to leveraging existing familiarity, and perhaps some due to
>> the implicit java-centric conventions that maven assumed about your
>> project), whereas with gradle at least I could keep pulling on the string
>> to unwind things to the bottom. The "I just want to build/test X without
>> editing/viewing the build files" seemed more natural with Gradle (e.g. I
>> can easily list all tasks).
>>
>> That being said, I don't think everyone needs to understand the full
>> build system. It's important that there be a critical mass that do (we have
>> that for both, and if we can simplify to improve this that'd be great),
>> it's easy enough to do basic changes (e.g. add a dependency, again I don't
>> think the barrier is sufficiently different for either), and works well out
>> of the box for someone who just wants to look up a command on the website
>> and edit code (the CLI is an improvement with Gradle, but it's clear that
>> (java) IDE support is a significant regression).
>>
>> Personally, I don't know much about IDE configuration (admittedly the
>> larger issue), but one action item I can take on is trying to eliminate the
>> need to do a "git clean" after building certain targets (assuming I can
>> reproduce this).
>>
>> Perhaps we should go through and prioritize (and add missing items to)
>> BEAM-4045
>> https://issues.apache.org/jira/issues/?jql=parent%20%3D%20BEAM-4045%20ORDER%20BY%20priority%20DESC
>> ? There's always a long tail with this kind of thing, and looking at the
>> whole list can be daunting, but putting it in the correct order and
>> knocking off the top N items could possibly go a long way.
>>
>> - Robert
>>
>> [1] The commands I ran were (with and without the uniq)
>>
>> $ find . -name 'build.gradle' | xargs git log | grep Author: | grep -o
>> '[^< ]*@' | sort | uniq | wc
>> $ find . -name 'pom.xml' | xargs git log | grep Author: | grep -o '[^<
>> ]*@' | sort | uniq | wc
>>
>> On Wed, Oct 10, 2018 at 10:31 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi all,
>>> I must admit that I agree on the status especially regarding 2 points:
>>> 1. new contributor obstacles: the gradle learning curve might be too long
>>> for spare-time contributors, and a complex scripted build takes time to
>>> understand compared to a self-descriptive one.
>>> 2. IDE integration kind of slows down development.
>>>
>>> Now, regarding how we improve the situation, I think we need to discuss
>>> and identify tasks and tackle them all together even if 

Re: 2.7.0 release notes inconsistent with released code

2018-10-08 Thread Tim Robertson
Thanks Andrew - and sorry folks, that was simply me doing a bad email
search as I looked into the issue. I don't think there is much more to
communicate.

I just corrected the label in Jira and I see the release notes are updated
automatically - I'm not sure how we can include a note saying they were
adjusted though, as they seem fully dynamic. I presume it is enough as it is?






On Mon, Oct 8, 2018 at 6:39 PM Thomas Weise  wrote:

> As I understand, Tim's concern is the accuracy of the release notes,
> and +1 for correcting them. In the end it does not matter that much to the
> users when a release was proposed to be cut vs. when it actually happened,
> but what they get with the release.
>
> Perhaps between the contributors we could, going forward, just mention as of
> what commit the actual release occurred? Ultimately it is
> reviewer/contributor responsibility that issues they work on are marked
> with the correct fix version. There was some related discussion on when to
> set the fix version on issues. I think that it is best to defer that until
> the PR is merged and the issue resolved, to avoid confusion.
>
>
> On Mon, Oct 8, 2018 at 9:11 AM Andrew Pilloud  wrote:
>
>> Keep reading the proposal thread and you'll find: "We should follow the
>> calendar and aim to cut on 8/29, not 9/7 as I incorrectly wrote earlier."
>> There were several folowup emails in the thread with reminders of the
>> release cut date. Is there something we should do to better communicate
>> release cuts in the future?
>>
>>
>> https://lists.apache.org/thread.html/c4da2a5594d22121b5864662e64a027148993b5e0187ce5beda2714e@%3Cdev.beam.apache.org%3E
>>
>> Andrew
>>
>> On Mon, Oct 8, 2018 at 3:17 AM Tim Robertson 
>> wrote:
>>
>>> Hi folks
>>>
>>> Our release notes [1] for 2.7.0 say that Beam supports Elasticsearch 6 (
>>> BEAM-5107 <https://issues.apache.org/jira/browse/BEAM-5107>). The 2.7.0
>>> code [2] however does not seem to, while master does [3]. The PR [4] was
>>> merged on the 6th September and in the 2.7.0 chat I see that Charles
>>> announced:
>>>
>>> I will cut the initial 2.7.0 release branch on September 7.
>>>
>>> Is this a case of unfortunate timing (was it cut early?) and we just
>>> overlooked cherry-picking that commit, do you think?
>>>
>>> Do we correct release notes when mistakes are spotted?
>>>
>>> Thanks,
>>> Tim
>>>
>>> [1]
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343654
>>> [2]
>>> https://github.com/apache/beam/blob/release-2.7.0/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1243
>>> [3]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1268
>>> [4] https://github.com/apache/beam/pull/6211
>>>
>>


2.7.0 release notes inconsistent with released code

2018-10-08 Thread Tim Robertson
Hi folks

Our release notes [1] for 2.7.0 say that Beam supports Elasticsearch 6 (
BEAM-5107 <https://issues.apache.org/jira/browse/BEAM-5107>). The 2.7.0
code [2] however does not seem to, while master does [3]. The PR [4] was
merged on the 6th September and in the 2.7.0 chat I see that Charles
announced:

I will cut the initial 2.7.0 release branch on September 7.

Is this a case of unfortunate timing (it was cut early?) and we just
overlooked cherry-picking that commit, do you think?

Do we correct release notes when mistakes are spotted?

Thanks,
Tim

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343654
[2]
https://github.com/apache/beam/blob/release-2.7.0/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1243
[3]
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L1268
[4] https://github.com/apache/beam/pull/6211


Re: [PROPOSAL] Prepare Beam 2.8.0 release

2018-10-04 Thread Tim Robertson
I was in the middle of writing something similar when Ismaël posted.

Please do bear in mind that this is an international project and 7hrs is
not long enough to decide upon something that affects us all.

+1 on cutting 2.8.0 on 10/10 and thank you for pushing it forward

-1 on designating it as LTS:
While LTS is a statement of expectation in maintenance it also carries an
element of trust. I propose we should have a separate discussion about what
we might like to collectively achieve before announcing our first LTS
edition.
My concern stems from usability and first impressions - for example:
- Beam has real issues with HDFS today (BEAM-5036), which I propose as a
blocker for announcing an LTS
- DirectRunner and the inability to run basic pipelines on a few GB of data
is *really* putting people off our project - we might consider exploring
that as it affects our "brand"




On Thu, Oct 4, 2018 at 11:18 AM Ismaël Mejía  wrote:

> Hello,
>
> Thanks Ahmet for volunteering to do the release, and proposing this as an
> LTS.
>
> I have still some questions on our LTS policies (which may have
> consequences on the discussed release):
>
> What are the expected implications of upgrades in the LTS? E.g. if a
> connector, let’s say Kafka, is released using the 1.0 dependency, can it
> be moved upwards in an LTS to version 2.0, or will this be considered a
> breaking change so that we should only move in minor versions? Will this
> rule be more relaxed, for example, for all cloud-based dependencies
> (GCP, AWS), e.g. when there is a security issue or a
> correctness/performance improvement?
>
> Given that this will last for a year maybe we should raise some of the
> dependencies to the latest versions. Following the recent discussion
> on dependencies that cannot be ‘automatically’ updated because of end
> user consequences, I still think about what we should do with
> (probably related to the previous paragraph):
>
> - Should we move Flink to 1.6.x, considering that 1.5.x won’t be
> maintained in less than 6 months?
> - Should we wait and upgrade Spark to version 2.4.0 (which is being
> voted on at this moment and not yet released, but could make sense for an
> LTS), or just stay on 2.3.x? Spark is less of an issue because it is a
> provided dep, but it is still worth considering.
> - Should we update the IO connector dependencies to the latest stable
> versions where they aren’t, e.g. Elasticsearch, HBase?
>
> Of course the goal is not a last minute rush to do this so it fits in
> the LTS release, but to see that for LTS we may consider the ‘lasting
> consequences'.
>
> One last comment, next time we discuss a proposal please ensure that
> we wait at least 24h to reach conclusions or proceed, otherwise this
> will exclude opinions from people who are not in the right time zone
> (this is the reason why votes last 72h, to ensure that everyone may be
> aware of what is being voted on). This is not a mandatory requirement, but
> agreeing on an LTS in 7h seems a bit short.
> On Thu, Oct 4, 2018 at 1:36 AM Ahmet Altay  wrote:
> >
> > Great. I will do the cut on 10/10.
> >
> > Let's start by triaging the open issues targeted for 2.8.0 [1]. If you
> have any issues in this list please resolve them or move to the next
> release. If you are aware of any critical issues please add to this list.
> >
> > Ahmet
> >
> > [1]
> https://issues.apache.org/jira/browse/BEAM-5456?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%202.8.0%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
> >
> > > +1 for the 2.7.0 release schedule. Thanks for volunteering. Do we want
> a standing owner for the LTS branch (like the Linux kernel has) or will we
> just take volunteers for each LTS release as they arise?
> >
> > We have not thought about this before. IMO, it is better to keep things
> simple and use the same process (i.e. "we just take volunteers for each LTS
> release as they arise") for patch releases in the future if/when we happen
> to need those.
> >
> >
> > On Wed, Oct 3, 2018 at 1:21 PM, Thomas Weise  wrote:
> >>
> >> +1
> >>
> >> On Wed, Oct 3, 2018 at 12:33 PM Ted Yu  wrote:
> >>>
> >>> +1
> >>>
> >>> On Wed, Oct 3, 2018 at 9:52 AM Jean-Baptiste Onofré 
> wrote:
> 
>  +1
> 
>  but we have to be fast in release process. 2.7.0 took more than 1
> month
>  to be cut !
> 
>  If no blocker, we have to just move forward.
> >
> >
> > +1
> >
> 
> 
>  Regards
>  JB
> 
>  On 03/10/2018 18:25, Ahmet Altay wrote:
>  > Hi all,
>  >
>  > Release cut date for the next release is 10/10 according to Beam
> release
>  > calendar [1]. Since the previous release is already mostly wrapped
> up
>  > (modulo blog post), I would like to propose starting the next
> release on
>  > time (10/10).
>  >
>  > Additionally I propose designating this release as the first
>  > long-term-support (LTS) release [2]. This should have no impact on
> the
>  > release process, however it would mean 

Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-28 Thread Tim Robertson
Thanks for sharing those results.

The second set (executors at 20-30) looks similar to what I would have
expected.
BEAM-5036 definitely plays a part here as the data is not moved on HDFS
efficiently (fix in PR awaiting review now [1]).

To give an idea of the impact, here are some numbers from my own tests.
Without knowing your code, I presume mine is similar to your filter (take
data, modify it, write data with no shuffle/group/join)

My environment: 10 node YARN CDH 5.12.2 cluster, rewriting a 1.5TB AvroIO
file (code here [2]) I observed:

  - Using Spark API: 35 minutes
  - Beam AvroIO (2.6.0): 1.7hrs
  - Beam AvroIO with the 5036 fix: 42 minutes

Related: I also anticipate that varying the spark.default.parallelism will
affect Beam runtime.

Thanks,
Tim


[1] https://github.com/apache/beam/pull/6289
[2] https://github.com/gbif/beam-perf/tree/master/avro-to-avro


On Fri, Sep 28, 2018 at 9:27 AM Robert Bradshaw  wrote:

> Something here on the Beam side is clearly linear in the input size, as if
> there's a bottleneck where we're not able to get any parallelization. Is
> the spark variant running in parallel?
>
> On Fri, Sep 28, 2018 at 4:57 AM devinduan(段丁瑞) 
> wrote:
>
>> Hi
>> I have completed my test.
>> 1. Spark parameter :
>> deploy-mode client
>> executor-memory 1g
>> num-executors 1
>> driver-memory 1g
>>
>> WordCount:
>>
>>            300MB     600MB     1.2G
>>   Spark    1min8s    1min11s   1min18s
>>   Beam     6.4min    11min     22min
>>
>> Filter:
>>
>>            300MB     600MB     1.2G
>>   Spark    1.2min    1.7min    2.8min
>>   Beam     2.7min    4.1min    5.7min
>>
>> GroupbyKey + sum:
>>
>>            300MB                  600MB   1.2G
>>   Spark    3.6min                 -       -
>>   Beam     Failed, executor OOM   -       -
>>
>> Union:
>>
>>            300MB     600MB     1.2G
>>   Spark    1.7min    2.6min    5.1min
>>   Beam     3.6min    6.2min    11min
>>
>>
>>
>> 2. Spark parameter :
>>
>> deploy-mode client
>>
>> executor-memory 1g
>>
>> driver-memory 1g
>>
>> spark.dynamicAllocation.enabled  true
>>
>


Re: Compatibility Matrix vs Runners in the code base

2018-09-21 Thread Tim Robertson
"what do you think about limiting the matrix to Runners in the Beam code
base"

+1, but perhaps we should have a table listing Runners under development
like we do for IOs.

As a concrete example, we have MapReduce listed in the matrix [1] and a page
documenting it [2] stating it is in Beam 2.6.0, but unless I'm mistaken the
code exists only on a branch [3] and hasn't been touched for a while.

Thanks,
Tim

[1] https://beam.apache.org/documentation/runners/capability-matrix/
[2] https://beam.apache.org/documentation/runners/mapreduce/
[3] https://github.com/apache/beam/tree/mr-runner

On Fri, Sep 21, 2018 at 1:37 PM Jean-Baptiste Onofré 
wrote:

> Hi Max,
>
> not sure I fully follow you there. You mean that we would have a kind of
> compatibility matrix on the dedicated page of each runner?
>
> Regards
> JB
>
> On 21/09/2018 10:57, Maximilian Michels wrote:
> > Hi Beamers,
> >
> > There have been occasions where people asked me about Runner XY and I
> > had to find out that it only exists in the compatibility matrix, but not
> > as part of our code base. More interestingly, I couldn't even find its
> > code or documentation via my favorite search engine.
> >
> > This seems to be the case for multiple Runners in the matrix.
> >
> > The compatibility matrix will need an overhaul anyways with the
> > portability changes, but what do you think about limiting the matrix to
> > Runners in the Beam code base?
> >
> > Thanks,
> > Max
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [ANNOUNCEMENT] New Beam chair: Kenneth Knowles

2018-09-20 Thread Tim Robertson
Thank you to Davor all the PMC - I can only imagine how much work it has
been to get Beam to where it is today.

Congratulations Kenn!

On Thu, Sep 20, 2018 at 1:05 AM Tyler Akidau  wrote:

> Thanks Davor, and congrats Kenn!
>
> -Tyler
>
> On Wed, Sep 19, 2018 at 2:43 PM Yifan Zou  wrote:
>
>> Congratulations Kenn!
>>
>> On Wed, Sep 19, 2018 at 2:36 PM Robert Burke  wrote:
>>
>>> Congrats Kenn! :D
>>>
>>> On Wed, Sep 19, 2018, 2:21 PM Ismaël Mejía  wrote:
>>>
 Congratulations and welcome Kenn as new chair!
 Thanks Davor for your hard work too.

 On Wed, Sep 19, 2018 at 11:14 PM Rui Wang  wrote:

> Congrats!
>
> -Rui
>
> On Wed, Sep 19, 2018 at 2:12 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Congrats!
>>
>> On Wed, Sep 19, 2018 at 2:05 PM Ahmet Altay  wrote:
>>
>>> Congratulations, Kenn! And thank you Davor.
>>>
>>> On Wed, Sep 19, 2018 at 1:44 PM, Anton Kedin 
>>> wrote:
>>>
 Congrats!

 On Wed, Sep 19, 2018 at 1:36 PM Ankur Goenka 
 wrote:

> Congrats Kenn!
>
> On Wed, Sep 19, 2018 at 1:35 PM Amit Sela 
> wrote:
>
>> Well deserved! Congrats Kenn.
>>
>> On Wed, Sep 19, 2018 at 4:25 PM Kai Jiang 
>> wrote:
>>
>>> Congrats, Kenn!
>>>
>>> On Wed, Sep 19, 2018 at 1:23 PM Alan Myrvold <
>>> amyrv...@google.com> wrote:
>>>
 Congrats, Kenn.

 On Wed, Sep 19, 2018 at 1:08 PM Maximilian Michels <
 m...@apache.org> wrote:

> Congrats!
>
> On 19.09.18 22:07, Robin Qiu wrote:
> > Congratulations, Kenn!
> >
> > On Wed, Sep 19, 2018 at 1:05 PM Lukasz Cwik <
> lc...@google.com
> > > wrote:
> >
> > Congrats Kenn.
> >
> > On Wed, Sep 19, 2018 at 12:54 PM Davor Bonaci <
> da...@apache.org
> > > wrote:
> >
> > Hi everyone --
> > It is with great pleasure that I announce that at
> today's
> > meeting of the Foundation's Board of Directors, the
> Board has
> > appointed Kenneth Knowles as the second chair of the
> Apache Beam
> > project.
> >
> > Kenn has served on the PMC since its inception, and
> is very
> > active and effective in growing the community. His
> exemplary
> > posts have been cited in other projects. I'm super
> happy to have
> > Kenn accepted the nomination, and I'm confident that
> he'll serve
> > with distinction.
> >
> > As for myself, I'm not going anywhere. I'm still
> around and will
> > be as active as I have recently been. Thrilled to be
> able to
> > pass the baton to such a key member of this
> community and to
> > have less administrative work to do ;-).
> >
> > Please join me in welcoming Kenn to his new role,
> and I ask that
> > you support him as much as possible. As always,
> please let me
> > know if you have any questions.
> >
> > Davor
> >
>

>>>


Re: Re: How to optimize the performance of Beam on Spark(Internet mail)

2018-09-19 Thread Tim Robertson
Thank you Devin

Can you also please try Beam with more spark executors if you are able?

On Wed, Sep 19, 2018 at 10:47 AM devinduan(段丁瑞) 
wrote:

> Thanks for your help!
> I will test other examples of Beam On Spark in the future and then feed
> back the results.
> Regards
> devin
>
>
> *From:* Jean-Baptiste Onofré 
> *Date:* 2018-09-19 16:32
> *To:* devinduan(段丁瑞) ; dev 
> *Subject:* Re: How to optimize the performance of Beam on Spark(Internet
> mail)
>
> Thanks for the details.
>
> I will take a look later tomorrow (I have another issue to investigate
> on the Spark runner today for Beam 2.7.0 release).
>
> Regards
> JB
>
> On 19/09/2018 08:31, devinduan(段丁瑞) wrote:
> > Hi,
> > I test 300MB data file.
> > Use command like:
> > ./spark-submit --master yarn --deploy-mode client  --class
> > com.test.BeamTest --executor-memory 1g --num-executors 1 --driver-memory
> 1g
> >
> >  I set only one executor, so tasks run in sequence. One task costs 10s.
> > However, a Spark task costs only 0.4s
> >
> >
> >
> > *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net
> >
> > *Date:* 2018-09-19 12:22
> > *To:* dev@beam.apache.org <mailto:dev@beam.apache.org
> >
> > *Subject:* Re: How to optimize the performance of Beam on
> > Spark(Internet mail)
> >
> > Hi,
> >
> > did you compare the stages in the Spark UI in order to identify which
> > stage is taking time ?
> >
> > You use spark-submit in both cases for the bootstrapping ?
> >
> > I will do a test here as well.
> >
> > Regards
> > JB
> >
> > On 19/09/2018 05:34, devinduan(段丁瑞) wrote:
> > > Hi,
> > > Thanks for you reply.
> > > Our team plans to use Beam instead of Spark, so I'm testing the
> > > performance of the Beam API.
> > > I'm coding some examples with the Spark API and the Beam API, like
> > > "WordCount", "Join", "OrderBy", "Union" ...
> > > I use the same resources and configuration to run these jobs.
> > > Tim said I should remove "withNumShards(1)" and
> > > set spark.default.parallelism=32. I did it and tried again, but the
> > > Beam job still runs very slowly.
> > > Here is My Beam code and Spark code:
> > >Beam "WordCount":
> > >
> > >Spark "WordCount":
> > >
> > >I will try the other example later.
> > >
> > > Regards
> > > devin
> > >
> > >
> > > *From:* Jean-Baptiste Onofré <mailto:j...@nanthrax.net
> >
> > > *Date:* 2018-09-18 22:43
> > > *To:* dev@beam.apache.org <mailto:dev@beam.apache.org
> >
> > > *Subject:* Re: How to optimize the performance of Beam on
> > > Spark(Internet mail)
> > >
> > > Hi,
> > >
> > > The first huge difference is the fact that the spark runner
> > still uses
> > > RDD whereas directly using spark, you are using dataset. A
> > bunch of
> > > optimization in spark are related to dataset.
> > >
> > > I started a large refactoring of the spark runner to leverage
> > Spark 2.x
> > > (and dataset).
> > > It's not yet ready as it includes other improvements (the
> > portability
> > > layer with Job API, a first check of state API, ...).
> > >
> > > Anyway, by Spark wordcount, you mean the one included in the
> spark
> > > distribution ?
> > >
> > > Regards
> > > JB
> > >
> > > On 18/09/2018 08:39, devinduan(段丁瑞) wrote:
> > > > Hi,
> > > > I'm testing Beam on Spark.
> > > > I use spark example code WordCount processing 1G data
> > file, cost 1
> > > > minutes.
> > > > However, I use Beam example code WordCount processing
> > the same
> > > file,
> > > > cost 30minutes.
> > > > My Spark parameter is :  --deploy-mode client
> > >  --executor-memory 1g
> > > > --num-executors 1 --driver-memory 1g
> > > > My Spark version is 2.3.1,  Beam version is 2.5
> > > > Is there any optimization method?
> > > > Thank you.
> > > >
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>


Re: How to optimize the performance of Beam on Spark

2018-09-18 Thread Tim Robertson
Hi devinduan

The known issues Robert links there are actually HDFS related and not
specific to Spark.  The improvement we're seeking is that the final copy of
the output file can be optimised by using a "move" instead of a "copy", and I
expect to have it fixed for Beam 2.8.0. On a small dataset like this
though, I don't think it will impact performance too much.

Can you please elaborate on your deployment?  It looks like you are using a
cluster (i.e. deploy-mode client) but are you using HDFS?

I have access to a Cloudera CDH 5.12 Hadoop cluster and just ran an example
word count as follows - I'll explain the parameters to tune below:

1) I generate some random data (using common Hadoop tools)
hadoop jar
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar \
  teragen \
  -Dmapred.map.tasks=100 \
  -Dmapred.map.tasks.speculative.execution=false \
  1000  \
  /tmp/tera

This puts 100 files totalling just under 1GB on which I will run the word
count. They are stored in the HDFS filesystem.

2) Run the word count using Spark (2.3.x) and Beam 2.5.0

In my cluster I have YARN to allocate resources, and an HDFS filesystem.
This will be different if you run Spark as standalone, or on a cloud
environment.

spark2-submit \
  --conf spark.default.parallelism=45 \
  --class org.apache.beam.runners.spark.examples.WordCount \
  --master yarn \
  --executor-memory 2G \
  --executor-cores 5 \
  --num-executors 9 \
  --jars
beam-sdks-java-core-2.5.0.jar,beam-runners-core-construction-java-2.5.0.jar,beam-runners-core-java-2.5.0.jar,beam-sdks-java-io-hadoop-file-system-2.5.0.jar
\
  beam-runners-spark-2.5.0.jar \
  --runner=SparkRunner \
  --inputFile=hdfs:///tmp/tera/* \
  --output=hdfs:///tmp/wordcount

The jars I provide here are the minimum needed for running on HDFS with
Spark and normally you'd build those into your project as an über jar.

The important bits for tuning for performance are the following - these
will be applicable for any Spark deployment (unless embedded):

  spark.default.parallelism - controls the parallelism of the beam
pipeline. In this case, how many workers are tokenizing the input data.
  executor-memory, executor-cores, num-executors - controls the resources
spark will use

Note that the parallelism of 45 means that the 5 cores in the 9 executors
can all run concurrently (i.e. 5x9 = 45). When you get to very large
datasets, you will likely have parallelism much higher.

In this test I see around 20 seconds initial startup of Spark (copying
jars, requesting resources from YARN, establishing the Spark context) but
once up the job completes in a few seconds writing the output into 45 files
(because of the parallelism). The files are named
/tmp/wordcount-000*-of-00045.

I hope this helps provide a few pointers, but if you elaborate on your
environment we might be able to assist more.

Best wishes,
Tim













On Tue, Sep 18, 2018 at 9:29 AM Robert Bradshaw  wrote:

> There are known performance issues with Beam on Spark that are being
> worked on, e.g. https://issues.apache.org/jira/browse/BEAM-5036 . It's
> possible you're hitting something different, but would be worth
> investigating. See also
> https://lists.apache.org/list.html?dev@beam.apache.org:lte=1M:Performance%20of%20write
>
> On Tue, Sep 18, 2018 at 8:39 AM devinduan(段丁瑞) 
> wrote:
>
>> Hi,
>> I'm testing Beam on Spark.
>> Using the Spark example WordCount to process a 1G data file costs 1
>> minute.
>> However, using the Beam example WordCount to process the same file
>> costs 30 minutes.
>> My Spark parameter is :  --deploy-mode client  --executor-memory 1g
>> --num-executors 1 --driver-memory 1g
>> My Spark version is 2.3.1,  Beam version is 2.5
>> Is there any optimization method?
>> Thank you.
>>
>>
>>
>


Re: [VOTE] Donating the Dataflow Worker code to Apache Beam

2018-09-14 Thread Tim
+1

> On 15 Sep 2018, at 01:23, Yifan Zou  wrote:
> 
> +1
> 
>> On Fri, Sep 14, 2018 at 4:20 PM David Morávek  
>> wrote:
>> +1
>> 
>> 
>> 
>>> On 15 Sep 2018, at 00:59, Anton Kedin  wrote:
>>> 
>>> +1
>>> 
 On Fri, Sep 14, 2018 at 3:22 PM Alan Myrvold  wrote:
 +1
 
> On Fri, Sep 14, 2018 at 3:16 PM Boyuan Zhang  wrote:
> +1
> 
>> On Fri, Sep 14, 2018 at 3:15 PM Henning Rohde  wrote:
>> +1
>> 
>>> On Fri, Sep 14, 2018 at 2:40 PM Ahmet Altay  wrote:
>>> +1 (binding)
>>> 
 On Fri, Sep 14, 2018 at 2:35 PM, Lukasz Cwik  wrote:
 +1 (binding)
 
> On Fri, Sep 14, 2018 at 2:34 PM Pablo Estrada  
> wrote:
> +1
> 
>> On Fri, Sep 14, 2018 at 2:32 PM Andrew Pilloud  
>> wrote:
>> +1
>> 
>>> On Fri, Sep 14, 2018 at 2:31 PM Lukasz Cwik  
>>> wrote:
>>> There was generally positive support and good feedback[1] but it 
>>> was not unanimous. I wanted to bring the donation of the Dataflow 
>>> worker code base to Apache Beam master to a vote.
>>> 
>>> +1: Support having the Dataflow worker code as part of Apache Beam 
>>> master branch
>>> -1: Dataflow worker code should live elsewhere
>>> 
>>> 1: 
>>> https://lists.apache.org/thread.html/89efd3bc1d30f3d43d4b361a5ee05bd52778c9dc3f43ac72354c2bd9@%3Cdev.beam.apache.org%3E
>>> 


Re: Donating the Dataflow Worker code to Apache Beam

2018-09-13 Thread Tim Robertson
+1 (non-Googler)
It sounds pragmatic, helps with transparency should issues arise, and
enables more people to fix things.


On Thu, Sep 13, 2018 at 8:15 PM Dan Halperin  wrote:

> From my perspective as a (non-Google) community member, huge +1.
>
> I don't see anything bad for the community about open sourcing more of the
> probably-most-used runner. While the DirectRunner is probably still the
> most referential implementation of Beam, can't hurt to see more working
> code. Other runners or runner implementors can refer to this code if they
> want, and ignore it if they don't.
>
> In terms of having more code and tests to support, well, that's par for
> the course. Will this change make the things that need to be done to
> support them more obvious? (E.g., "this PR is blocked because someone at
> Google on Dataflow team has to fix something" vs "this PR is blocked
> because the Apache Beam code in foo/bar/baz is failing, and anyone who can
> see the code can fix it"). The latter seems like a clear win for the
> community.
>
> (As long as the code donation is handled properly, but that's completely
> orthogonal and I have no reason to think it wouldn't be.)
>
> Thanks,
> Dan
>
> On Thu, Sep 13, 2018 at 11:06 AM Lukasz Cwik  wrote:
>
>> Yes, I'm specifically asking the community for opinions as to whether it
>> should be accepted or not.
>>
>> On Thu, Sep 13, 2018 at 10:51 AM Raghu Angadi  wrote:
>>
>>> This is terrific!
>>>
>>> Is this thread asking for opinions from the community about whether it
>>> should be accepted? Assuming the Google-side decision is made to contribute, big +1 from
>>> me to include it next to other runners.
>>>
>>> On Thu, Sep 13, 2018 at 10:38 AM Lukasz Cwik  wrote:
>>>
 At Google we have been importing the Apache Beam code base and
 integrating it with the Google portion of the codebase that supports the
 Dataflow worker. This process is painful as we are regularly making
 breaking API changes to support libraries related to running portable
 pipelines (and sometimes in other places as well). This has sometimes made it
 difficult for PRs to make changes without either breaking
 something for Google or waiting for a Googler to make the change internally
 (e.g. dependency updates).

 This code is very similar to the other integrations that exist for
 runners such as Flink/Spark/Apex/Samza. It is an adaption layer that sits
 on top of an execution engine. There is no super secret awesome stuff as
 this code was already publicly visible in the past when it was part of the
 Google Cloud Dataflow github repo[1].

 Process wise the code will need to get approval from Google to be
 donated and for it to go through the code donation process but before we
 attempt to do that, I was wondering whether the community would object to
 adding this code to the master branch?

 The up side is that people can make breaking changes and fix it for all
 runners. It will also help Googlers contribute more to the portability
 story as it will remove the burden of doing the code import (wasted time)
 and it will allow people to develop in master (can have the whole project
 loaded in a single IDE).

 The downsides are that this will represent more code and unit tests to
 support.

 1:
 https://github.com/GoogleCloudPlatform/DataflowJavaSDK/tree/hotfix_v1.2/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker

>>>


Re: [FYI] Paper of Building Beam Runner for IBM Streams

2018-09-10 Thread Tim
Thanks for sharing Manu - interesting paper indeed.

Tim

> On 10 Sep 2018, at 16:02, Maximilian Michels  wrote:
> 
> Excellent write-up. Thank you!
> 
>> On 09.09.18 20:43, Jean-Baptiste Onofré wrote:
>> Good idea. It could also help people who wants to create runners.
>> Regards
>> JB
>>> On 09/09/2018 13:00, Manu Zhang wrote:
>>> Hi all,
>>> 
>>> I've spent the weekend reading Challenges and Experiences in Building an
>>> Efficient Apache Beam Runner For IBM Streams
>>> <http://www.vldb.org/pvldb/vol11/p1742-li.pdf> from the August
>>> proceedings of PVLDB. It's quite enjoyable and urges me to reflect on how
>>> I (should've) implemented the Gearpump runner. I believe it will be
>>> beneficial to have more such papers and discussions as sharing design
>>> choices and lessons from various runners.
>>> 
>>> Enjoy it !
>>> Manu Zhang


Re: [DISCUSS] Unification of Hadoop related IO modules

2018-09-07 Thread Tim
Another +1 for option 3 (and a preference for the HadoopFormatIO naming).
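
For illustration, a rough sketch of how a merged read/write module could look
from user code, modeled on the then-existing HadoopInputFormatIO API. The
HadoopFormatIO class name and the write() shape are hypothetical here, pending
the outcome of this thread (the write part is what PR 6306 adds):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.InputFormat;
  import org.apache.hadoop.mapreduce.OutputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  Pipeline p = Pipeline.create();

  // Read through a MapReduce InputFormat (mirrors HadoopInputFormatIO today)
  Configuration readConf = new Configuration();
  readConf.setClass("mapreduce.job.inputformat.class",
      TextInputFormat.class, InputFormat.class);
  readConf.setClass("key.class", LongWritable.class, Object.class);
  readConf.setClass("value.class", Text.class, Object.class);
  PCollection<KV<LongWritable, Text>> lines =
      p.apply(HadoopFormatIO.<LongWritable, Text>read()
          .withConfiguration(readConf));

  // Write through a MapReduce OutputFormat (hypothetical, per PR 6306)
  Configuration writeConf = new Configuration();
  writeConf.setClass("mapreduce.job.outputformat.class",
      TextOutputFormat.class, OutputFormat.class);
  lines.apply(HadoopFormatIO.<LongWritable, Text>write()
      .withConfiguration(writeConf));

Having one entry point like this keeps the user-facing mapping clean
regardless of how the modules are split internally.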

Thanks Alexey,

Tim


> On 7 Sep 2018, at 19:13, Andrew Pilloud  wrote:
> 
> +1 for option 3. That approach will keep the mapping clean if SQL supports 
> this IO. It would be good to put the proxy in the old module and move the 
> implementation now. That way the old module can be easily deleted when the 
> time comes.
> 
> Andrew
> 
>> On Fri, Sep 7, 2018 at 6:15 AM Robert Bradshaw  wrote:
>> OK, good, that's what I thought. So I stick by (3) which
>> 
>> 1) Cleans up the library for all future uses (hopefully the majority of all 
>> users :). 
>> 2) Is fully backwards compatible for existing users, minimizing disruption, 
>> and giving them time to migrate. 
>> 
>>> On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko  
>>> wrote:
>>> In the next release it will still be compatible because we keep the module
>>> “hadoop-input-format”, but we make it deprecated and propose to use it
>>> through the module “hadoop-format” and a proxy class HadoopFormatIO (or
>>> HadoopMapReduceFormatIO, whatever we name it) which will provide Write/Read
>>> functionality by using the MapReduce InputFormat or OutputFormat classes.
>>> Then, in future releases after the next one, we can drop “hadoop-input-format”
>>> since it was deprecated and we provided time to move to the new API. I think
>>> this is the less painful way for the user but the most complicated for us if
>>> the final goal is to merge “hadoop-input-format” and “hadoop-output-format” together.
>>> 
>>>> On 7 Sep 2018, at 13:45, Robert Bradshaw  wrote:
>>>> 
>>>> Agree about not impacting users. Perhaps I misread (3), isn't it fully 
>>>> backwards compatible as well? 
>>>> 
>>>> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré  
>>>> wrote:
>>>>> Hi,
>>>>> 
>>>>> in order to limit the impact for the existing users on Beam 2.x series,
>>>>> I would go for (1).
>>>>> 
>>>>> Regards
>>>>> JB
>>>>> 
>>>>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>>>>> > Hello everyone,
>>>>> > 
>>>>> > I’d like to discuss the following topic (see below) with community since
>>>>> > the optimal solution is not clear for me.
>>>>> > 
>>>>> > There is Java IO module, called “/hadoop-input-format/”, which allows to
>>>>> > use MapReduce InputFormat implementations to read data from different
>>>>> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>>> > According to its name, it has only a “Read” part and is missing a “Write” part,
>>>>> > so, I'm working on “/hadoop-output-format/” to support MapReduce
>>>>> > OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>>>> > this I created another module with this name. So, in the end, we will
>>>>> > have two different modules “/hadoop-input-format/” and
>>>>> > “/hadoop-output-format/”, which looks quite strange to me since, afaik,
>>>>> > every existing Java IO that we have encapsulates the Read and Write parts
>>>>> > in one module. Additionally, we have “/hadoop-common/” and
>>>>> > /“hadoop-file-system/” as other hadoop-related modules. 
>>>>> > 
>>>>> > Now I’m thinking how it will be better to organise all these Hadoop
>>>>> > modules better. There are several options in my mind: 
>>>>> > 
>>>>> > 1) Add new module “/hadoop-output-format/” and leave all Hadoop modules
>>>>> > “as it is”. 
>>>>> > Pros: no breaking changes, no additional work 
>>>>> > Cons: not logical for users to have the same IO in two different modules
>>>>> > and with different names.
>>>>> > 
>>>>> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
>>>>> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”,
>>>>> > keep the other Hadoop modules “as it is”.
>>>>> > Pros: to have InputFormat/OutputFormat in one IO module which is logical
>>>>> > for users
>>>>> > Cons: breaking changes for user code because of module/IO renaming 
>>>>> > 
>>>>> > 3) Add new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”

Re: [DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-09-05 Thread Tim Robertson
> across all the supported versions. Some of the features (e.g. server side
>>>>> timestamps) are disabled based on runtime Kafka version.  The unit tests
>>>>> currently run with single recent version. Integration tests could 
>>>>> certainly
>>>>> use multiple versions. With some more effort in writing tests, we could
>>>>> make multiple versions of the unit tests.
>>>>>
>>>>> Raghu.
>>>>>
>>>>> IO versioning
>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>> because most big data distributions still use 5.x (however 5.x has
>>>>>> been EOL).
>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>>>>> module uses a single version with the kafka client as a provided
>>>>>> dependency and so far it works (but we don’t have multi version
>>>>>> tests).
>>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía 
>>>>> wrote:
>>>>>
>>>>>> I think we should refine the strategy on dependencies discussed
>>>>>> recently. Sorry to come late with this (I did not follow closely the
>>>>>> previous discussion), but the current approach is clearly not in line
>>>>>> with the industry reality (at least not for IO connectors + Hadoop +
>>>>>> Spark/Flink use).
>>>>>>
>>>>>> A really proactive approach to dependency updates is a good practice
>>>>>> for the core dependencies we have e.g. Guava, Bytebuddy, Avro,
>>>>>> Protobuf, etc, and of course for the case of cloud based IOs e.g. GCS,
>>>>>> Bigquery, AWS S3, etc. However when we talk about self hosted data
>>>>>> sources or processing systems this gets more complicated and I think
>>>>>> we should be more flexible and do this case by case (and remove these
>>>>>> from the auto update email reminder).
>>>>>>
>>>>>> Some open source projects have at least three maintained versions:
>>>>>> - LTS – maps to what most of the people have installed (or the big
>>>>>> data distributions use) e.g. HBase 1.1.x, Hadoop 2.6.x
>>>>>> - Stable – current recommended version. HBase 1.4.x, Hadoop 2.8.x
>>>>>> - Next – latest release. HBase 2.1.x Hadoop 3.1.x
>>>>>>
>>>>>> Following the most recent versions can be good to be close to the
>>>>>> current development of other projects and some of the fixes, but these
>>>>>> versions are commonly not deployed for most users and adopting a LTS
>>>>>> or stable only approach won't satisfy all cases either. To understand
>>>>>> why this is complex let’s see some historical issues:
>>>>>>
>>>>>> IO versioning
>>>>>> * Elasticsearch. We delayed the move to version 6 until we heard of
>>>>>> more active users needing it (more deployments). We support 2.x and
>>>>>> 5.x (but 2.x went recently EOL). Support for 6.x is in progress.
>>>>>> * SolrIO, stable version is 7.x, LTS is 6.x. We support only 5.x
>>>>>> because most big data distributions still use 5.x (however 5.x has
>>>>>> been EOL).
>>>>>> * KafkaIO uses version 1.x but Kafka recently moved to 2.x, however
>>>>>> most of the deployments of Kafka use earlier versions than 1.x. This
>>>>>> module uses a single version with the kafka client as a provided
>>>>>> dependency and so far it works (but we don’t have multi version
>>>>>> tests).
>>>>>>
>>>>>> Runners versioning
>>>>>> * The move to Spark 1 to Spark 2 was decided after evaluating the
>>>>>> tradeoffs between maintaining multiple version support and to have
>>>>>> breaking changes with the issues of maintaining multiple versions.
>>>>>> This is a rare case but also with consequences. This dependency is
>>>>>> provided but we don't actively test issues on version migration.
>>>>>> * Flink moved to version 1.5, introducing incompatibility in
>>>>>> checkpointing (discussed recently, with no consensus yet on how to
>>>>>> handle it).
>>>>>>
>>>>>> As you can see, it seems really hard to have a solution that fits all
>>>>>> cases. Probably the only rule that I see from this list is that we
>>>>>> should upgrade versions for connectors that have been deprecated or
>>>>>> arrived to the EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>>>>>>
>>>>>> For the case of the provided dependencies I wonder if as part of the
>>>>>> tests we should provide tests with multiple versions (note that this
>>>>>> is currently blocked by BEAM-4087).
>>>>>>
>>>>>> Any other ideas or opinions to see how we can handle this? What other
>>>>>> people in the community think ? (Notice that this can have relation
>>>>>> with the ongoing LTS discussion.
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
>>>>>>  wrote:
>>>>>> >
>>>>>> > Hi folks,
>>>>>> >
>>>>>> > I'd like to revisit the discussion around our versioning policy
>>>>>> specifically for the Hadoop ecosystem and make sure we are aware of the
>>>>>> implications.
>>>>>> >
>>>>>> > As an example our policy today would have us on HBase 2.1 and I
>>>>>> have reminders to address this.
>>>>>> >
>>>>>> > However, currently the versions of HBase in the major hadoop
>>>>>> distros are:
>>>>>> >
>>>>>> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>>>>>> >  - Hortonworks HDP3 on HBase 2.0 (only recently released so we can
>>>>>> assume is not widely adopted)
>>>>>> >  - AWS EMR HBase on 1.4
>>>>>> >
>>>>>> > On the versioning I think we might need a more nuanced approach to
>>>>>> ensure that we target real communities of existing and potential users.
>>>>>> Enterprise users need to stick to the supported versions in the
>>>>>> distributions to maintain support contracts from the vendors.
>>>>>> >
>>>>>> > Should our versioning policy have more room to consider on a case
>>>>>> by case basis?
>>>>>> >
>>>>>> > For Hadoop might we benefit from a strategy on which community of
>>>>>> users Beam is targeting?
>>>>>> >
>>>>>> > (OT: I'm collecting some thoughts on what we might consider to
>>>>>> target enterprise hadoop users - kerberos on all relevant IO, 
>>>>>> performance,
>>>>>> leaking beyond encryption zones with temporary files etc)
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Tim
>>>>>>
>>>>>


[DISCUSS] Versioning, Hadoop related dependencies and enterprise users

2018-08-28 Thread Tim Robertson
Hi folks,

I'd like to revisit the discussion around our versioning policy
specifically for the Hadoop ecosystem and make sure we are aware of the
implications.

As an example our policy today would have us on HBase 2.1 and I have
reminders to address this.

However, currently the versions of HBase in the major hadoop distros are:

 - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
 - Hortonworks HDP3 on HBase 2.0 (only recently released so we can assume
is not widely adopted)
 - AWS EMR HBase on 1.4

On the versioning I think we might need a more nuanced approach to ensure
that we target real communities of existing and potential users. Enterprise
users need to stick to the supported versions in the distributions to
maintain support contracts from the vendors.

Should our versioning policy have more room to consider things on a
case-by-case basis?

For Hadoop might we benefit from a strategy on which community of users
Beam is targeting?

(OT: I'm collecting some thoughts on what we might consider to target
enterprise Hadoop users - Kerberos on all relevant IOs, performance, leaking
beyond encryption zones with temporary files, etc.)

Thanks,
Tim


Re: [DISCUSS] Performance of write() in file based IO

2018-08-23 Thread Tim Robertson
Thanks for linking this discussion with BEAM-5036 (and transitively to
BEAM-4861, which also comes into play), Jozek.

What Reuven speculated and Jozek had previously observed is indeed the
major cause. Today I've been testing the effect of a "move" using rename()
instead of a copy() and delete().

My test environment is different today but still using 1.5TB input data and
the code I linked earlier in GH [1]:

  - Spark API: 35 minutes
  - Beam AvroIO (2.6.0): 1.7hrs
  - Beam AvroIO with rename() patch: 42 minutes

On the DAG linked in the GH repo [1], stages 3 & 4 are reduced to seconds,
saving 53 minutes from the Beam 2.6.0 version, which is the predominant gain
here.

Unless new comments come in I propose fixing BEAM-5036 and BEAM-4861 and
continuing discussion on those Jiras.
This requires a bit of exploration and a decision around the expectations
when e.g. the target directory does not exist, and also correcting the
incorrect use of the HDFS API (today the return value, which can indicate
errors such as a missing directory, is ignored).
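
For concreteness, a minimal sketch of the move-with-check idea (using the raw
Hadoop FileSystem API rather than Beam's FileSystems wrapper; the paths are
placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  FileSystem fs = FileSystem.get(new Configuration());
  Path src = new Path("/tmp/beam-temp/shard-00000");
  Path dst = new Path("/data/output/shard-00000");

  // On HDFS rename() is a cheap metadata-only operation, unlike copy()+delete()
  // which streams the bytes. Note it signals some failures (e.g. a missing
  // destination directory) by returning false rather than throwing, so the
  // return value must be checked.
  if (!fs.rename(src, dst)) {
    throw new IOException("Failed to rename " + src + " to " + dst);
  }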

Thank you all for contributing to this discussion.

[1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro



On Thu, Aug 23, 2018 at 11:55 AM Jozef Vilcek  wrote:

> Just for reference, there is a JIRA open for
> FileBasedSink.moveToOutputFiles()  and filesystem move behavior
>
> https://issues.apache.org/jira/browse/BEAM-5036
>
>
> On Wed, Aug 22, 2018 at 9:15 PM Tim Robertson 
> wrote:
>
>> Reuven, I think you might be on to something
>>
>> The Beam HadoopFileSystem copy() does indeed stream through the driver
>> [1], and the FileBasedSink.moveToOutputFiles() seemingly uses that method
>> [2].
>> I'll cobble together a patched version to test using a rename() rather
>> than a copy() and report back findings before we consider the implications.
>>
>> Thanks
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L124
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L288
>>
>> On Wed, Aug 22, 2018 at 8:52 PM Tim Robertson 
>> wrote:
>>
>>> > Does HDFS support a fast rename operation?
>>>
>>> Yes. From the shell it is “mv” and in the Java API it is “rename(Path
>>> src, Path dst)”.
>>> I am not aware of a fast copy though. I think an HDFS copy streams the
>>> bytes through the driver (unless a distcp is issued which is a MR job).
>>>
>>> (Thanks for engaging in this discussion folks)
>>>
>>>
>>> On Wed, Aug 22, 2018 at 6:29 PM Reuven Lax  wrote:
>>>
>>>> I have another theory: in FileBasedSink.moveToOutputFiles we copy the
>>>> temporary files to the final destination and then delete the temp files.
>>>> Does HDFS support a fast rename operation? If so, I bet Spark is using that
>>>> instead of paying the cost of copying the files.
>>>>
>>>> On Wed, Aug 22, 2018 at 8:59 AM Reuven Lax  wrote:
>>>>
>>>>> Ismael, that should already be true. If not using dynamic destinations
>>>>> there might be some edges in the graph that are never used (i.e. no 
>>>>> records
>>>>> are ever published on them), but that should not affect performance. If
>>>>> this is not the case we should fix it.
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Wed, Aug 22, 2018 at 8:50 AM Ismaël Mejía 
>>>>> wrote:
>>>>>
>>>>>> Spark runner uses the Spark broadcast mechanism to materialize the
>>>>>> side input PCollections in the workers, not sure exactly if this is
>>>>>> efficient assigned in an optimal way but seems logical at least.
>>>>>>
>>>>>> Just wondering if we shouldn't better first tackle the fact that if
>>>>>> the pipeline does not have dynamic destinations (this case) WriteFiles
>>>>>> should not be doing so much extra magic?
>>>>>>
>>>>>> On Wed, Aug 22, 2018 at 5:26 PM Reuven Lax  wrote:
>>>>>> >
>>>>>> > Often only the metadata (i.e. temp file names) are shuffled, except
>>>>>> in the "spilling" case (which should only happen when using dynamic
>>>>>> destinations).
>>>>>> >
>>>>>> > WriteFiles depends heavily on side inputs. How are side inputs
>>>>>> implemented in the Spark runner?
>>>>>> >
>>>&g

Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Tim Robertson
Reuven, I think you might be on to something

The Beam HadoopFileSystem copy() does indeed stream through the driver [1],
and the FileBasedSink.moveToOutputFiles() seemingly uses that method [2].
I'll cobble together a patched version to test using a rename() rather than
a copy() and report back findings before we consider the implications.

Thanks

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java#L124
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java#L288

On Wed, Aug 22, 2018 at 8:52 PM Tim Robertson 
wrote:

> > Does HDFS support a fast rename operation?
>
> Yes. From the shell it is “mv” and in the Java API it is “rename(Path src,
> Path dst)”.
> I am not aware of a fast copy though. I think an HDFS copy streams the
> bytes through the driver (unless a distcp is issued which is a MR job).
>
> (Thanks for engaging in this discussion folks)
>
>
> On Wed, Aug 22, 2018 at 6:29 PM Reuven Lax  wrote:
>
>> I have another theory: in FileBasedSink.moveToOutputFiles we copy the
>> temporary files to the final destination and then delete the temp files.
>> Does HDFS support a fast rename operation? If so, I bet Spark is using that
>> instead of paying the cost of copying the files.
>>
>> On Wed, Aug 22, 2018 at 8:59 AM Reuven Lax  wrote:
>>
>>> Ismael, that should already be true. If not using dynamic destinations
>>> there might be some edges in the graph that are never used (i.e. no records
>>> are ever published on them), but that should not affect performance. If
>>> this is not the case we should fix it.
>>>
>>> Reuven
>>>
>>> On Wed, Aug 22, 2018 at 8:50 AM Ismaël Mejía  wrote:
>>>
>>>> Spark runner uses the Spark broadcast mechanism to materialize the
>>>> side input PCollections in the workers, not sure exactly if this is
>>>> efficient assigned in an optimal way but seems logical at least.
>>>>
>>>> Just wondering if we shouldn't better first tackle the fact that if
>>>> the pipeline does not have dynamic destinations (this case) WriteFiles
>>>> should not be doing so much extra magic?
>>>>
>>>> On Wed, Aug 22, 2018 at 5:26 PM Reuven Lax  wrote:
>>>> >
>>>> > Often only the metadata (i.e. temp file names) are shuffled, except
>>>> in the "spilling" case (which should only happen when using dynamic
>>>> destinations).
>>>> >
>>>> > WriteFiles depends heavily on side inputs. How are side inputs
>>>> implemented in the Spark runner?
>>>> >
>>>> > On Wed, Aug 22, 2018 at 8:21 AM Robert Bradshaw 
>>>> wrote:
>>>> >>
>>>> >> Yes, I stand corrected, dynamic writes is now much more than the
>>>> >> primitive window-based naming we used to have.
>>>> >>
>>>> >> It would be interesting to visualize how much of this codepath is
>>>> >> metatada vs. the actual data.
>>>> >>
>>>> >> In the case of file writing, it seems one could (maybe?) avoid
>>>> >> requiring a stable input, as shards are accepted as a whole (unlike,
>>>> >> say, sinks where a deterministic uid is needed for deduplication on
>>>> >> retry).
>>>> >>
>>>> >> On Wed, Aug 22, 2018 at 4:55 PM Reuven Lax  wrote:
>>>> >> >
>>>> >> > Robert - much of the complexity isn't due to streaming, but rather
>>>> because WriteFiles supports "dynamic" output (where the user can choose a
>>>> destination file based on the input record). In practice if a pipeline is
>>>> not using dynamic destinations the full graph is still generated, but much
>>>> of that graph is never used (empty PCollections).
>>>> >> >
>>>> >> > On Wed, Aug 22, 2018 at 3:12 AM Robert Bradshaw <
>>>> rober...@google.com> wrote:
>>>> >> >>
>>>> >> >> I agree that this is concerning. Some of the complexity may have
>>>> also
>>>> >> >> been introduced to accommodate writing files in Streaming mode,
>>>> but it
>>>> >> >> seems we should be able to execute this as a single Map operation.
>>>> >> >>
>>>> >> >> Have you profiled to see which stages and/or operations are
>>&g

Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Tim Robertson
> Does HDFS support a fast rename operation?

Yes. From the shell it is “mv” and in the Java API it is “rename(Path src,
Path dst)”.
I am not aware of a fast copy though. I think an HDFS copy streams the
bytes through the driver (unless a distcp is issued which is a MR job).

(Thanks for engaging in this discussion folks)


On Wed, Aug 22, 2018 at 6:29 PM Reuven Lax  wrote:

> I have another theory: in FileBasedSink.moveToOutputFiles we copy the
> temporary files to the final destination and then delete the temp files.
> Does HDFS support a fast rename operation? If so, I bet Spark is using that
> instead of paying the cost of copying the files.
>
> On Wed, Aug 22, 2018 at 8:59 AM Reuven Lax  wrote:
>
>> Ismael, that should already be true. If not using dynamic destinations
>> there might be some edges in the graph that are never used (i.e. no records
>> are ever published on them), but that should not affect performance. If
>> this is not the case we should fix it.
>>
>> Reuven
>>
>> On Wed, Aug 22, 2018 at 8:50 AM Ismaël Mejía  wrote:
>>
>>> Spark runner uses the Spark broadcast mechanism to materialize the
>>> side input PCollections in the workers, not sure exactly if this is
>>> efficient assigned in an optimal way but seems logical at least.
>>>
>>> Just wondering if we shouldn't better first tackle the fact that if
>>> the pipeline does not have dynamic destinations (this case) WriteFiles
>>> should not be doing so much extra magic?
>>>
>>> On Wed, Aug 22, 2018 at 5:26 PM Reuven Lax  wrote:
>>> >
>>> > Often only the metadata (i.e. temp file names) are shuffled, except in
>>> the "spilling" case (which should only happen when using dynamic
>>> destinations).
>>> >
>>> > WriteFiles depends heavily on side inputs. How are side inputs
>>> implemented in the Spark runner?
>>> >
>>> > On Wed, Aug 22, 2018 at 8:21 AM Robert Bradshaw 
>>> wrote:
>>> >>
>>> >> Yes, I stand corrected, dynamic writes is now much more than the
>>> >> primitive window-based naming we used to have.
>>> >>
>>> >> It would be interesting to visualize how much of this codepath is
>>> >> metatada vs. the actual data.
>>> >>
>>> >> In the case of file writing, it seems one could (maybe?) avoid
>>> >> requiring a stable input, as shards are accepted as a whole (unlike,
>>> >> say, sinks where a deterministic uid is needed for deduplication on
>>> >> retry).
>>> >>
>>> >> On Wed, Aug 22, 2018 at 4:55 PM Reuven Lax  wrote:
>>> >> >
>>> >> > Robert - much of the complexity isn't due to streaming, but rather
>>> because WriteFiles supports "dynamic" output (where the user can choose a
>>> destination file based on the input record). In practice if a pipeline is
>>> not using dynamic destinations the full graph is still generated, but much
>>> of that graph is never used (empty PCollections).
>>> >> >
>>> >> > On Wed, Aug 22, 2018 at 3:12 AM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >> >>
>>> >> >> I agree that this is concerning. Some of the complexity may have
>>> also
>>> >> >> been introduced to accommodate writing files in Streaming mode,
>>> but it
>>> >> >> seems we should be able to execute this as a single Map operation.
>>> >> >>
>>> >> >> Have you profiled to see which stages and/or operations are taking
>>> up the time?
>>> >> >> On Wed, Aug 22, 2018 at 11:29 AM Tim Robertson
>>> >> >>  wrote:
>>> >> >> >
>>> >> >> > Hi folks,
>>> >> >> >
>>> >> >> > I've recently been involved in projects rewriting Avro files and
>>> have discovered a concerning performance trait in Beam.
>>> >> >> >
>>> >> >> > I have observed Beam between 6-20x slower than native Spark or
>>> MapReduce code for a simple pipeline of read Avro, modify, write Avro.
>>> >> >> >
>>> >> >> >  - Rewriting 200TB of Avro files (big cluster): 14 hrs using
>>> Beam/Spark, 40 minutes with a map-only MR job
>>> >> >> >  - Rewriting 1.5TB Avro file (small cluster): 2 hrs using
>>> Beam/Spark, 18 minutes using vanilla Spark code. Test code available [1]
>>> >> >> >
>>> >> >> > These tests were running Beam 2.6.0 on Cloudera 5.12.x clusters
>>> (Spark / YARN) on reference Dell / Cloudera hardware.
>>> >> >> >
>>> >> >> > I have only just started exploring but I believe the cause is
>>> rooted in the WriteFiles which is used by all our file based IO. WriteFiles
>>> is reasonably complex with reshuffles, spilling to temporary files
>>> (presumably to accommodate varying bundle sizes/avoid small files), a
>>> union, a GBK etc.
>>> >> >> >
>>> >> >> > Before I go too far with exploration I'd appreciate thoughts on
>>> whether we believe this is a concern (I do), if we should explore
>>> optimisations or any insight from previous work in this area.
>>> >> >> >
>>> >> >> > Thanks,
>>> >> >> > Tim
>>> >> >> >
>>> >> >> > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
>>>
>>


Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Tim Robertson
> Are we seeing similar discrepancies for Flink?

I am not sure, I'm afraid (no easy access to Flink right now). I tried
without success to get the Apex runner going on Cloudera YARN for this today -
I'll keep trying when time allows.

I've updated the DAGs to show more detail:
https://github.com/gbif/beam-perf/tree/master/avro-to-avro

On Wed, Aug 22, 2018 at 1:41 PM Robert Bradshaw  wrote:

> That is quite the DAG... Are we seeing similar discrepancies for
> Flink? (Trying to understand if this is Beam->Spark translation bloat,
> or inherent to the WriteFiles transform itself.)
> On Wed, Aug 22, 2018 at 1:35 PM Tim Robertson 
> wrote:
> >
> > Thanks Robert
> >
> > > Have you profiled to see which stages and/or operations are taking up
> the time?
> >
> > Not yet. I'm browsing through the spark DAG produced which I've
> committed [1] and reading the code.
> >
> > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
> >
> > On Wed, Aug 22, 2018 at 12:12 PM Robert Bradshaw 
> wrote:
> >>
> >> I agree that this is concerning. Some of the complexity may have also
> >> been introduced to accommodate writing files in Streaming mode, but it
> >> seems we should be able to execute this as a single Map operation.
> >>
> >> Have you profiled to see which stages and/or operations are taking up
> the time?
> >> On Wed, Aug 22, 2018 at 11:29 AM Tim Robertson
> >>  wrote:
> >> >
> >> > Hi folks,
> >> >
> >> > I've recently been involved in projects rewriting Avro files and have
> discovered a concerning performance trait in Beam.
> >> >
> >> > I have observed Beam between 6-20x slower than native Spark or
> MapReduce code for a simple pipeline of read Avro, modify, write Avro.
> >> >
> >> >  - Rewriting 200TB of Avro files (big cluster): 14 hrs using
> Beam/Spark, 40 minutes with a map-only MR job
> >> >  - Rewriting 1.5TB Avro file (small cluster): 2 hrs using Beam/Spark,
> 18 minutes using vanilla Spark code. Test code available [1]
> >> >
> >> > These tests were running Beam 2.6.0 on Cloudera 5.12.x clusters
> (Spark / YARN) on reference Dell / Cloudera hardware.
> >> >
> >> > I have only just started exploring but I believe the cause is rooted
> in the WriteFiles which is used by all our file based IO. WriteFiles is
> reasonably complex with reshuffles, spilling to temporary files (presumably
> to accommodate varying bundle sizes/avoid small files), a union, a GBK etc.
> >> >
> >> > Before I go too far with exploration I'd appreciate thoughts on
> whether we believe this is a concern (I do), if we should explore
> optimisations or any insight from previous work in this area.
> >> >
> >> > Thanks,
> >> > Tim
> >> >
> >> > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
>


Re: [DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Tim Robertson
Thanks Robert

> Have you profiled to see which stages and/or operations are taking up the
time?

Not yet. I'm browsing through the Spark DAG produced, which I've committed
[1], and reading the code.

[1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro

On Wed, Aug 22, 2018 at 12:12 PM Robert Bradshaw 
wrote:

> I agree that this is concerning. Some of the complexity may have also
> been introduced to accommodate writing files in Streaming mode, but it
> seems we should be able to execute this as a single Map operation.
>
> Have you profiled to see which stages and/or operations are taking up the
> time?
> On Wed, Aug 22, 2018 at 11:29 AM Tim Robertson
>  wrote:
> >
> > Hi folks,
> >
> > I've recently been involved in projects rewriting Avro files and have
> discovered a concerning performance trait in Beam.
> >
> > I have observed Beam between 6-20x slower than native Spark or MapReduce
> code for a simple pipeline of read Avro, modify, write Avro.
> >
> >  - Rewriting 200TB of Avro files (big cluster): 14 hrs using Beam/Spark,
> 40 minutes with a map-only MR job
> >  - Rewriting 1.5TB Avro file (small cluster): 2 hrs using Beam/Spark, 18
> minutes using vanilla Spark code. Test code available [1]
> >
> > These tests were running Beam 2.6.0 on Cloudera 5.12.x clusters (Spark /
> YARN) on reference Dell / Cloudera hardware.
> >
> > I have only just started exploring but I believe the cause is rooted in
> the WriteFiles which is used by all our file based IO. WriteFiles is
> reasonably complex with reshuffles, spilling to temporary files (presumably
> to accommodate varying bundle sizes/avoid small files), a union, a GBK etc.
> >
> > Before I go too far with exploration I'd appreciate thoughts on whether
> we believe this is a concern (I do), if we should explore optimisations or
> any insight from previous work in this area.
> >
> > Thanks,
> > Tim
> >
> > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
>


[DISCUSS] Performance of write() in file based IO

2018-08-22 Thread Tim Robertson
Hi folks,

I've recently been involved in projects rewriting Avro files and have
discovered a concerning performance trait in Beam.

I have observed Beam between 6-20x slower than native Spark or MapReduce
code for a simple pipeline of read Avro, modify, write Avro.

 - Rewriting 200TB of Avro files (big cluster): 14 hrs using Beam/Spark, 40
minutes with a map-only MR job
 - Rewriting 1.5TB Avro file (small cluster): 2 hrs using Beam/Spark, 18
minutes using vanilla Spark code. Test code available [1]

These tests were running Beam 2.6.0 on Cloudera 5.12.x clusters (Spark /
YARN) on reference Dell / Cloudera hardware.
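
For reference, the shape of the pipeline being measured is essentially just
the following (a minimal sketch using GenericRecord; the real test code is in
[1], and the schema, paths and ModifyFn here are placeholders):

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.coders.AvroCoder;
  import org.apache.beam.sdk.io.AvroIO;
  import org.apache.beam.sdk.transforms.MapElements;
  import org.apache.beam.sdk.transforms.SimpleFunction;

  Schema schema = new Schema.Parser().parse(schemaJson);
  Pipeline p = Pipeline.create(options);
  p.apply(AvroIO.readGenericRecords(schema).from("hdfs:///tmp/input/*.avro"))
      .apply(MapElements.via(new ModifyFn()))    // trivial per-record change
      .setCoder(AvroCoder.of(schema))
      .apply(AvroIO.writeGenericRecords(schema).to("hdfs:///tmp/output/part"));
  p.run().waitUntilFinish();

where ModifyFn is a SimpleFunction<GenericRecord, GenericRecord> making a
small per-record update, i.e. no shuffle, group or join anywhere.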

I have only just started exploring, but I believe the cause is rooted in
WriteFiles, which is used by all our file-based IO. WriteFiles is reasonably
complex, with reshuffles, spilling to temporary files (presumably to
accommodate varying bundle sizes/avoid small files), a union, a GBK, etc.

Before I go too far with exploration I'd appreciate thoughts on whether we
believe this is a concern (I do), whether we should explore optimisations,
and any insight from previous work in this area.

Thanks,
Tim

[1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
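
For context, the benchmarked pipeline shape (read Avro, modify, write Avro)
is roughly the following minimal sketch. The schema string and paths are
illustrative assumptions; the actual test code is in [1] above.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class AvroToAvro {
  // Placeholder: the real Avro schema JSON of the input files goes here
  private static final String SCHEMA_JSON = "...";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply("Read", AvroIO.readGenericRecords(schema).from("hdfs:///input/*.avro"))
        .apply("Modify", ParDo.of(new DoFn<GenericRecord, GenericRecord>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // a trivial per-record modification stands in for the real one
            c.output(c.element());
          }
        }))
        .setCoder(AvroCoder.of(schema))
        .apply("Write", AvroIO.writeGenericRecords(schema).to("hdfs:///output/part"));
    p.run().waitUntilFinish();
  }
}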


Re: duplicate key-value elements lost when transferring them as side-inputs

2018-08-21 Thread Tim Robertson
Thanks for this Vaclav

The failing test (1 minute timeout exception) is something we see sometimes
and indicates issues in the build environment or a flaky test. I triggered
another build by leaving a comment in the PR - just FYI, this is something
you can also do in the future.

On Tue, Aug 21, 2018 at 10:57 AM Plajt, Vaclav 
wrote:

> Hi,
>
> looking for reviewer https://github.com/apache/beam/pull/6257
>
>
> And maybe some help with failing test in mqtt IO (timeout).
>
>
> Vaclav
> --
> *From:* Lukasz Cwik 
> *Sent:* Monday, August 20, 2018 6:12:24 PM
> *To:* dev
> *Subject:* Re: duplicate key-value elements lost when transferring them as
> side-inputs
>
> Yes, that is a bug. I filed and assigned
> https://issues.apache.org/jira/browse/BEAM-5184 to you, feel free to
> unassign if you're unable to make progress.
>
> On Mon, Aug 20, 2018 at 1:14 AM Plajt, Vaclav <
> vaclav.pl...@firma.seznam.cz> wrote:
>
>> Hi Beam devs,
>>
> >> I'm working on Euphoria DSL, where we implemented `BroadcastHashJoin`
> >> using side-inputs. But our test shows some missing data. We use
> >> `View.asMultimap()` to get our join-small-side into a view in the form of
> >> `PCollectionView<Map<K, Iterable<V>>>`. Then some duplicated key-value pairs
> >> (the same key and value as some other element) get lost. That is of course
> >> unfortunate behavior when doing joins. I believe that it all comes down to:
>>
>>
>> https://github.com/apache/beam/blob/05fb694f265dda0254d7256e938e508fec9ba098/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionViews.java#L293
>>
>>
> >> Where `HashMultimap` is used to gather all the elements into a
> >> `Multimap<K, V>`, which does not allow duplicate key-value pairs. Do you
> >> also feel this is a bug? And if yes, then we would like to fix it by
> >> replacing `HashMultimap` with `ArrayListMultimap`, which allows duplicate
> >> key-value pairs.
>>
>>
> >> We can think of some workarounds. But we prefer to do the fix, if
> >> possible.
>>
>>
>> So what are your opinions? And how should we proceed?
>>
>>
>> Thank you.
>>
>> Vaclav Plajt
>>
>>
>>
>
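
For anyone following along: the Guava behaviour Vaclav describes is easy to
reproduce outside Beam. A minimal sketch, using plain Guava:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;

public class MultimapDuplicates {
  public static void main(String[] args) {
    // HashMultimap keeps a *set* of values per key, so a repeated
    // key-value pair is silently dropped
    Multimap<String, Integer> setBased = HashMultimap.create();
    setBased.put("k", 1);
    setBased.put("k", 1);
    System.out.println(setBased.size()); // prints 1

    // ArrayListMultimap keeps a *list* of values per key and retains
    // duplicates, which is the behaviour a join via side input needs
    Multimap<String, Integer> listBased = ArrayListMultimap.create();
    listBased.put("k", 1);
    listBased.put("k", 1);
    System.out.println(listBased.size()); // prints 2
  }
}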


Re: [Discuss] Add EXTERNAL keyword to CREATE TABLE statement

2018-08-15 Thread Tim
+1 for CREATE EXTERNAL TABLE with similar reasoning given by others on this 
thread.

Tim

> On 15 Aug 2018, at 23:01, Charles Chen  wrote:
> 
> +1 for CREATE EXTERNAL TABLE.  It is a good balance between the general SQL 
> expectation of having tables as an abstraction and reinforcing that Beam does 
> not store your data.
> 
>> On Wed, Aug 15, 2018 at 1:58 PM Rui Wang  wrote:
>> >  I think users will be more confused to find that 'CREATE TABLE' doesn't 
>> > exist than to learn that it might not always create a table.
>> 
>> >> I think that having CREATE TABLE do something unexpected or not do 
>> >> something expected (or do the opposite things depending on the table type 
>> >> or some flag) is worse than having users look up the correct way of 
>> >> creating a data source in Beam SQL without expecting something we don't 
>> >> promise.
>> 
>> I agree on this. Enforcing users to look up documentation for the correct 
>> way is better than letting them use an ambiguous way that could fail their 
>> expectation.
>> 
>> 
>> -Rui
>> 
>>> On Wed, Aug 15, 2018 at 1:46 PM Anton Kedin  wrote:
>>> I think that something unique along the lines of `REGISTER EXTERNAL DATA 
>>> SOURCE` is probably fine, as it doesn't conflict with existing behaviors of 
>>> other dialects.
>>> 
>>> > There is a lot of value in making sure our common operations closely map 
>>> > to the equivalent common operations in other SQL dialects. 
>>> 
>>> We're trying to make opposite points using the same arguments :) A lot of 
>>> popular dialects make difference between CREATE TABLE and CREATE EXTERNAL 
>>> TABLE (or similar):
>>>  - T-SQL:
>>>   create: 
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-transact-sql
>>>   create external: 
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-2017
>>>   external datasource: 
>>> https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql?view=sql-server-2017
>>>  - PL/SQL:
>>>   create: 
>>> https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#i1106369
>>>   create external: 
>>> https://docs.oracle.com/cd/B19306_01/server.102/b14215/et_concepts.htm#i1009127
>>>  - postgres:
>>>   import foreign schema: 
>>> https://www.postgresql.org/docs/9.5/static/sql-importforeignschema.html
>>>   create table: 
>>> https://www.postgresql.org/docs/9.1/static/sql-createtable.html
>>>  - redshift:
>>>   create external schema: 
>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html
>>>   create table: 
>>> https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html
>>>  - hive internal and external: 
>>> https://www.dezyre.com/hadoop-tutorial/apache-hive-tutorial-tables
>>> 
>>> My understanding is that the behavior of create table is somewhat similar 
>>> in all of the above dialects, from the high-level perspective it usually 
>>> creates a persistent table in the current storage context (database). 
>>> That's not what Beam SQL's create table does right now, and my opinion is 
>>> that it should not be called create table for this reason.
>>> 
>>> >  I think users will be more confused to find that 'CREATE TABLE' doesn't 
>>> > exist than to learn that it might not always create a table.
>>> 
>>> I think that having CREATE TABLE do something unexpected or not do 
>>> something expected (or do the opposite things depending on the table type 
>>> or some flag) is worse than having users look up the correct way of 
>>> creating a data source in Beam SQL without expecting something we don't 
>>> promise.
>>> 
>>> >  (For example, a user guessing at the syntax of CREATE TABLE would have a 
>>> > better experience with the error being "field LOCATION not specified" 
>>> > rather than "operation CREATE TABLE not found".)
>>> 
>>> They have to look it up anyway (what format is location for a Pubsub topic? 
>>> or is it a subscription?), and when doing so I think it would be less 
>>> confusing to read that to get data from Pubsub/Kafka/... in Beam SQL you 
>>> have to do something like `REGISTER EXTERNAL DATA SOURCE` than `CREATE 
>>> TABLE`.
>>> 
>>> External 

Re: [VOTE] Apache Beam, version 2.6.0, release candidate #1

2018-08-03 Thread Tim Robertson
+1 (non binding)

With apologies to Valentyn and others, we only had time to test what was
feasible for us this week. We tested our existing pipelines, which source
and sink using AvroIO / HDFS, on Spark 2.3 (Cloudera) with 2.6.0RC1; they
ran without issue and our project tests all pass with 2.6.0RC1.

We'd like to raise that BEAM-4750 is noticeable: our project builds (tests)
have slowed significantly. In general we encourage effort towards (and
thank those progressing) structured monitoring of performance across
releases.



On Fri, Aug 3, 2018 at 9:32 AM, Valentyn Tymofieiev 
wrote:

> Just wanted to highlight again to folks who are interested to help with
> qualifying the release: release validation checklist
> 
>   has
> 2.6.0 tab that shows what has been tested so far for this RC.
>
> Please sign up and add your results. It may be helpful to include in the
> spreadsheet which operation system was used to test the SDK.
>
> Some helpful links:
>
> https://beam.apache.org/contribute/release-guide/#run-validation-tests  -
> these are instructions how to perform release validation steps.
> https://beam.apache.org/get-started/ - these are instructions Beam users
> may actually be following when trying out Beam.
> https://s.apache.org/beam-release-validation - Validation
> checklist/acceptance criteria.
>
> Also it looks like these links may need to be updated to better reflect
> required action items to cover SQL and Go SDK.
>
> thanks,
> Valentyn
>
>
> On Thu, Aug 2, 2018 at 10:07 PM Suneel Marthi  wrote:
>
>> +1 non-binding
>>
>> 1. tested with beam samples
>> 2. verified sigs and hashes of artifacts
>>
>>
>> On Fri, Aug 3, 2018 at 12:43 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (binding)
>>>
>>> Tested with beam-samples.
>>>
>>> I didn't have time to include three Jira, but 2.7.0 should be in vote in
>>> soon ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 01/08/2018 01:50, Pablo Estrada wrote:
>>> > Hello everyone!
>>> >
>>> > I have been able to prepare a release candidate for Beam 2.6.0. : D
>>> >
>>> > Please review and vote on the release candidate #1 for the version
>>> > 2.6.0, as follows:
>>> >
>>> > [ ] +1, Approve the release
>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>> >
>>> > The complete staged set of artifacts is available for your review,
>>> which
>>> > includes:
>>> > * JIRA release notes [1],
>>> > * the official Apache source release to be deployed to dist.apache.org
>>> >  [2], which is signed with the key with
>>> > fingerprint 2F1FEDCDF6DD7990422F482F65224E0292DD8A51 [3],
>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>> > * source code tag "v2.6.0-RC1" [5],
>>> > * website pull request listing the release and publishing the API
>>> > reference manual [6].
>>> > * Python artifacts are deployed along with the source release to the
>>> > dist.apache.org  [2].
>>> >
>>> > The vote will be open for at least 72 hours. It is adopted by majority
>>> > approval, with at least 3 PMC affirmative votes.
>>> >
>>> > Regards
>>> > -Pablo.
>>> >
>>> > [1]
> >>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
> >>> projectId=12319527&version=12343392
>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.6.0/
>>> > [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>> > [4] https://repository.apache.org/content/repositories/
>>> orgapachebeam-1044/
>>> > [5] https://github.com/apache/beam/tree/v2.6.0-RC1
>>> > [6] https://github.com/apache/beam-site/pull/518
>>> >
>>> 
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>


Re: SQS source

2018-07-31 Thread Tim Robertson
I took a pass at reviewing (as a non-committer). I haven't worked on
unbounded IO so wasn't familiar enough with the timestamp and checkpointing
handling, but otherwise it LGTM in general - thanks, John, for the
contribution and for applying the minor suggestions.

OT: Reuven, if you have time on your hands there is also the KuduIO
awaiting review (https://github.com/apache/beam/pull/6021)

On Tue, Jul 31, 2018 at 5:07 PM, Reuven Lax  wrote:

> Ismael, do you have time for this review? If you're too busy, I can try to
> help review it.
>
> John, unfortunately, as Ismael said, even if we speed up the review the
> 2.6.0 branch has already been cut, and we try to only cherry-pick
> important bugfixes. Hopefully the next release will be soon, and it's also
> possible to use the nightly Beam releases in the interim.
>
> Reuven
>
> On Tue, Jul 31, 2018 at 5:14 AM Ismaël Mejía  wrote:
>
>> Hi, we can try to speed up the review, but the 2.6.0 branch was
>> already cut and was stabilizing for the last two weeks, so I am not
>> sure it will make it. Next release should be cut shortly hopefully in
>> 3-4 weeks to follow the 6 week release plan. Hope this can work for
>> you.
>>
>> On Tue, Jul 31, 2018 at 2:13 AM John Rudolf Lewis 
>> wrote:
>> >
>> > I created a pr for my SqsIO contribution. I look forward to your
>> comments.
>> >
>> > https://github.com/apache/beam/pull/6101
>> >
>> > Any chance this could be a part of the 2.6.0 release?
>> >
>> > On Thu, Jul 19, 2018 at 7:39 AM, John Rudolf Lewis <
>> johnrle...@gmail.com> wrote:
>> >>
>> >> Thank you.
>> >>
>> >> I've created a jira ticket to add SQS and have assigned it to myself:
>> https://issues.apache.org/jira/browse/BEAM-4828
>> >>
>> >> Modified the documentation to show it as in-progress:
>> https://github.com/apache/beam/pull/5995
>> >>
>> >> And will be starting my work here: https://github.com/
>> JohnRudolfLewis/beam/tree/Add-SqsIO
>> >>
>> >>
>> >> On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré 
>> wrote:
>> >>>
>> >>> Agree with Ismaël.
>> >>>
>> >>> I would be more than happy to help on this one (as I contributed on
>> AMQP
>> >>> and JMS IOs ;)).
>> >>>
>> >>> Regards
>> >>> JB
>> >>>
>> >>> On 19/07/2018 10:39, Ismaël Mejía wrote:
>> >>> > Thanks for your interest John, it would be a really nice
>> contribution
>> >>> > to add SQS support.
>> >>> >
>> >>> > Some context on the kinesis stuff:
>> >>> >
>> >>> > The reason why kinesis is still in a separate module is more related
>> >>> > to a licensing problem. Kinesis uses some native libraries that are
> >>> > published under a license that is not 100% Apache-compatible and we are not
>> >>> > allowed to shade and republish them but it seems there is a
>> workaround
>> >>> > now, for more details see
>> >>> > https://issues.apache.org/jira/browse/BEAM-3549
>> >>> > In any case if to use SQS you only need the Apache licensed aws-sdk
>> >>> > deps it is ok (and a good idea) if you put it in the
>> >>> > amazon-web-services module.
>> >>> >
>> >>> > The kinesis connector is way more complex for multiple reasons,
>> first,
>> >>> > the raw version of the amazon client libraries is not so ‘friendly’
>> >>> > and the guys who created KinesisIO had to do some workarounds to
>> >>> > provide accurate checkpointing/watermarks. So since SQS is a way
> >>> > simpler system you should probably be OK basing it on simpler
> >>> > sources like AMQP or JMS.
>> >>> >
>> >>> > If you feel like to, please create the JIRA and don’t hesitate to
>> ask
>> >>> > questions if you find issues or if you need some review.
>> >>> >
>> >>> > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik 
>> wrote:
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis <
>> johnrle...@gmail.com> wrote:
>> >>> >>>
>> >>> >>> I need an SQS source for my project that is using beam. A brief
>> search did not turn up any in-progress work in this area. Please point me
>> to the right repo if I missed it.
>> >>> >>
>> >>> >>
>> >>> >> To my knowledge there is none and nobody has marked it in progress
>> on https://beam.apache.org/documentation/io/built-in/. It would be good
>> to create a JIRA issue on https://issues.apache.org/ and send a PR to
>> add SQS to the inprogress list referencing your JIRA. I added you as a
>> contributor in JIRA so you should be able to assign yourself to any issues
>> that you create.
>> >>> >>
>> >>> >>>
>> >>> >>> Assuming there is no in-progress effort, I would like to
>> contribute an Amazon SQS source. I have a few questions before I begin.
>> >>> >>
>> >>> >>
>> >>> >> Great, note that this is a good starting point for authoring an IO
>> transform: https://beam.apache.org/documentation/io/authoring-overview/
>> >>> >>
>> >>> >>>
>> >>> >>>
>> >>> >>> It seems that the current AWS code is split into two different
>> modules: sdk/java/io/amazon-web-services which contains the
>> S3FileSystem, AwsOptions, etc, and sdk/java/io/kinesis which contains an
>> unbounded source based on a kinesis topic. I'd like to 
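
(The archive truncates the message here.) For reference, a minimal sketch of
the kind of polling loop an SQS reader wraps, assuming only the
Apache-licensed aws-java-sdk-sqs client; a real UnboundedSource layers
checkpointing and watermarks on top of this, and the queue URL below is
illustrative:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsPollingSketch {
  public static void main(String[] args) {
    String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(10)
        .withWaitTimeSeconds(20); // long polling
    for (Message m : sqs.receiveMessage(request).getMessages()) {
      System.out.println(m.getBody());
      // a Beam reader would delete only after a checkpoint commits, since
      // SQS redelivers any message not deleted within the visibility timeout
      sqs.deleteMessage(queueUrl, m.getReceiptHandle());
    }
  }
}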

Re: ElasticsearchIO bulk delete

2018-07-30 Thread Tim Robertson
> we decided to postpone the feature

That makes sense.

I believe the ES6 branch is partly working (I've looked at the code but not
used it); you can see it here [1], and the JIRA to watch or contribute to
is [2]. It would be useful to test it independently and report any
observations or improvement requests on that JIRA.

The offer to assist with your first PR remains open for the future - please
don't hesitate to ask.

Thanks,
Tim

[1]
https://github.com/jsteggink/beam/tree/BEAM-3199/sdks/java/io/elasticsearch-6/src/main/java/org/apache/beam/sdk/io/elasticsearch
[2] https://issues.apache.org/jira/browse/BEAM-3199

On Mon, Jul 30, 2018 at 10:55 AM, Wout Scheepers <
wout.scheep...@vente-exclusive.com> wrote:

> Hey Tim,
>
>
>
> Thanks for your proposal to mentor me through my first PR.
>
> As we’re definitely planning to upgrade to ES6 when Beam supports it, we
> decided to postpone the feature (we have a fix that works for us, for now).
>
> When Beam supports ES6, I’ll be happy to make a contribution to get bulk
> deletes working.
>
>
>
> For reference, I opened a ticket (https://issues.apache.org/
> jira/browse/BEAM-5042).
>
>
>
> Cheers,
>
> Wout
>
>
>
>
>
> *From: *Tim Robertson 
> *Reply-To: *"u...@beam.apache.org" 
> *Date: *Friday, 27 July 2018 at 17:43
> *To: *"u...@beam.apache.org" 
> *Subject: *Re: ElasticsearchIO bulk delete
>
>
>
> Hi Wout,
>
>
>
> This is great, thank you. I wrote the partial update support you reference
> and I'll be happy to mentor you through your first PR - welcome aboard. Can
> you please open a Jira to reference this work and we'll assign it to you?
>
>
>
> We discussed having the "_xxx" fields in the document and triggering
> actions based on that in the partial update jira but opted to avoid
> it. Based on that discussion the ActionFn would likely be the preferred
> approach.  Would that be possible?
>
>
>
> It will be important to provide unit and integration tests as well.
>
>
>
> Please be aware that there is a branch and work underway for ES6 already
> which is rather different on the write() path so this may become redundant
> rather quickly.
>
>
>
> Thanks,
>
> Tim
>
>
>
> @timrobertson100 on the Beam slack channel
>
>
>
>
>
>
>
> On Fri, Jul 27, 2018 at 2:53 PM, Wout Scheepers <wout.scheep...@vente-exclusive.com> wrote:
>
> Hey all,
>
>
>
> A while ago, I patched ElasticsearchIO to be able to do partial updates
> and deletes.
>
> However, I did not consider my patch pull-request-worthy as the json
> parsing was done inefficiently (it parsed twice per document).
>
>
>
> Since Beam 2.5.0, partial updates are supported, so the only thing I'm
> missing is the ability to send bulk *delete* requests.
>
> We’re using entity updates for event sourcing in our data lake and need to
> persist deleted entities in elastic.
>
> We’ve been using my patch in production for the last year, but I would
> like to contribute to get the functionality we need into one of the next
> releases.
>
>
>
> I’ve created a gist that works for me, but is still inefficient (parsing
> twice: once to check the ‘_action` field, once to get the metadata).
>
> Each document I want to delete needs an additional ‘_action’ field with
> the value ‘delete’. It doesn’t matter the document still contains the
> redundant field, as the delete action only requires the metadata.
>
> I’ve added the method isDelete() and made some changes to the
> processElement() method.
>
> https://gist.github.com/wscheep/26cca4bda0145ffd38faf7efaf2c21b9
>
>
>
> I would like to make my solution more generic to fit into the current
> ElasticsearchIO and create a proper pull request.
>
> As this would be my first pull request for beam, can anyone point me in
> the right direction before I spend too much time creating something that
> will be rejected?
>
>
>
> Some questions on the top of my mind are:
>
>- Is it a good idea it to make the ‘action’ part for the bulk api
>generic?
>- Should it be even more generic? (e.g.: set an ‘ActionFn’ on the
>ElasticsearchIO)
>- If I want to avoid parsing twice, the parsing should be done outside
>of the getDocumentMetaData() method. Would this be acceptable?
>- Is it possible to avoid passing the action as a field in the
>document?
>- Is there another or better way to get the delete functionality in
>general?
>
>
>
> All feedback is more than welcome.
>
>
> Cheers,
> Wout
>
>
>
>
>
>
>
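
For readers, the bulk-API detail behind this thread: each entry in an
Elasticsearch bulk request is an action metadata line, followed by a
document line for index/update actions, while delete takes the metadata
line only. A minimal sketch of deriving the action from a document field as
Wout's gist does - the `_action` field name is his convention, the id
handling is an illustrative assumption, and the document is parsed only
once, which addresses the double-parsing concern:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class BulkActionSketch {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  /** Builds one bulk-API entry; delete actions carry metadata only. */
  static String toBulkEntry(String json, String index, String type) throws Exception {
    JsonNode doc = MAPPER.readTree(json); // parse once, reuse below
    boolean isDelete = "delete".equals(doc.path("_action").asText());
    String meta = String.format(
        "{\"%s\":{\"_index\":\"%s\",\"_type\":\"%s\",\"_id\":\"%s\"}}",
        isDelete ? "delete" : "index", index, type, doc.path("id").asText());
    return isDelete ? meta + "\n" : meta + "\n" + json + "\n";
  }
}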


Re: CODEOWNERS for apache/beam repo

2018-07-12 Thread Tim Robertson
Hi Udi

I asked the GH helpdesk and they confirmed that only people with write
access will actually be automatically chosen.

I don't expect it should stop us using it, but we should be aware that
there are non-committers also willing to review.

Thanks,
Tim

On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin 
wrote:

> Idea looks good in general.
>
> Did you look into ways to keep this file up-to-date? For example we can
> run a monthly job to see if an owner was active during that period.
>
> --Mikhail
>
>
>
> On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri  wrote:
>
>> Thanks all!
>> I'll try to get the file merged today and see how it works out.
>> Please surface any issues, such as with auto-assignment, here or in JIRA.
>>
>> On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi,
>>>
>>> I added myself as a reviewer for some modules.
>>>
>>> Etienne
>>>
>>> On Monday, 9 July 2018 at 17:06 -0700, Udi Meiri wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm proposing to add auto-reviewer-assignment using Github's CODEOWNERS
>>> mechanism.
>>> Initial version is here: *https://github.com/apache/beam/pull/5909/files
>>> <https://github.com/apache/beam/pull/5909/files>*
>>>
>>> I need help from the community in determining owners for each component.
>>> Feel free to directly edit the PR (if you have permission) or add a
>>> comment.
>>>
>>>
>>> Background
>>> The idea is to:
>>> 1. Document good review candidates for each component.
>>> 2. Help choose reviewers using the auto-assignment mechanism. The
>>> suggestion is in no way binding.
>>>
>>>
>>>


Re: Beam Dependency Ownership

2018-06-28 Thread Tim Robertson
Thanks for this, Yifan.
I've added my name to all Hadoop-related dependencies, Solr, and
Elasticsearch.


On Thu, Jun 28, 2018 at 3:28 PM, Etienne Chauchot 
wrote:

> I've added myself and @Tim Robertson on elasticsearchIO related deps.
>
> Etienne
>
> On Wednesday, 27 June 2018 at 14:05 -0700, Chamikara Jayalath wrote:
>
> It's mentioned under "Dependency declarations may identify owners that are
> responsible for upgrading respective dependencies". Feel free to update if
> you think more details should be added to it. I think it'll be easier if we
> transfer the data in the spreadsheet to comments close to the dependency
> declarations instead of maintaining the spreadsheet (after we collect the data).
> Otherwise we'll have to put an extra effort to make sure that the
> spreadsheet, BeamModulePlugin, and Python setup.py are in sync. We can
> decide on the exact format of the comment to make sure that automated tool
> can easily parse the comment.
>
> - Cham
>
> On Wed, Jun 27, 2018 at 1:45 PM Yifan Zou  wrote:
>
> Thanks Scott, I will supplement the missing packages to the spreadsheet.
> And, we expect this being kept up to date along with the Beam project
> growth. Shall we mention this in the Dependency Guide page
> <https://beam.apache.org/contribute/dependencies/>, @Chamikara Jayalath
> ?
>
> On Wed, Jun 27, 2018 at 11:17 AM Scott Wegner  wrote:
>
> Thanks for kicking off this process Yifan-- I'll add my name to some
> dependencies I'm familiar with.
>
> Do you expect this to be a one-time process, or will we maintain the
> owners over time? If we will maintain this list, it would be easier to keep
> it up-to-date if it was closer to the code. i.e. perhaps each dependency
> registration in the Gradle BeamModulePlugin [1] should include a list of
> owners.
>
> [1] https://github.com/apache/beam/blob/master/buildSrc/src/
> main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L325
>
> On Wed, Jun 27, 2018 at 8:52 AM Yifan Zou  wrote:
>
> Hi all,
>
> We now have the automated detections for Beam dependency updates and
> sending a weekly report to dev mailing list. In order to address the
> updates in time, we want to find owners for all dependencies of Beam, and
> finally, Jira bugs will be automatically created and assigned to the owners
> if actions need to be taken. We also welcome non-owners to upgrade
> dependency packages, but only owners will receive the Jira tickets.
>
> Please review the spreadsheet Beam SDK Dependency Ownership
> <https://docs.google.com/spreadsheets/d/12NN3vPqFTBQtXBc0fg4sFIb9c_mgst0IDePB_0Ui8kE/edit?ts=5b32bec1#gid=0>
>  and
> sign off if you are familiar with any Beam dependencies and willing to
> take in charge of them. It is definitely fine that a single package have
> multiple owners. The more owners we have, the more helps we will get to
> keep Beam dependencies in a healthy state.
>
> Thank you :)
>
> Regards.
> Yifan
>
> https://docs.google.com/spreadsheets/d/12NN3vPqFTBQtXBc0fg4sFIb9c_
> mgst0IDePB_0Ui8kE/edit?ts=5b32bec1#gid=0
>
>


Re: ErrorProne and -Werror enabled for all Java projects

2018-06-27 Thread Tim
Thanks also to you Scott

Tim

> On 27 Jun 2018, at 18:39, Scott Wegner  wrote:
> 
> Six weeks ago [1] we began an effort to improve the quality of the Java 
> codebase via ErrorProne static analysis, and promoting compiler warnings to 
> errors. As of today, all of our Java projects have been migrated and this is 
> now the default setting for Beam [2].
> 
> This was a community effort. The cleanup spanned 48 JIRA issues [3] and 46 
> pull requests [4]. I want to give a big thanks to everyone who helped out: 
> Ismaël Mejía, Tim Robertson, Cade Markegard, and Teng Peng. 
> 
> Thanks!
> 
> [1] 
> https://lists.apache.org/thread.html/cdc729b6349f952d8db78bae99fff74b06b60918cbe09344e075ba35@%3Cdev.beam.apache.org%3E
> [2] https://github.com/apache/beam/pull/5773 
> [3] 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20errorprone
>  
> [4] 
> https://github.com/apache/beam/pulls?utf8=%E2%9C%93&q=is%3Apr+errorprone+merged%3A%3E%3D2018-05-16+
>  


Re: [DISCUSS] Automation for Java code formatting

2018-06-26 Thread Tim Robertson
++1


On Wed, Jun 27, 2018 at 7:36 AM, Ahmet Altay  wrote:

> +1
>
> This is a great idea. Does anyone know a similar tool for Python? I believe
> Go already has this as part of its tools with go fmt.
>
>
> On Tue, Jun 26, 2018 at 9:55 PM, Ankur Goenka  wrote:
>
>> +1
>>
>> IntelliJ can help, but formatting is still an additional thing to keep in
>> mind. Enabling auto-formatting at the Gradle level removes that burden.
>>
>> On Tue, Jun 26, 2018 at 9:49 PM Eugene Kirpichov 
>> wrote:
>>
>>> +1!
>>>
>>> In some cases the temptation to format code manually can be quite
>>> strong, but the ease of just re-running the formatter after any change
>>> (especially after global changes like class/method renames) outweighs it.
>>> I lost count of the times when I wasted a precommit because some line
>>> became >100 characters after a refactoring. I especially love that there's
>>> a gradle task that does this for you - I used to manually run
>>> google-java-format-diff.
>>>
>>> On Tue, Jun 26, 2018 at 9:38 PM Rafael Fernandez 
>>> wrote:
>>>
 +1! Remove guesswork :D



 On Tue, Jun 26, 2018 at 9:15 PM Kenneth Knowles  wrote:

> Hi all,
>
> I like readable code, but I don't like formatting it myself. And I
> _really_ don't like discussing in code review. "Spotless" [1] can enforce 
> -
> and automatically apply - automatic formatting for Java, Groovy, and some
> others.
>
> This is not about style or wanting a particular layout. This is about
> automation, contributor experience, and streamlining review.
>
>  - Contributor experience: MUCH better than checkstyle: error message
> just says "run ./gradlew :beam-your-module:spotlessApply" instead of
> telling them to go in and manually edit.
>
>  - Automation: You want to use autoformat so you don't have to format
> code by hand. But if you autoformat a file that was in some other format,
> then you touch a bunch of unrelated lines. If the file is already
> autoformatted, it is much better.
>
>  - Review: Never talk about code formatting ever again. A PR also
> needs baseline to already be autoformatted or formatting will make it
> unclear which lines are really changed.
>
> This is already available via applyJavaNature(enableSpotless: true)
> and it is turned on for SQL and our buildSrc gradle plugins. It is very
> nice. There is a JIRA [2] to turn it on for the whole code base.
> Personally,
> I think (a) every module could make a different choice if the main
> contributors feel strongly and (b) it is objectively better to always
> autoformat :-)
>
> WDYT? If we do it, it is trivial to add it module-at-a-time or
> globally. If someone conflicts with a massive autoformat commit, they can
> just keep their changes and autoformat them and it is done.
>
> Kenn
>
> [1] https://github.com/diffplug/spotless/tree/master/plugin-gradle
> [2] https://issues.apache.org/jira/browse/BEAM-4394
>
>
>


Re: [CANCEL][VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-13 Thread Tim Robertson
Hi Pablo,

I'm afraid I couldn't find one either... there is an issue about it [1]
which is old so it doesn't look likely to be resolved either.

If you have time (sorry, I am a bit busy) could you please verify the
version does work if you install it locally? I know the Maven way of doing
that [2] but am not sure of the Gradle equivalent. If we know it works, we
can then find a repository that fits OK with Apache/Beam policy.

Alternatively, we could consider using a fully qualified reference (i.e.
@edu.umd.cs.findbugs.annotations.SuppressWarnings) to the deprecated
version and leave the dependency at the 1.3.9-1. I believe our general
direction is to remove findbugs when errorprone covers all aspects so I
*expect* this should be considered reasonable.

I hope this helps,
Tim

[1] https://github.com/stephenc/findbugs-annotations/issues/4
[2] https://maven.apache.org/guides/mini/guide-3rd-party-jars-local.html
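
To illustrate the fully qualified alternative (a sketch; the deprecated
annotation in findbugs-annotations 1.3.9-1 takes a value and an optional
justification):

public class Example {
  private final byte[] internal = new byte[0];

  // Using the fully qualified name avoids importing the deprecated
  // annotation and any clash with java.lang.SuppressWarnings
  @edu.umd.cs.findbugs.annotations.SuppressWarnings(
      value = "EI_EXPOSE_REP",
      justification = "exposing the internal array is intentional here")
  public byte[] data() {
    return internal;
  }
}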

On Wed, Jun 13, 2018 at 8:39 AM, Pablo Estrada  wrote:

> Hi Tim,
> you're right. Thanks for pointing that out. There's just one problem that
> I'm running into now: The 3.0.1-1 version does not seem to be available in
> Maven Central[1]. Looking at the website, I am not quite sure if there's
> another repository where they do stage the newer versions?[2]
>
> -P
>
> [1] https://repo.maven.apache.org/maven2/com/github/
> stephenc/findbugs/findbugs-annotations/
> [2] http://stephenc.github.io/findbugs-annotations/
>
> On Tue, Jun 12, 2018 at 11:10 PM Tim Robertson 
> wrote:
>
>> Hi Pablo,
>>
>> I took only a quick look.
>>
>> "- The JAR from the non-LGPL findbugs does not contain the
>> SuppressFBWarnings annotation"
>>
>> Unless I misunderstand you it looks like SuppressFBWarnings was added in
>> Stephen's version in this commit [1] which was introduced in version
>> 2.0.3-1 -  I've checked is in the 3.0.1-1 build [2]
>> I notice in your commits [1] you've been exploring version 3.0.0 already
>> though... what happens when you use 3.0.1-1? It sounds like the wrong
>> version is coming in rather than the annotation being missing.
>>
>> Thanks,
>> Tim
>>
>> [1] https://github.com/stephenc/findbugs-annotations/
>> commits/master/src/main/java/edu/umd/cs/findbugs/
>> annotations/SuppressWarnings.java
>> [2] https://github.com/stephenc/findbugs-annotations/releases
>> [3] https://github.com/apache/beam/pull/5609/commits/
>> 32c7df706e970557f154ff6bc521b2e00f9d09ab
>>
>> On Wed, Jun 13, 2018 at 2:37 AM, Pablo Estrada 
>> wrote:
>>
>>> Hi all,
>>> I'll humbly declare that after wrestling with the build to stop depending
>>> on the wrong findbugs_annotations, I feel somewhat lost. The issue is
>>> actually quite small:
>>>
>>> - The JAR from the non-LGPL findbugs does not contain the
>>> SuppressFBWarnings annotation. This means that when building, ByteBuddy
>>> produces a few warnings (nothing critical).
>>> - The easiest way to avoid this failure is to call
>>> applyJavaNature(failOnWarning: false), but this would be bad, since we want
>>> to keep a high standard for tasks like ErrorProne and FindBugs itself.
>>> - So I find myself lost: How do we suppress trivial warnings coming from
>>> missing annotations, and honor warnings coming from other plugins?
>>>
>>> Any help / a PR from someone more capable would be appreciated.
>>> Best
>>> -P.
>>>
>>> On Tue, Jun 12, 2018 at 3:02 PM Ismaël Mejía  wrote:
>>>
>>>> Yes, ok I was not aware it was already being addressed, nice.
>>>> On Tue, Jun 12, 2018 at 11:56 PM Ahmet Altay  wrote:
>>>> >
>>>> > Ismaël,
>>>> >
>>>> > I believe Pablo's https://github.com/apache/beam/pull/5609 is fixing
>>>> the issue by changing the findbugs back to "com.github.stephenc.findbugs".
>>>> Is this what you are referring to?
>>>> >
>>>> > Ahmet
>>>> >
>>>> > On Tue, Jun 12, 2018 at 2:51 PM, Boyuan Zhang 
>>>> wrote:
>>>> >>
>>>> >> Hey JB,
>>>> >>
>>>> >> I added some instructions about how to create python wheels in this
>>>> PR: https://github.com/apache/beam-site/pull/467 . Hope it would be
>>>> helpful.
>>>> >>
>>>> >> Boyuan
>>>> >>
>>>> >
>>>>
>>>
>>
>


Re: [CANCEL][VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-13 Thread Tim Robertson
Hi Pablo,

I took only a quick look.

"- The JAR from the non-LGPL findbugs does not contain the
SuppressFBWarnings annotation"

Unless I misunderstand you it looks like SuppressFBWarnings was added in
Stephen's version in this commit [1] which was introduced in version
2.0.3-1 - I've checked it is in the 3.0.1-1 build [2].
I notice in your commits [3] you've been exploring version 3.0.0 already
though... what happens when you use 3.0.1-1? It sounds like the wrong
version is coming in rather than the annotation being missing.

Thanks,
Tim

[1]
https://github.com/stephenc/findbugs-annotations/commits/master/src/main/java/edu/umd/cs/findbugs/annotations/SuppressWarnings.java
[2] https://github.com/stephenc/findbugs-annotations/releases
[3]
https://github.com/apache/beam/pull/5609/commits/32c7df706e970557f154ff6bc521b2e00f9d09ab

On Wed, Jun 13, 2018 at 2:37 AM, Pablo Estrada  wrote:

> Hi all,
> I'll humbly declare that after wrestling with the build to stop depending
> on the wrong findbugs_annotations, I feel somewhat lost. The issue is
> actually quite small:
>
> - The JAR from the non-LGPL findbugs does not contain the
> SuppressFBWarnings annotation. This means that when building, ByteBuddy
> produces a few warnings (nothing critical).
> - The easiest way to avoid this failure is to call
> applyJavaNature(failOnWarning: false), but this would be bad, since we want
> to keep a high standard for tasks like ErrorProne and FindBugs itself.
> - So I find myself lost: How do we suppress trivial warnings coming from
> missing annotations, and honor warnings coming from other plugins?
>
> Any help / a PR from someone more capable would be appreciated.
> Best
> -P.
>
> On Tue, Jun 12, 2018 at 3:02 PM Ismaël Mejía  wrote:
>
>> Yes, ok I was not aware it was already being addressed, nice.
>> On Tue, Jun 12, 2018 at 11:56 PM Ahmet Altay  wrote:
>> >
>> > Ismaël,
>> >
>> > I believe Pablo's https://github.com/apache/beam/pull/5609 is fixing
>> the issue by changing the findbugs back to "com.github.stephenc.findbugs".
>> Is this what you are referring to?
>> >
>> > Ahmet
>> >
>> > On Tue, Jun 12, 2018 at 2:51 PM, Boyuan Zhang 
>> wrote:
>> >>
>> >> Hey JB,
>> >>
>> >> I added some instructions about how to create python wheels in this
>> PR: https://github.com/apache/beam-site/pull/467 . Hope it would be
>> helpful.
>> >>
>> >> Boyuan
>> >>
>> >
>>
>


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-10 Thread Tim
Tested by our team:
- mvn inclusion
- Avro, ES, Hadoop IF IO
- Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
- Reviewed release notes

+1 

Thanks also to everyone who helped get over the gradle hurdle and in particular 
to JB.

Tim

> On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré  wrote:
> 
> No problem Pablo.
> 
> The vote period is a minimum, it can be extended as requested or if we
> don't have the minimum of 3 binding votes.
> 
> Regards
> JB
> 
>> On 09/06/2018 01:54, Pablo Estrada wrote:
>> Hello all,
>> I'd like to request an extension of the voting period until Monday
>> evening (US time, so later in other geographical regions). This is
>> because we were only now able to publish Dataflow Workers, and have not
>> had the chance to run release validation tests on them. The extension
>> will allow us to validate and vote by Monday.
>> Is this acceptable to the community?
>> 
>> Best
>> -P.
>> 
>> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko
>> <aromanenko@gmail.com> wrote:
>> 
>>Thank you JB for your work!
>> 
>>I tested running simple streaming (/KafkaIO/) and batch (/TextIO /
>>HDFS/) pipelines with SparkRunner on YARN cluster - it works fine.
>> 
>>WBR,
>>Alexey
>> 
>> 
>>>On 8 Jun 2018, at 10:00, Etienne Chauchot <echauc...@apache.org> wrote:
>>> 
>>>I forgot to vote:
>>>+1 (non binding). 
>>>What I tested:
>>>- no functional or performance regression comparing to v2.4
>>>- dependencies in the poms are ok
>>> 
>>>Etienne
>>>>On Friday 08 June 2018 at 08:27 +0200, Romain Manni-Bucau wrote:
>>>>+1 (non-binding), mainstream usage is not broken by the pom
>>>>changes and runtime has no known regression compared to the 2.4.0
>>>> 
>>>>(side note: kudo to JB for this build tool change release, I know
>>>>how it can hurt ;))
>>>> 
>>>>Romain Manni-Bucau
>>>>@rmannibucau <https://twitter.com/rmannibucau> |  Blog
>>>><https://rmannibucau.metawerx.net/> | Old Blog
>>>><http://rmannibucau.wordpress.com/> | Github
>>>><https://github.com/rmannibucau> | LinkedIn
>>>><https://www.linkedin.com/in/rmannibucau> | Book
>>>>
>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>> 
>>>> 
>>>>On Thu, 7 Jun 2018 at 16:17, Jean-Baptiste Onofré
>>>><j...@nanthrax.net> wrote:
>>>>>Thanks for the details Etienne !
>>>>> 
>>>>>The good news is that the artifacts seem OK and the overall Nexmark
>>>>>results are consistent with the 2.4.0 release ones.
>>>>> 
>>>>>I'm starting a complete review using the beam-samples as well.
>>>>> 
>>>>>Regards
>>>>>JB
>>>>> 
>>>>>>On 07/06/2018 16:14, Etienne Chauchot wrote:
>>>>>> Hi,
>>>>>> I've just run the nexmark queries on v2.5.0-RC1 tag
>>>>>> What we can notice:
>>>>>> - query 3 (exercises CoGroupByKey, state and timer) shows
>>>>>different
>>>>>> output with DR between batch and streaming and with the other
>>>>>runners =>
>>>>>> I compared with v2.4 there were still these differences but with
>>>>>> different output size numbers
>>>>>> 
>>>>>> - query 6 (exercises specialized combiner) shows different output
>>>>>> between the runners => the correct output is 401. strange that
>>>>>in batch
>>>>>> mode some runners output fewer Sellers. I compared with v2.4
>>>>>same output
>>>>>> 
>>>>>> - response time of query 7 (exercises Max transform, fanout
>>>>>and side
>>>>>> input) is very slow on DR => I compared with v2.4 , comparable
>>>>>execution
>>>>>> times
>>>>>> 
>>>>>> I'm not comparing q10 because it is a write to GCS so it is
>>>>>very specific.
>>>>>> 
>>>>>> => Basically no regression comparing to v2.4
>>>>>> 
>>>>>> For the record here is 

Re: [ANNOUNCEMENT] New committers, May 2018 edition!

2018-05-31 Thread Tim
Congratulations!

Tim

> On 1 Jun 2018, at 07:05, Andrew Psaltis  wrote:
> 
> Congrats!
> 
>> On Fri, Jun 1, 2018 at 12:26 AM, Thomas Weise  wrote:
>> Congrats!
>> 
>> 
>>> On Thu, May 31, 2018 at 9:25 PM, Alan Myrvold  wrote:
>>> Congrats Gris+Pablo+Jason. Well deserved.
>>> 
>>>> On Thu, May 31, 2018 at 9:15 PM Jason Kuster  
>>>> wrote:
>>>> Thank you to Davor and the PMC; I'm excited to be able to help Beam in 
>>>> this new capacity. Bring on the PRs. :D
>>>> 
>>>>> On Thu, May 31, 2018 at 8:55 PM Xin Wang  wrote:
>>>>> Congrats!
>>>>> 
>>>>> - Xin Wang
>>>>> 
>>>>> 2018-06-01 11:52 GMT+08:00 Rui Wang :
>>>>>> Congrats!
>>>>>> 
>>>>>> -Rui
>>>>>> 
>>>>>>> On Thu, May 31, 2018 at 8:23 PM Jean-Baptiste Onofré 
>>>>>>>  wrote:
>>>>>>> Congrats !
>>>>>>> 
>>>>>>> Regards
>>>>>>> JB
>>>>>>> 
>>>>>>> On 01/06/2018 04:08, Davor Bonaci wrote:
>>>>>>> > Please join me and the rest of Beam PMC in welcoming the following
>>>>>>> > contributors as our newest committers. They have significantly
>>>>>>> > contributed to the project in different ways, and we look forward to
>>>>>>> > many more contributions in the future.
>>>>>>> > 
>>>>>>> > * Griselda Cuevas
>>>>>>> > * Pablo Estrada
>>>>>>> > * Jason Kuster
>>>>>>> > 
>>>>>>> > (Apologizes for a delayed announcement, and the lack of the usual
>>>>>>> > paragraph summarizing individual contributions.)
>>>>>>> > 
>>>>>>> > Congratulations to all three! Welcome!
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Thanks,
>>>>> Xin
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Jason Kuster
>>>> Apache Beam / Google Cloud Dataflow
>>>> 
>> 
> 


Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-17 Thread Tim Robertson
Thank you all.

I think this is clear.  Removing findbugs can happen at a future point.

@Scott - I've been working through the java IO error prone issues (some
already merged, some with open PRs now) so will take those IO Jiras. I will
enable failOnWarning, address dependency issues for findbugs and tackle the
error prone warnings.


On Fri, May 18, 2018 at 1:07 AM, Scott Wegner <sweg...@google.com> wrote:

> +0.02173913
>
> I'm happy to replace FindBugs with ErrorProne, but we need to first
> upgrade ErrorProne analyzer warnings to errors. Currently the codebase is
> full of warning spam, and there's no enforcement preventing future
> violations from being added.
>
> I've done the work for enforcing ErrorProne analysis on java-sdk-core [1],
> and I've sharded out the rest of the Java components in JIRA issues [2] (45
> total).  Fixing the issues is relatively straightforward, and I've tried to
> provide enough guidance to make them as starter tasks (example: [3]). Teng
> Peng has already started on Spark [4] (thanks!)
>
> [1] https://github.com/apache/beam/pull/5319
> [2] https://issues.apache.org/jira/issues/?jql=project%20%
> 3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20errorprone
> [3] https://issues.apache.org/jira/browse/BEAM-4347
> [4] https://issues.apache.org/jira/browse/BEAM-4318
>
> On Thu, May 17, 2018 at 2:00 PM Ismaël Mejía <ieme...@gmail.com> wrote:
>
>> +0.7 also. Findbugs support for more recent versions of Java is lacking
>> and
>> the maintenance seems frozen in time.
>>
>> As a possible plan b can we identify the missing important validations to
>> identify how much we lose and if it is considerable, maybe we can create a
>> minimal configuration for those, and eventually migrate from findbugs to
>> spotbugs (https://github.com/spotbugs/spotbugs/) that seems at least to
>> be
>> maintained and the most active findbugs fork.
>>
>>
>> On Thu, May 17, 2018 at 9:31 PM Kenneth Knowles <k...@google.com> wrote:
>>
>> > +0.7 I think we should work to remove findbugs. Errorprone covers most
>> of
>> the same stuff but better and faster.
>>
>> > The one thing I'm not sure about is nullness analysis. Findbugs has some
>> serious limitations there but it really improves code quality and prevents
>> blunders. I'm not sure errorprone covers that. I know the Checker analyzer
>> has a full solution that makes NPE impossible as in most modern languages.
>> Maybe that is easy to plug in. The core Java SDK is a good candidate for
>> the first place to do it since it is affects everything else.
>>
>> > On Thu, May 17, 2018 at 12:02 PM Tim Robertson <
>> timrobertson...@gmail.com>
>> wrote:
>>
>> >> Hi all,
>> >> [bringing a side thread discussion from slack to here]
>>
>> >> We're tackling error-prone warnings now and we aim to fail the build on
>> warnings raised [1].
>>
>> >> Enabling failOnWarning also fails the build on findbugs warnings.
>> Currently I see places where these arise from a missing dependency on
>> findbugs_annotations and I asked on Slack the best way to introduce this
>> globally in Gradle.
>>
>> >> In that discussion the idea was floated to consider removing findbugs
>> completely given it is older, has licensing considerations and is not
>> released regularly.
>>
>> >> What do people think about this idea please?
>>
>> >> Thanks,
>> >> Tim
>> >> [1]
>> https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a
>> 7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E
>>
>


[DISCUSS] Remove findbugs from sdks/java

2018-05-17 Thread Tim Robertson
Hi all,
[bringing a side thread discussion from slack to here]

We're tackling error-prone warnings now and we aim to fail the build on
warnings raised [1].

Enabling failOnWarning also fails the build on findbugs warnings.
Currently I see places where these arise from a missing dependency on
findbugs_annotations, and I asked on Slack the best way to introduce this
globally in Gradle.

In that discussion the idea was floated to consider removing findbugs
completely given it is older, has licensing considerations and is not
released regularly.

What do people think about this idea please?

Thanks,
Tim
[1]
https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E


Re: ElasticsearchIOTest failed during gradle build

2018-05-17 Thread Tim Robertson
Hey folks,

I am new to Gradle, but Boyuan and I had a chat on the Beam Slack late
last night (CEST) about this.

Here are my notes I've collected from my build attempts but I haven't yet
isolated the problem:

  - seemingly only happens with -PisRelease
  - need --info and --stacktrace or else you miss detail
  - it is sporadic and happens on different projects
  - gradle caches come into play (a subsequent build might pass the stage)
- race condition?
- I remove ~/.gradle each time
  - I suspected jar signing - but I have commented that out and the issue
remains
  - zip exceptions I have seen include:
 -  archive is not a ZIP archive
 - invalid block type
 - too many length or distance symbols
  - It is using the zip reader org.apache.tools.zip.ZipFile (from Ant I
believe)

I hope this helps,
Tim


On Thu, May 17, 2018 at 3:15 PM, Etienne Chauchot <echauc...@apache.org>
wrote:

> Hey,
> Thanks for pointing out ! I'll take a look. Very strange ZipException
>
> Etienne
>
> On Wednesday, 16 May 2018 at 11:50 -0700, Boyuan Zhang wrote:
>
> Hey all,
>
> I'm working on debugging the process of release process and when running
> ./gradlew -PisRelease clean build, I got several tests failed. Here is one
> build scan: https://scans.gradle.com/s/t4ryx7y3jhdeo/console-log?
> task=:beam-sdks-java-io-elasticsearch-tests-5:test#L3. Any idea about why
> this happened?
>
> Thanks for all your help!
>
> Boyuan
>
>


Re: Jackson serialisation of GenericJson subclasses

2018-05-11 Thread Tim Robertson
You're very welcome.  Glad you have it sorted.


On Fri, May 11, 2018 at 12:48 PM, Carlos Alonso <car...@mrcalonso.com>
wrote:

> Hi Tim, many thanks for your help. It's definitely interesting, but
> unfortunately not useful this time, I think, as the JsonTypeInfo and
> JsonSubTypes annotations go on the base class, which, in my case, I don't
> own; and even if I did, I don't think I could list all the subclasses
> GenericJson has.
>
> I've discovered, though, that I can configure the ObjectMapper to add that
> type information to all objects using 'enableDefaultTyping':
> https://fasterxml.github.io/jackson-databind/javadoc/2.8/com/fasterxml/jackson/databind/ObjectMapper.html#enableDefaultTyping()
>
> Thanks!
>
> On Wed, May 9, 2018 at 8:01 PM Tim Robertson <timrobertson...@gmail.com>
> wrote:
>
>> Hi Carlos
>>
>> Here is an example of subclassing with Jackson using the @Type annotation
>> that I did many years ago:
>>   https://github.com/gbif/gbif-api/tree/master/src/main/java/
>> org/gbif/api/model/registry/eml/temporal
>>
>> It decorates the JSON with an extra field ("@Type" in this case) which
>> instructs the deserializers which Object to instantiate. I'm not sure if
>> newer Jackson versions have changed.
>>
>> I haven't considered if this is appropriate or not in your case, but I
>> hope this helps with the Jackson bit of your question at least.
>>
>> Best wishes,
>> Tim
>>
>> On Wed, May 9, 2018 at 7:02 PM, Carlos Alonso <car...@mrcalonso.com>
>> wrote:
>>
>>> Hi everyone!!
>>>
>>> I'm working on BEAM-4257 issue and the approach I'm following is to
>>> create a new class 'BigQueryInsertError' that also extends 'GenericJson'
>>> and that contains three keys 'TableRow row', 
>>> 'TableDataInsertAllResponse.InsertErrors
>>> error', and 'TableReference ref' and use this type as the contained items
>>> returned by WriteResults.getFailedInserts
>>>
>>> I have now to create a Coder for this new type and I'm following the
>>> TableRowJsonCoder way https://github.com/apache/
>>> beam/blob/master/sdks/java/io/google-cloud-platform/src/
>>> main/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowJsonCoder.java#L34 by
>>> relying on Jackson's ObjectMapper and StringUtf8Encoder.
>>>
>>> The problem is that I always get errors when deserialising as it
>>> deserialises the inner TableRow as a LinkedHashMap and fails when trying to
>>> assign it. Here you can see the full stacktrace: https://pastebin.
>>> com/MkUD9L3W
>>>
>>> Testing it a bit further I've spotted other GenericJson subclasses that
>>> cannot be encoded/decoded following that method. For example
>>> TableDataInsertAllResponse.InsertErrors itself. See the example below:
>>>
>>> TableDataInsertAllResponse.InsertErrors err = new 
>>> TableDataInsertAllResponse.InsertErrors().setIndex(0L);
>>> ObjectMapper mapper = new 
>>> ObjectMapper().disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
>>> mapper.readValue(mapper.writeValueAsString(err), 
>>> TableDataInsertAllResponse.InsertErrors.class);
>>>
>>>
>>> Fails with a similar error, but in this case is because it deserialises
>>> the index into an int: https://pastebin.com/bzXMR3z5
>>>
>>> So a couple of questions here:
>>> * Which is the appropriate way of encoding/decoding GenericJson
>>> subclasses? (Maybe this issues can be tackled using Jackson's type
>>> annotations, but I'm quite a newbie on Jackson and I couldn't figure out
>>> how)
>>> * This will (hopefully) be my very first contribution to Apache Beam and
>>> I'd like to get some feedback/comments/ideas/... on the issue and the
>>> suggested solution.
>>>
>>> Thanks everyone!
>>>
>>
>>
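
For reference, a minimal sketch of the two routes discussed in this thread.
The Shape/Circle names are illustrative; both routes are standard Jackson
2.x, the annotation route requires owning the base class, and default
typing is a global ObjectMapper setting to use with care:

import com.fasterxml.jackson.annotation.JsonSubTypes;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.databind.ObjectMapper;

public class PolymorphicJsonSketch {
  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME,
                include = JsonTypeInfo.As.PROPERTY,
                property = "@type")
  @JsonSubTypes({@JsonSubTypes.Type(value = Circle.class, name = "circle")})
  public abstract static class Shape {}

  public static class Circle extends Shape {
    public double radius;
  }

  public static void main(String[] args) throws Exception {
    // Route 1: subtype annotations; "@type" records the concrete class
    ObjectMapper mapper = new ObjectMapper();
    String json = mapper.writeValueAsString(new Circle()); // {"@type":"circle","radius":0.0}
    Shape back = mapper.readValue(json, Shape.class);      // comes back as a Circle
    System.out.println(back.getClass().getSimpleName());

    // Route 2: default typing, for base classes you do not own
    ObjectMapper typed = new ObjectMapper().enableDefaultTyping();
  }
}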


Re: Jackson serialisation of GenericJson subclasses

2018-05-09 Thread Tim Robertson
Hi Carlos

Here is an example of subclassing with Jackson using the @Type annotation
that I did many years ago:

https://github.com/gbif/gbif-api/tree/master/src/main/java/org/gbif/api/model/registry/eml/temporal

It decorates the JSON with an extra field ("@Type" in this case) which
instructs the deserializers which Object to instantiate. I'm not sure if
newer Jackson versions have changed.

I haven't considered if this is appropriate or not in your case, but I hope
this helps with the Jackson bit of your question at least.

Best wishes,
Tim

On Wed, May 9, 2018 at 7:02 PM, Carlos Alonso <car...@mrcalonso.com> wrote:

> Hi everyone!!
>
> I'm working on BEAM-4257 issue and the approach I'm following is to create
> a new class 'BigQueryInsertError' that also extends 'GenericJson' and that
> contains three keys 'TableRow row', 'TableDataInsertAllResponse.InsertErrors
> error', and 'TableReference ref' and use this type as the contained items
> returned by WriteResults.getFailedInserts
>
> I have now to create a Coder for this new type and I'm following the
> TableRowJsonCoder way https://github.com/apache/
> beam/blob/master/sdks/java/io/google-cloud-platform/src/
> main/java/org/apache/beam/sdk/io/gcp/bigquery/TableRowJsonCoder.java#L34 by
> relying on Jackson's ObjectMapper and StringUtf8Encoder.
>
> The problem is that I always get errors when deserialising as it
> deserialises the inner TableRow as a LinkedHashMap and fails when trying to
> assign it. Here you can see the full stacktrace: https://pastebin.
> com/MkUD9L3W
>
> Testing it a bit further I've spotted other GenericJson subclasses that
> cannot be encoded/decoded following that method. For example
> TableDataInsertAllResponse.InsertErrors itself. See the example below:
>
> TableDataInsertAllResponse.InsertErrors err = new 
> TableDataInsertAllResponse.InsertErrors().setIndex(0L);
> ObjectMapper mapper = new 
> ObjectMapper().disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
> mapper.readValue(mapper.writeValueAsString(err), 
> TableDataInsertAllResponse.InsertErrors.class);
>
>
> Fails with a similar error, but in this case is because it deserialises
> the index into an int: https://pastebin.com/bzXMR3z5
>
> So a couple of questions here:
> * Which is the appropriate way of encoding/decoding GenericJson
> subclasses? (Maybe this issues can be tackled using Jackson's type
> annotations, but I'm quite a newbie on Jackson and I couldn't figure out
> how)
> * This will (hopefully) be my very first contribution to Apache Beam and
> I'd like to get some feedback/comments/ideas/... on the issue and the
> suggested solution.
>
> Thanks everyone!
>


Re: DirectRunner in test - await completion of workers threads?

2018-04-05 Thread Tim Robertson
Will do - I'll report the result on https://github.com/apache/beam/pull/4905


On Thu, Apr 5, 2018 at 11:45 AM, Ismaël Mejía <ieme...@gmail.com> wrote:

> For info, Romain's PR was merged today; can you confirm whether it fixes
> the issue, Tim?
>
> On Sun, Apr 1, 2018 at 9:21 PM, Tim Robertson <timrobertson...@gmail.com>
> wrote:
> > Thanks all.
> >
> > I went with what I outlined above, which you can see in this test.
> > https://github.com/timrobertson100/beam/blob/
> BEAM-3848/sdks/java/io/solr/src/test/java/org/apache/beam/
> sdk/io/solr/SolrIOTest.java#L285
> >
> > That forms part of this PR https://github.com/apache/beam/pull/4956
> >
> > I'll monitor Romain's PR and back it out when appropriate.
> >
> >
> >
> >
> >
> > On Sun, Apr 1, 2018 at 8:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >>
> >> Indeed. It's exactly what Romain's PR is about.
> >>
> >> Regards
> >> JB
> >>> On 1 Apr 2018, at 19:33, Reuven Lax <re...@google.com> wrote:
> >>>
> >>> Correct - teardown is currently run in the direct runner, but
> >>> asynchronously. I believe Romain's pending PRs should solve this for
> your
> >>> use case.
> >>>
> >>> On Sun, Apr 1, 2018 at 3:13 AM Tim Robertson <
> timrobertson...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Thanks for confirming Romain - also for the very fast reply!
> >>>>
> >>>> I'll continue with the workaround and reference BEAM-3409 inline as
> >>>> justification.
> >>>> I'm trying to wrap this up before travel next week, but if I get a
> >>>> chance I'll try and run this scenario (BEAM-3848) with your patch.
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Apr 1, 2018 at 12:05 PM, Romain Manni-Bucau
> >>>> <rmannibu...@gmail.com> wrote:
> >>>>>
> >>>>> Hi
> >>>>>
> >>>>> I have the same blocker and created
> >>>>>
> >>>>> https://github.com/apache/beam/pull/4790 and
> >>>>> https://github.com/apache/beam/pull/4965 to solve part of it
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 1 Apr 2018 at 11:35, "Tim Robertson" <timrobertson...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Hi devs
> >>>>>
> >>>>> I'm working on SolrIO tests for failure scenarios (i.e. an exception
> >>>>> will come out of the pipeline execution).  I see that the exception
> is
> >>>>> surfaced to the driver while " direct-runner-worker" threads are
> still
> >>>>> running.  This causes issue because:
> >>>>>
> >>>>>   1. The Solr tests do thread leak detection, and a
> solrClient.close()
> >>>>> is what removes the object
> >>>>>   2. @Teardown is not necessarily called which is what would close
> the
> >>>>> solrClient
> >>>>>
> >>>>> I can unregister all the solrClients that have been spawned. However I
> >>>>> have seen race conditions where there are still threads running that
> >>>>> create and register clients. I need to somehow ensure that all workers
> >>>>> related to the pipeline execution are indeed finished, so no new ones
> >>>>> are created after the first exception is passed up.
> >>>>>
> >>>>> Currently I have this (pseudo code) which works, but I suspect someone
> >>>>> can suggest a better approach:
> >>>>>
> >>>>> // store the state of clients registered for object leak check
> >>>>> Set<SolrClient> existingClients = registeredSolrClients();
> >>>>> try {
> >>>>>   pipeline.run();
> >>>>>
> >>>>> } catch (Pipeline.PipelineExecutionException e) {
> >>>>>
> >>>>>   // Hack: await all bundle workers completing
> >>>>>   while (namedThreadStillExists("direct-runner-worker")) {
> >>>>>     Thread.sleep(100);
> >>>>>   }
> >>>>>
> >>>>>   // remove all solrClients created in this execution only,
> >>>>>   // since the teardown may not have done so
> >>>>>   for (Object o : ObjectReleaseTracker.OBJECTS.keySet()) {
> >>>>>     if (o instanceof SolrClient && !existingClients.contains(o)) {
> >>>>>       ObjectReleaseTracker.release(o);
> >>>>>     }
> >>>>>   }
> >>>>>
> >>>>>   // now we can do our assertions
> >>>>>   expectedLogs.verifyWarn(
> >>>>>       String.format(SolrIO.Write.WriteFn.RETRY_ATTEMPT_LOG, 1));
> >>>>> }
> >>>>>
> >>>>>
> >>>>> Please do point out the obvious if I am missing it - I am a newbie
> >>>>> here...
> >>>>>
> >>>>> Thank you all very much,
> >>>>> Tim
> >>>>> ( timrobertson...@gmail.com on the slack apache/beam channel)
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >
>


Re: DirectRunner in test - await completion of workers threads?

2018-04-01 Thread Tim Robertson
Thanks all.

I went with what I outlined above, which you can see in this test.
https://github.com/timrobertson100/beam/blob/BEAM-3848/sdks/java/io/solr/src/test/java/org/apache/beam/sdk/io/solr/SolrIOTest.java#L285

That forms part of this PR https://github.com/apache/beam/pull/4956

I'll monitor Romain's PR and back it out when appropriate.





On Sun, Apr 1, 2018 at 8:20 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Indeed. It's exactly what Romain's PR is about.
>
> Regards
> JB
> On 1 Apr 2018, at 19:33, Reuven Lax <re...@google.com> wrote:
>
>> Correct - teardown is currently run in the direct runner, but
>> asynchronously. I believe Romain's pending PRs should solve this for your
>> use case.
>>
>> On Sun, Apr 1, 2018 at 3:13 AM Tim Robertson <timrobertson...@gmail.com>
>> wrote:
>>
>>> Thanks for confirming Romain - also for the very fast reply!
>>>
>>> I'll continue with the workaround and reference BEAM-3409 inline as
>>> justification.
>>> I'm trying to wrap this up before travel next week, but if I get a
>>> chance I'll try and run this scenario (BEAM-3848) with your patch.
>>>
>>>
>>>
>>> On Sun, Apr 1, 2018 at 12:05 PM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I have the same blocker and created
>>>>
>>>> https://github.com/apache/beam/pull/4790 and
>>>> https://github.com/apache/beam/pull/4965 to solve part of it
>>>>
>>>>
>>>>
>>>> On 1 Apr 2018 11:35, "Tim Robertson" <timrobertson...@gmail.com> wrote:
>>>>
>>>> Hi devs
>>>>
>>>> I'm working on SolrIO tests for failure scenarios (i.e. an exception
>>>> will come out of the pipeline execution). I see that the exception is
>>>> surfaced to the driver while "direct-runner-worker" threads are still
>>>> running. This causes issues because:
>>>>
>>>>   1. The Solr tests do thread leak detection, and a solrClient.close()
>>>>      is what removes the object
>>>>   2. @Teardown is not necessarily called, which is what would close the
>>>>      solrClient
>>>>
>>>> I can unregister all the solrClients that have been spawned. However I
>>>> have seen race conditions where there are still threads running that
>>>> create and register clients. I need to somehow ensure that all workers
>>>> related to the pipeline execution are indeed finished, so no new ones are
>>>> created after the first exception is passed up.
>>>>
>>>> Currently I have this (pseudo code) which works, but I suspect someone
>>>> can suggest a better approach:
>>>>
>>>> // store the state of clients registered for object leak check
>>>> Set<SolrClient> existingClients = registeredSolrClients();
>>>> try {
>>>>   pipeline.run();
>>>>
>>>> } catch (Pipeline.PipelineExecutionException e) {
>>>>
>>>>   // Hack: await all bundle workers completing
>>>>   while (namedThreadStillExists("direct-runner-worker")) {
>>>>     Thread.sleep(100);
>>>>   }
>>>>
>>>>   // remove all solrClients created in this execution only,
>>>>   // since the teardown may not have done so
>>>>   for (Object o : ObjectReleaseTracker.OBJECTS.keySet()) {
>>>>     if (o instanceof SolrClient && !existingClients.contains(o)) {
>>>>       ObjectReleaseTracker.release(o);
>>>>     }
>>>>   }
>>>>
>>>>   // now we can do our assertions
>>>>   expectedLogs.verifyWarn(
>>>>       String.format(SolrIO.Write.WriteFn.RETRY_ATTEMPT_LOG, 1));
>>>> }
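>>>>
>>>> For context, namedThreadStillExists is just a small helper in the
>>>> test. A minimal sketch of how such a helper could look (the substring
>>>> match below is an assumption, not the exact code):
>>>>
>>>> // Sketch: true while any live thread's name contains the fragment.
>>>> private static boolean namedThreadStillExists(String nameFragment) {
>>>>   for (Thread t : Thread.getAllStackTraces().keySet()) {
>>>>     if (t.isAlive() && t.getName().contains(nameFragment)) {
>>>>       return true;
>>>>     }
>>>>   }
>>>>   return false;
>>>> }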
>>>>
>>>>
>>>> Please do point out the obvious if I am missing it - I am a newbie
>>>> here...
>>>>
>>>> Thank you all very much,
>>>> Tim
>>>> ( timrobertson...@gmail.com on the slack apache/beam channel)
>>>>
>>>>
>>>>
>>>>
>>>


Starter issue

2017-01-11 Thread Tim Taschke
Hi,

I would like to get started with contributing and thought I'd start
with this, if that is ok:
https://issues.apache.org/jira/browse/BEAM-1056

Could somebody please assign it to me?

Best regards,
Tim