Re: Projects Can Apply Individually for Google Season of Docs

2019-04-17 Thread Ahmet Altay
Thanks Aizhamal, I completed the forms.

On Wed, Apr 17, 2019 at 6:46 PM Aizhamal Nurmamat kyzy 
wrote:

> Hi everyone,
>
> Here are a few updates on our application for Season of Docs:
>
> 1. Pablo and I have created the following document [1] with some of the
> project ideas shared in the mailing list. If you have more ideas, please
> add them to the doc and provide a description. If you also want to be a
> mentor for any of the proposed ideas, please add your name to the table.
>
> 2. To submit our application, we need to publish our project ideas list.
> For this, we have opened Jira tickets with the “gsod2019” tag [2]. Should
> we also add a small blog post on the Beam website that collects all the
> ideas in one place [3]? Please let me know what you think.
>
> 3. By next week Tuesday (Application Deadline)
>
>    - +pabl...@apache.org , please complete the org application form [4]
>    - @Ahmet Altay , please complete the alternative administrator form [5]
>    - @pabl...@apache.org , @Ahmet Altay , and all other contributors that
>      want to participate as mentors, please complete the mentor
>      registration form [6]
>
>
> Thank you,
>
> Aizhamal
>
>
> [1]
> https://docs.google.com/document/d/1FNf-BjB4Q7PDdqygPboLr7CyIeo6JAkrt0RBgs2I4dE/edit#
>
> [2]
> https://issues.apache.org/jira/browse/BEAM-7104?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20gsod2019
>
> [3] https://beam.apache.org/blog/
>
> [4]
> https://docs.google.com/forms/d/e/1FAIpQLScrEq5yKmadgn7LEPC8nN811-6DNmYvus5uXv_JY5BX7CH-Bg/viewform
>
> [5]
> https://docs.google.com/forms/d/e/1FAIpQLSc5ZsBzqfsib-epktZp8bYxL_hO4RhT_Zz8AY6zXDHB79ue9g/viewform
> [6]
> https://docs.google.com/forms/d/e/1FAIpQLSe-JjGvaKKGWZOXxrorONhB8qN3mjPrB9ZVkcsntR73Cv_K7g/viewform
>
> On Wed, Apr 10, 2019 at 2:57 PM Pablo Estrada  wrote:
>
>> I'd be happy to be a mentor for this to help add getting started
>> documentation for Python on Flink. I'd want to focus on the reviews and
>> less on the administration - so I'm willing to be a secondary administrator
>> if that's necessary to move forward, but I'd love it if someone would help
>> administer.
>> FWIW, neither the administrator nor any other mentor has to be a
>> committer.
>>
>> Anyone willing to be primary administrator and also a mentor?
>>
>> Thanks
>> -P.
>>
>> On Fri, Apr 5, 2019 at 9:40 AM Kenneth Knowles  wrote:
>>
>>> Yes, this is great. Thanks for noticing the call and pushing ahead on
>>> this, Aizhamal!
>>>
>>> I would also like to see the runner comparison revamp at
>>> https://issues.apache.org/jira/browse/BEAM-2888 which would help users
>>> really understand what they can and cannot do in plain terms.
>>>
>>> Kenn
>>>
>>> On Fri, Apr 5, 2019 at 9:30 AM Ahmet Altay  wrote:
>>>
 Thank you Aizhamal for volunteering. I am happy to help as an
 administrator.

 cc: +Rose Nguyen  +Melissa Pashniak
 in case they are interested in mentorship
 and/or administration.




 On Fri, Apr 5, 2019 at 9:16 AM Thomas Weise  wrote:

> This is great. Beam documentation needs work in several areas: Python
> SDK, portability and SQL come to mind right away :)
>
>
> On Thu, Apr 4, 2019 at 4:21 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
>
>> Hello everyone,
>>
>> As the ASF announced that each project can apply for Season of Docs
>> individually, I would like to volunteer to be one of the administrators
>> for the program. Is this fine with everyone in the community? If so, I
>> will start working on the application on behalf of Beam this week, and I
>> will send updates on this thread with progress.
>>
>> The program requires two administrators, so any volunteers would be
>> appreciated. I’m happy to take on the administrative load and partner
>> with committers or PMC members. We will also need at least two mentors
>> for the program, to onboard tech writers to the project and work with
>> them closely during the 3-month period. Please express your interest in
>> the thread :)
>>
>> If you have ideas to work on for Season of Docs, please let me know
>> directly, or file a JIRA issue and add the "gsod" and "gsod2019" labels
>> to it. This will help us gather the ideas and put them together in the
>> application.
>>
>> Thanks everybody,
>> Aizhamal
>>
>>
>> On Wed, Apr 3, 2019 at 1:55 PM  wrote:
>>
>>> Hi All
>>>
>>> Initially, the ASF as an organisation was planning to apply as a
>>> mentoring organisation for Google Season of Docs on behalf of all Apache
>>> projects, but if accepted, the maximum number of technical writers we
>>> could be allocated is two. Two technical writers would probably not be
>>> enough to cover the potential demand from all our projects interested in
>>> 

Re: Go SDK status

2019-04-17 Thread Robert Burke
Oh dang. Thanks for mentioning that! Here's an open copy of the versioning
thoughts doc, though there shouldn't be any surprises from the points I
mentioned above.

https://docs.google.com/document/d/1ZjP30zNLWTu_WzkWbgY8F_ZXlA_OWAobAD9PuohJxPg/edit#heading=h.drpipq762xi7

On Wed, 17 Apr 2019 at 21:20, Nathan Fisher  wrote:

> Hi Robert,
>
> Great summary on the current state of play. FYI, the referenced G doc
> isn't visible to people outside the org by default.
>
> Great to hear the Go SDK is still getting love. I last looked at it in
> September-October of last year.
>
> Cheers,
> Nathan
>
> On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik  wrote:
>
>> Thanks for the in-depth summary.
>>
>> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke  wrote:
>>
>>> Hi Thomas! I'm so glad you asked!
>>>
>>> The status of the Go SDK is complicated, so this email can't be brief.
>>> There are several dimensions to consider: as a Go Open Source Project,
>>> User Libraries and Experience, and on Beam Features.
>>>
>>> I'm going to be updating the roadmap later this month when I have a
>>> spare moment.
>>>
>>> *tl;dr;*
>>> I would *love* help in improving the Go SDK, especially around
>>> interactions with Java/Python/Flink. Java and I do not have a good working
>>> relationship for operational purposes, and the last time I used Python, I
>>> had to re-image my machine. There's lots to do, but shouting out tasks to
>>> the void is rarely as productive as it is cathartic. If there's an offer to
>>> help, and a preference for/experience with  something to work on, I'm
>>> willing to find something useful to get started on for you.
>>>
>>> (Note: The following are simply my opinion as someone who works with the
>>> project weekly as a Go programmer, and should not be treated as demands or
>>> gospel. I just don't have anyone to talk about Go SDK issues with, and my
>>> previous discussions have largely seemed to fall on uninterested ears.)
>>>
>>> *The SDK can be considered Alpha when all of the following are true:*
>>> * The SDK is tested by the Beam project on a ULR and on Flink as well as
>>> Dataflow.
>>> * The IOs have received some love to ensure they can scale (either
>>> through SDF or reshuffles), and be portable to different environments (eg.
>>> using the Go Cloud Development Kit (CDK) libraries).
>>>* Cross-Language IO support would also be acceptable.
>>> * The SDK is using Go Modules for dependency management, marking it as
>>> version 0.Minor (where Minor should probably track the mainline Beam minor
>>> version for now).
>>>
>>> *We can move to calling it Beta when all of the following are true:*
>>> * All implemented Beam features are meaningfully tested on the
>>> portable runners (eg. a proper "Validates Runner" suite exists in Go)
>>> * The SDK is properly documented on the Beam site, and in its Go Docs.
>>>
>>> After this, I'll be more comfortable recommending it as something folks
>>> can use for production.
>>> That said, there are happy paths that are usable today in batch
>>> situations.
>>>
>>> *Intro*
>>> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
>>> system at all, it's being run portably. Currently it's regularly tested on
>>> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
>>> at this time), and on its own single-bundle Direct Runner (intended for
>>> unit testing purposes). In addition, it's being tested at scale within
>>> Google, on an internal runner, where it presently satisfies our performance
>>> benchmarks, and correctness tests.
>>>
>>> I've been working on cases to make the SDK suitable for data processing
>>> within Google. This unfortunately makes my contributions more towards
>>> general SDK usability, documentation, and performance, rather than "making
>>> it usable outside Google". Note this also precludes necessary work to
>>> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
>>> believe that the SDK must become a good member of both the Go ecosystem
>>> and the Beam ecosystem.
>>>
>>> Improved Go Docs are on their way, and Daniel Oliviera has been helping
>>> me make the "getting started" experience better by improving pipeline
>>> construction time error messages.
>>>
>>> Finally, many of the following issues have JIRAs already; some don't. It
>>> would take me time I don't have to audit and line everything up for this
>>> email, please look before you file JIRAs for things mentioned below, should
>>> the urge strike you.
>>>
>>>
>>> *As a Go Open Source Project*As an open source project written in Go,
>>> the SDK is lagging on adopting Go Modules for Dependency Management and
>>> Versioning.
>>>
>>> Using Go Modules would ensure that what the Beam project
>>> infrastructure is testing is what users are getting. I'm very happy to
>>> elaborate on this, and have a bit I wrote about it two months ago on the
>>> topic[1]. But I loathe sending out plans for things that I don't have time
>>> to work on, so 

Re: Go SDK status

2019-04-17 Thread Nathan Fisher
Hi Robert,

Great summary on the current state of play. FYI, the referenced G doc
isn't visible to people outside the org by default.

Great to hear the Go SDK is still getting love. I last looked at it in
September-October of last year.

Cheers,
Nathan

On Wed, 17 Apr 2019 at 20:27, Lukasz Cwik  wrote:

> Thanks for the in-depth summary.
>
> On Mon, Apr 15, 2019 at 4:19 PM Robert Burke  wrote:
>
>> Hi Thomas! I'm so glad you asked!
>>
>> The status of the Go SDK is complicated, so this email can't be brief.
>> There are several dimensions to consider: as a Go Open Source Project,
>> User Libraries and Experience, and on Beam Features.
>>
>> I'm going to be updating the roadmap later this month when I have a spare
>> moment.
>>
>> *tl;dr;*
>> I would *love* help in improving the Go SDK, especially around
>> interactions with Java/Python/Flink. Java and I do not have a good working
>> relationship for operational purposes, and the last time I used Python, I
>> had to re-image my machine. There's lots to do, but shouting out tasks to
>> the void is rarely as productive as it is cathartic. If there's an offer to
>> help, and a preference for/experience with  something to work on, I'm
>> willing to find something useful to get started on for you.
>>
>> (Note: The following are simply my opinion as someone who works with the
>> project weekly as a Go programmer, and should not be treated as demands or
>> gospel. I just don't have anyone to talk about Go SDK issues with, and my
>> previous discussions have largely seemed to fall on uninterested ears.)
>>
>> *The SDK can be considered Alpha when all of the following are true:*
>> * The SDK is tested by the Beam project on a ULR and on Flink as well as
>> Dataflow.
>> * The IOs have received some love to ensure they can scale (either
>> through SDF or reshuffles), and be portable to different environments (eg.
>> using the Go Cloud Development Kit (CDK) libraries).
>>* Cross-Language IO support would also be acceptable.
>> * The SDK is using Go Modules for dependency management, marking it as
>> version 0.Minor (where Minor should probably track the mainline Beam minor
>> version for now).
>>
>> *We can move to calling it Beta when all of the following are true:*
>> * All implemented Beam features are meaningfully tested on the
>> portable runners (eg. a proper "Validates Runner" suite exists in Go)
>> * The SDK is properly documented on the Beam site, and in its Go Docs.
>>
>> After this, I'll be more comfortable recommending it as something folks
>> can use for production.
>> That said, there are happy paths that are usable today in batch
>> situations.
>>
>> *Intro*
>> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
>> system at all, it's being run portably. Currently it's regularly tested on
>> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
>> at this time), and on its own single-bundle Direct Runner (intended for
>> unit testing purposes). In addition, it's being tested at scale within
>> Google, on an internal runner, where it presently satisfies our performance
>> benchmarks, and correctness tests.
>>
>> I've been working on cases to make the SDK suitable for data processing
>> within Google. This unfortunately makes my contributions more towards
>> general SDK usability, documentation, and performance, rather than "making
>> it usable outside Google". Note this also precludes necessary work to
>> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
>> believe that the SDK must become a good member of both the Go ecosystem
>> and the Beam ecosystem.
>>
>> Improved Go Docs are on their way, and Daniel Oliviera has been helping
>> me make the "getting started" experience better by improving pipeline
>> construction time error messages.
>>
>> Finally, many of the following issues have JIRAs already; some don't. It
>> would take me time I don't have to audit and line everything up for this
>> email, please look before you file JIRAs for things mentioned below, should
>> the urge strike you.
>>
>>
>> *As a Go Open Source Project*As an open source project written in Go,
>> the SDK is lagging on adopting Go Modules for Dependency Management and
>> Versioning.
>>
>> Using Go Modules would ensure that what the Beam project
>> infrastructure is testing is what users are getting. I'm very happy to
>> elaborate on this, and have a bit I wrote about it two months ago on the
>> topic[1]. But I loathe sending out plans for things that I don't have time
>> to work on, so it's only coming to light now.
>>
>> The short points are:
>> * Go is opinionated about versioning since Go 1.11, when Modules were
>> introduced. They allow for reproducible builds with versioned deps,
>> supported by the Go language tools.
>> * Packages at version 1 or greater must not make breaking changes. We're
>> not there with the SDK yet (certainly not a 2.11 product), so IMO the
>> SDK should be considered v0.X
>> * I 

Re: Projects Can Apply Individually for Google Season of Docs

2019-04-17 Thread Aizhamal Nurmamat kyzy
Hi everyone,

Here are a few updates on our application for Season of Docs:

1. Pablo and I have created the following document [1] with some of the
project ideas shared in the mailing list. If you have more ideas, please
add them to the doc and provide a description. If you also want to be a
mentor for any of the proposed ideas, please add your name to the table.

2. To submit our application, we need to publish our project ideas list.
For this, we have opened Jira tickets with the “gsod2019” tag [2]. Should
we also add a small blog post on the Beam website that collects all the
ideas in one place [3]? Please let me know what you think.

3. By next week Tuesday (Application Deadline)

   - +pabl...@apache.org , please complete the org application form [4]
   - @Ahmet Altay , please complete the alternative administrator form [5]
   - @pabl...@apache.org , @Ahmet Altay , and all other contributors that
     want to participate as mentors, please complete the mentor
     registration form [6]


Thank you,

Aizhamal


[1]
https://docs.google.com/document/d/1FNf-BjB4Q7PDdqygPboLr7CyIeo6JAkrt0RBgs2I4dE/edit#

[2]
https://issues.apache.org/jira/browse/BEAM-7104?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20gsod2019

[3] https://beam.apache.org/blog/

[4]
https://docs.google.com/forms/d/e/1FAIpQLScrEq5yKmadgn7LEPC8nN811-6DNmYvus5uXv_JY5BX7CH-Bg/viewform

[5]
https://docs.google.com/forms/d/e/1FAIpQLSc5ZsBzqfsib-epktZp8bYxL_hO4RhT_Zz8AY6zXDHB79ue9g/viewform
[6]
https://docs.google.com/forms/d/e/1FAIpQLSe-JjGvaKKGWZOXxrorONhB8qN3mjPrB9ZVkcsntR73Cv_K7g/viewform

On Wed, Apr 10, 2019 at 2:57 PM Pablo Estrada  wrote:

> I'd be happy to be a mentor for this to help add getting started
> documentation for Python on Flink. I'd want to focus on the reviews and
> less on the administration - so I'm willing to be a secondary administrator
> if that's necessary to move forward, but I'd love it if someone would help
> administer.
> FWIW, neither the administrator nor any other mentor has to be a committer.
>
> Anyone willing to be primary administrator and also a mentor?
>
> Thanks
> -P.
>
> On Fri, Apr 5, 2019 at 9:40 AM Kenneth Knowles  wrote:
>
>> Yes, this is great. Thanks for noticing the call and pushing ahead on
>> this, Aizhamal!
>>
>> I would also like to see the runner comparison revamp at
>> https://issues.apache.org/jira/browse/BEAM-2888 which would help users
>> really understand what they can and cannot do in plain terms.
>>
>> Kenn
>>
>> On Fri, Apr 5, 2019 at 9:30 AM Ahmet Altay  wrote:
>>
>>> Thank you Aizhamal for volunteering. I am happy to help as an
>>> administrator.
>>>
>>> cc: +Rose Nguyen  +Melissa Pashniak
>>>  in case they are interested in mentorship
>>> and/or administration.
>>>
>>>
>>>
>>>
>>> On Fri, Apr 5, 2019 at 9:16 AM Thomas Weise  wrote:
>>>
 This is great. Beam documentation needs work in several areas: Python
 SDK, portability and SQL come to mind right away :)


 On Thu, Apr 4, 2019 at 4:21 PM Aizhamal Nurmamat kyzy <
 aizha...@google.com> wrote:

> Hello everyone,
>
> As the ASF announced that each project can apply for Season of Docs
> individually, I would like to volunteer to be one of the administrators
> for the program. Is this fine with everyone in the community? If so, I
> will start working on the application on behalf of Beam this week, and I
> will send updates on this thread with progress.
>
> The program requires two administrators, so any volunteers would be
> appreciated. I’m happy to take on the administrative load and partner
> with committers or PMC members. We will also need at least two mentors
> for the program, to onboard tech writers to the project and work with
> them closely during the 3-month period. Please express your interest in
> the thread :)
>
> If you have ideas to work on for Season of Docs, please let me know
> directly, or file a JIRA issue and add the "gsod" and "gsod2019" labels
> to it. This will help us gather the ideas and put them together in the
> application.
>
> Thanks everybody,
> Aizhamal
>
>
> On Wed, Apr 3, 2019 at 1:55 PM  wrote:
>
>> Hi All
>>
>> Initially, the ASF as an organisation was planning to apply as a
>> mentoring organisation for Google Season of Docs on behalf of all Apache
>> projects, but if accepted, the maximum number of technical writers we
>> could be allocated is two. Two technical writers would probably not be
>> enough to cover the potential demand from all our projects interested in
>> participating.
>>
>> We've received feedback from Google that individual projects can
>> apply. I will withdraw the ASF application so that any Apache project
>> interested can apply individually for Season of Docs and so have the
>> potential of being allocated a technical writer.
>>

Re: Go SDK status

2019-04-17 Thread Lukasz Cwik
Thanks for the in-depth summary.

On Mon, Apr 15, 2019 at 4:19 PM Robert Burke  wrote:

> Hi Thomas! I'm so glad you asked!
>
> The status of the Go SDK is complicated, so this email can't be brief.
> There are several dimensions to consider: as a Go Open Source Project,
> User Libraries and Experience, and on Beam Features.
>
> I'm going to be updating the roadmap later this month when I have a spare
> moment.
>
> *tl;dr;*
> I would *love* help in improving the Go SDK, especially around
> interactions with Java/Python/Flink. Java and I do not have a good working
> relationship for operational purposes, and the last time I used Python, I
> had to re-image my machine. There's lots to do, but shouting out tasks to
> the void is rarely as productive as it is cathartic. If there's an offer to
> help, and a preference for/experience with  something to work on, I'm
> willing to find something useful to get started on for you.
>
> (Note: The following are simply my opinion as someone who works with the
> project weekly as a Go programmer, and should not be treated as demands or
> gospel. I just don't have anyone to talk about Go SDK issues with, and my
> previous discussions have largely seemed to fall on uninterested ears.)
>
> *The SDK can be considered Alpha when all of the following are true:*
> * The SDK is tested by the Beam project on a ULR and on Flink as well as
> Dataflow.
> * The IOs have received some love to ensure they can scale (either through
> SDF or reshuffles), and be portable to different environments (eg. using
> the Go Cloud Development Kit (CDK) libraries).
>* Cross-Language IO support would also be acceptable.
> * The SDK is using Go Modules for dependency management, marking it as
> version 0.Minor (where Minor should probably track the mainline Beam minor
> version for now).
>
> *We can move to calling it Beta when all of the following are true:*
> * All implemented Beam features are meaningfully tested on the
> portable runners (eg. a proper "Validates Runner" suite exists in Go)
> * The SDK is properly documented on the Beam site, and in its Go Docs.
>
> After this, I'll be more comfortable recommending it as something folks
> can use for production.
> That said, there are happy paths that are usable today in batch
> situations.
>
> *Intro*
> The Go SDK is a purely Beam Portable SDK. If it runs on a distributed
> system at all, it's being run portably. Currently it's regularly tested on
> Google Cloud Dataflow (though Dataflow doesn't officially support the SDK
> at this time), and on its own single-bundle Direct Runner (intended for
> unit testing purposes). In addition, it's being tested at scale within
> Google, on an internal runner, where it presently satisfies our performance
> benchmarks, and correctness tests.
>
> I've been working on cases to make the SDK suitable for data processing
> within Google. This unfortunately makes my contributions more towards
> general SDK usability, documentation, and performance, rather than "making
> it usable outside Google". Note this also precludes necessary work to
> resolve issues with running Go SDK pipelines on Google Cloud Dataflow. I
> believe that the SDK must become a good member of both the Go ecosystem
> and the Beam ecosystem.
>
> Improved Go Docs are on their way, and Daniel Oliviera has been helping
> me make the "getting started" experience better by improving pipeline
> construction time error messages.
>
> Finally, many of the following issues have JIRAs already; some don't. It
> would take me time I don't have to audit and line everything up for this
> email, please look before you file JIRAs for things mentioned below, should
> the urge strike you.
>
>
> *As a Go Open Source Project*As an open source project written in Go, the
> SDK is lagging on adopting Go Modules for Dependency Management and
> Versioning.
>
> Using Go Modules would ensure that what the Beam project
> infrastructure is testing is what users are getting. I'm very happy to
> elaborate on this, and have a bit I wrote about it two months ago on the
> topic[1]. But I loathe sending out plans for things that I don't have time
> to work on, so it's only coming to light now.
>
> The short points are:
> * Go is opinionated about versioning since Go 1.11, when Modules were
> introduced. They allow for reproducible builds with versioned deps,
> supported by the Go language tools.
> * Packages at version 1 or greater must not make breaking changes. We're
> not there with the SDK yet (certainly not a 2.11 product), so IMO the
> SDK should be considered v0.X
> * I don't think it's reasonable to move SDK languages in lockstep with the
> project. Eg. The Go language is considering adopting Generics, which may
> necessitate a Major Version Change to the SDK user surface as it's modified
> to support them. It's not reasonable to move all of Beam to a new version
> due to a single language surface.
>* This isn't an issue since it reads: the Go SDK version 

Re: New contributor to Beam

2019-04-17 Thread Chamikara Jayalath
Welcome Cyrus!

On Wed, Apr 17, 2019 at 1:59 PM Matthias Baetens 
wrote:

> Welcome! 
>
> On Wed, Apr 17, 2019, 20:59 Alan Myrvold  wrote:
>
>> Welcome, Cyrus!
>>
>> On Wed, Apr 17, 2019 at 12:49 PM Ahmet Altay  wrote:
>>
>>> Welcome!
>>>
>>> On Wed, Apr 17, 2019 at 12:26 PM Rose Nguyen 
>>> wrote:
>>>
 Welcome, Cyrus!!

 On Wed, Apr 17, 2019 at 11:58 AM Niklas Hansson <
 niklas.sven.hans...@gmail.com> wrote:

> Welcome :)
>
> Den ons 17 apr. 2019 kl 20:33 skrev Aizhamal Nurmamat kyzy <
> aizha...@google.com>:
>
>> Welcome Cyrus! We'd love so much to have better docs for Beam! Thank you!
>>
>>
>> On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
>> joanafil...@google.com> wrote:
>>
>>> Welcome Cyrus!
>>>
>>> On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver 
>>> wrote:
>>>
 Welcome!

 On Wed, Apr 17, 2019 at 10:32 AM Robert Burke 
 wrote:

> Welcome Cyrus! :D Yay better docs!
>
> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan <
> conne...@google.com> wrote:
>
>> Welcome Cyrus!!!
>>
>> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin <
>> mig...@google.com> wrote:
>>
>>> Welcome!
>>>
>>> --Mikhail
>>>
>>> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak <
>>> meliss...@google.com> wrote:
>>>

 Welcome Cyrus!


 On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré <
 j...@nanthrax.net> wrote:

> Welcome !
>
> Regards
> JB
>
> On 17/04/2019 16:05, Cyrus Maden wrote:
> > Hi all!
> >
> > My name's Cyrus and I'd like to start contributing to Beam.
> I'm a
> > technical writer so I'm particularly looking forward to
> contributing to
> > the Beam docs. Could someone add me as a contributor on JIRA
> so I can
> > create and assign tickets?
> >
> > My JIRA name is: *cyrusmaden*
> >
> > Excited to be a part of this community and to work with y'all!
> >
> > Best,
> > Cyrus
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
 --
 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com | +1650203

>>>
>>>

 --
 Rose Thị Nguyễn

>>>


Re: Python SDK timestamp precision

2019-04-17 Thread Kenneth Knowles
For Robert's benefit, I want to point out that my proposal is to support
femtosecond data, with femtosecond-scale windows, even if watermarks/event
timestamps/holds are only millisecond precision.

So the workaround (once I have time), for SQL and schema-based transforms,
will be to have a logical type that matches the Java and protobuf
definition of nanos (seconds-since-epoch + nanos-in-second) to preserve the
user's data, and then, when doing windowing, to insert the necessary
rounding somewhere in the SQL or schema layers.
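A minimal sketch of such a logical type, assuming a hypothetical `NanosTimestamp` name (this is not an actual Beam class): the protobuf-style seconds + nanos pair is preserved in the user's data, while windowing rounds down to millisecond precision.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NanosTimestamp:
    """Hypothetical logical type mirroring the protobuf Timestamp layout."""
    seconds: int  # seconds since epoch
    nanos: int    # nanoseconds within the second, 0 <= nanos < 1_000_000_000

    def to_millis_floor(self) -> int:
        # Round down for windowing/triggering; the full nanosecond value
        # stays intact in the user's data.
        return self.seconds * 1000 + self.nanos // 1_000_000

ts = NanosTimestamp(seconds=1_555_555_555, nanos=123_456_789)
print(ts.to_millis_floor())  # 1555555555123
```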

Kenn

On Wed, Apr 17, 2019 at 3:13 PM Robert Burke  wrote:

> +1 for plan B. Nanosecond precision on windowing seems... a little much
> for a system that's aggregating data over time. Even for processing, say,
> particle supercollider data, they'd get away with artificially increasing
> the granularity in batch settings.
>
> Now if they were streaming... they'd probably want femtoseconds anyway.
> The point is, we should see if users demand it before adding in the
> necessary work.
>
> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath 
> wrote:
>
>> +1 for plan B as well. I think it's important to make timestamp precision
>> consistent now without introducing surprising behaviors for existing users.
>> But we should move towards a higher granularity timestamp precision in the
>> long run to support use-cases that Beam users otherwise might miss out (on
>> a runner that supports such precision).
>>
>> - Cham
>>
>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik  wrote:
>>
>>> I also like Plan B because in the cross language case, the pipeline
>>> would not work since every party (Runners & SDKs) would have to be aware of
>>> the new beam:coder:windowed_value:v2 coder. Plan A has the property where
>>> if the SDK/Runner wasn't updated then it may start truncating the
>>> timestamps unexpectedly.
>>>
>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:
>>>
 Kenn, this discussion is about the precision of the timestamp in the
 user data. As you had mentioned, Runners need not have the same granularity
 of user data as long as they correctly round the timestamp to guarantee
 that triggers are executed correctly but the user data should have the same
 precision across SDKs otherwise user data timestamps will be truncated in
 cross language scenarios.

 Based on the systems that were listed, either microsecond or nanosecond
 would make sense. The issue with changing the precision is that all Beam
 runners except for possibly Beam Python on Dataflow are using millisecond
 precision since they are all using the same Java Runner windowing/trigger
 logic.

 Plan A: Swap precision to nanosecond
 1) Change the Python SDK to only expose millisecond precision
 timestamps (do now)
 2) Change the user data encoding to support nanosecond precision (do
 now)
 3) Swap runner libraries to be nanosecond precision aware updating all
 window/triggering logic (do later)
 4) Swap SDKs to expose nanosecond precision (do later)

 Plan B:
 1) Change the Python SDK to only expose millisecond precision
 timestamps and keep the data encoding as is (do now)
 (We could add greater precision later to plan B by creating a new
 version beam:coder:windowed_value:v2 which would be nanosecond and would
 require runners to correctly perform an internal conversions for
 windowing/triggering.)

 I think we should go with Plan B and when users request greater
 precision we can make that an explicit effort. What do people think?
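A rough sketch of what step 1 of Plan B could look like, assuming a hypothetical `MillisTimestamp` class (not the actual Beam Timestamp implementation): sub-millisecond input is truncated at construction, so the existing encoding never carries precision that another SDK would silently drop.

```python
class MillisTimestamp:
    """Hypothetical sketch: expose only millisecond precision."""

    def __init__(self, seconds: float = 0.0):
        # Truncate anything finer than a millisecond at construction time.
        self._millis = int(seconds * 1000)

    @property
    def seconds(self) -> float:
        return self._millis / 1000.0

t = MillisTimestamp(12.3456789)  # finer-than-millisecond input
print(t.seconds)  # 12.345 -- sub-millisecond digits are gone
```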



 On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels 
 wrote:

> Hi,
>
> Thanks for taking care of this issue in the Python SDK, Thomas!
>
> It would be nice to have a uniform precision for timestamps but, as
> Kenn
> pointed out, timestamps are extracted from systems that have different
> precision.
>
> To add to the list: Flink - milliseconds
>
> After all, it doesn't matter as long as there is sufficient precision
> and conversions are done correctly.
>
> I think we could improve the situation by at least adding a
> "milliseconds" constructor to the Python SDK's Timestamp.
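Such a constructor might look like the following sketch (names and internal representation are illustrative assumptions, not the actual apache_beam Timestamp API):

```python
class Timestamp:
    """Illustrative Timestamp storing microseconds internally."""

    def __init__(self, micros: int = 0):
        self.micros = micros

    @classmethod
    def of_millis(cls, millis: int) -> "Timestamp":
        # An explicit unit in the constructor name avoids ms/us confusion.
        return cls(micros=millis * 1000)

print(Timestamp.of_millis(1500).micros)  # 1500000
```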
>
> Cheers,
> Max
>
> On 17.04.19 04:13, Kenneth Knowles wrote:
> > I am not so sure this is a good idea. Here are some systems and
> their
> > precision:
> >
> > Arrow - microseconds
> > BigQuery - microseconds
> > New Java instant - nanoseconds
> > Firestore - microseconds
> > Protobuf - nanoseconds
> > Dataflow backend - microseconds
> > Postgresql - microseconds
> > Pubsub publish time - nanoseconds
> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
> > Cassandra - milliseconds
> >
> > IMO it is important to be able to treat any of these as a Beam
> > timestamp, even though they aren't all streaming. Who knows 

Re: Python SDK timestamp precision

2019-04-17 Thread Kenneth Knowles
I am talking about the precision of timestamps in user data. I do not
believe plan A or B address what I am saying. As a user, I have a kafka
stream of Avro records with a timestamp-micros field. I need to be able to:

 - tell Beam this is the event timestamp for the record
 - run my WindowFn against this field with original precision

Consider a SQL query that reads SELECT ... FROM stream GROUP BY
TUMBLE(stream.micro_timestamp, INTERVAL 100 microseconds).

 - stream.micro_timestamp has to be made into the event timestamp
 - the time interval is small and you must have the original data in order
to correctly assign the window
 - (I think both of these can be safely rounded under the hood)
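Kenn's point about needing the original precision can be sketched in plain Python (the helper names below are illustrative, not Beam APIs): with 100-microsecond tumbling windows, truncating the event timestamp to milliseconds before assignment can land the element in the wrong window, while rounding after assignment is harmless.

```python
# Hypothetical sketch: tumbling-window assignment at microsecond precision,
# with and without first truncating the timestamp to millisecond precision.

WINDOW_SIZE_MICROS = 100  # TUMBLE(..., INTERVAL 100 microseconds)

def assign_window(ts_micros: int, size_micros: int = WINDOW_SIZE_MICROS) -> int:
    """Return the start of the tumbling window containing ts_micros."""
    return ts_micros - (ts_micros % size_micros)

def truncate_to_millis(ts_micros: int) -> int:
    """Round down to millisecond precision, as a millisecond-only SDK would."""
    return (ts_micros // 1000) * 1000

ts = 1_555_520_000_123_456  # an example timestamp-micros value
print(assign_window(ts))                      # 1555520000123400
print(assign_window(truncate_to_millis(ts)))  # 1555520000123000 -- wrong window
```

The two results differ, which is the heart of the argument: the window function must see the original-precision data, even if the runner internally rounds afterwards.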

This is not just a SQL issue; the new schema-based transforms will hit the
same issues. It is just easier to email SQL. And the same idea applies in
pure Java, though forcing things to Joda time forces users to deal with
this. But see https://github.com/apache/beam/pull/8289 which I had to write
to work around this.

Some timestamp values that are observable to users and that we can restrict
more safely:

 - end time of a window
 - output of a timestamp combiner
 - firing time for an event time timer

Kenn

On Wed, Apr 17, 2019 at 2:26 PM Chamikara Jayalath 
wrote:

> +1 for plan B as well. I think it's important to make timestamp precision
> consistent now without introducing surprising behaviors for existing users.
> But we should move towards higher-granularity timestamp precision in the
> long run to support use cases that Beam users might otherwise miss out on
> (when running on a runner that supports such precision).
>
> - Cham
>
> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik  wrote:
>
>> I also like Plan B because in the cross language case, the pipeline would
>> not work since every party (Runners & SDKs) would have to be aware of the
>> new beam:coder:windowed_value:v2 coder. Plan A has the property where if
>> the SDK/Runner wasn't updated then it may start truncating the timestamps
>> unexpectedly.
>>
>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:
>>
>>> Kenn, this discussion is about the precision of the timestamp in the
>>> user data. As you had mentioned, Runners need not have the same granularity
>>> of user data as long as they correctly round the timestamp to guarantee
>>> that triggers are executed correctly but the user data should have the same
>>> precision across SDKs otherwise user data timestamps will be truncated in
>>> cross language scenarios.
>>>
>>> Based on the systems that were listed, either microsecond or nanosecond
>>> would make sense. The issue with changing the precision is that all Beam
>>> runners except for possibly Beam Python on Dataflow are using millisecond
>>> precision since they are all using the same Java Runner windowing/trigger
>>> logic.
>>>
>>> Plan A: Swap precision to nanosecond
>>> 1) Change the Python SDK to only expose millisecond precision timestamps
>>> (do now)
>>> 2) Change the user data encoding to support nanosecond precision (do now)
>>> 3) Swap runner libraries to be nanosecond precision aware updating all
>>> window/triggering logic (do later)
>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>
>>> Plan B:
>>> 1) Change the Python SDK to only expose millisecond precision timestamps
>>> and keep the data encoding as is (do now)
>>> (We could add greater precision later to plan B by creating a new
>>> version beam:coder:windowed_value:v2 which would be nanosecond and would
>>> require runners to correctly perform internal conversions for
>>> windowing/triggering.)
>>>
>>> I think we should go with Plan B and when users request greater
>>> precision we can make that an explicit effort. What do people think?
>>>
>>>
>>>
>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels 
>>> wrote:
>>>
 Hi,

 Thanks for taking care of this issue in the Python SDK, Thomas!

 It would be nice to have a uniform precision for timestamps but, as
 Kenn
 pointed out, timestamps are extracted from systems that have different
 precision.

 To add to the list: Flink - milliseconds

 After all, it doesn't matter as long as there is sufficient precision
 and conversions are done correctly.

 I think we could improve the situation by at least adding a
 "milliseconds" constructor to the Python SDK's Timestamp.

 Cheers,
 Max

 On 17.04.19 04:13, Kenneth Knowles wrote:
 > I am not so sure this is a good idea. Here are some systems and their
 > precision:
 >
 > Arrow - microseconds
 > BigQuery - microseconds
 > New Java instant - nanoseconds
 > Firestore - microseconds
 > Protobuf - nanoseconds
 > Dataflow backend - microseconds
 > Postgresql - microseconds
 > Pubsub publish time - nanoseconds
 > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
 > Cassandra - milliseconds
 >
 > IMO it is important to be able to treat any of 

Re: What is preferred way to label Jira issues intended for new contributors?

2019-04-17 Thread Valentyn Tymofieiev
I incorrectly assumed that the label set was restricted somewhere in Jira
settings, but in fact one can tag JIRAs with new labels. It wasn't clear to
me which of the labels I should use to make the issue surface in a list of
beginner issues, so I thought that reducing the list may help. The filter
query in  https://s.apache.org/beam-starter-tasks answers my question.

Thanks.


On Wed, Apr 17, 2019 at 2:58 PM Kenneth Knowles  wrote:

> The only reference I know of is https://s.apache.org/beam-starter-tasks
> which includes even more tags. What is the goal of reducing the list? And
> how would you maintain it?
>
> Kenn
>
> On Wed, Apr 17, 2019 at 2:42 PM Valentyn Tymofieiev 
> wrote:
>
>> I am seeing at least 4 labels in JIRA that could apply when tagging
>> issues for someone getting started on Beam: beginner, easyfix,
>> newbie, starter.
>>
>> Are they materially different? Is it documented somewhere? If not, should
>> we perhaps reduce this list?
>>
>> Thanks,
>> Valentyn
>>
>


Re: Python SDK timestamp precision

2019-04-17 Thread Robert Burke
+1 for plan B. Nanosecond precision on windowing seems... a little much
for a system that's aggregating data over time. Even for processing, say,
particle supercollider data, they could get away with artificially increasing
the granularity in batch settings.

Now if they were streaming... they'd probably want femtoseconds anyway.
The point is, we should see if users demand it before adding in the
necessary work.

On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath 
wrote:

> +1 for plan B as well. I think it's important to make timestamp precision
> consistent now without introducing surprising behaviors for existing users.
> But we should move towards higher-granularity timestamp precision in the
> long run to support use cases that Beam users might otherwise miss out on
> (when running on a runner that supports such precision).
>
> - Cham
>
> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik  wrote:
>
>> I also like Plan B because in the cross language case, the pipeline would
>> not work since every party (Runners & SDKs) would have to be aware of the
>> new beam:coder:windowed_value:v2 coder. Plan A has the property where if
>> the SDK/Runner wasn't updated then it may start truncating the timestamps
>> unexpectedly.
>>
>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:
>>
>>> Kenn, this discussion is about the precision of the timestamp in the
>>> user data. As you had mentioned, Runners need not have the same granularity
>>> of user data as long as they correctly round the timestamp to guarantee
>>> that triggers are executed correctly but the user data should have the same
>>> precision across SDKs otherwise user data timestamps will be truncated in
>>> cross language scenarios.
>>>
>>> Based on the systems that were listed, either microsecond or nanosecond
>>> would make sense. The issue with changing the precision is that all Beam
>>> runners except for possibly Beam Python on Dataflow are using millisecond
>>> precision since they are all using the same Java Runner windowing/trigger
>>> logic.
>>>
>>> Plan A: Swap precision to nanosecond
>>> 1) Change the Python SDK to only expose millisecond precision timestamps
>>> (do now)
>>> 2) Change the user data encoding to support nanosecond precision (do now)
>>> 3) Swap runner libraries to be nanosecond precision aware updating all
>>> window/triggering logic (do later)
>>> 4) Swap SDKs to expose nanosecond precision (do later)
>>>
>>> Plan B:
>>> 1) Change the Python SDK to only expose millisecond precision timestamps
>>> and keep the data encoding as is (do now)
>>> (We could add greater precision later to plan B by creating a new
>>> version beam:coder:windowed_value:v2 which would be nanosecond and would
>>> require runners to correctly perform internal conversions for
>>> windowing/triggering.)
>>>
>>> I think we should go with Plan B and when users request greater
>>> precision we can make that an explicit effort. What do people think?
>>>
>>>
>>>
>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels 
>>> wrote:
>>>
 Hi,

 Thanks for taking care of this issue in the Python SDK, Thomas!

 It would be nice to have a uniform precision for timestamps but, as
 Kenn
 pointed out, timestamps are extracted from systems that have different
 precision.

 To add to the list: Flink - milliseconds

 After all, it doesn't matter as long as there is sufficient precision
 and conversions are done correctly.

 I think we could improve the situation by at least adding a
 "milliseconds" constructor to the Python SDK's Timestamp.

 Cheers,
 Max

 On 17.04.19 04:13, Kenneth Knowles wrote:
 > I am not so sure this is a good idea. Here are some systems and their
 > precision:
 >
 > Arrow - microseconds
 > BigQuery - microseconds
 > New Java instant - nanoseconds
 > Firestore - microseconds
 > Protobuf - nanoseconds
 > Dataflow backend - microseconds
 > Postgresql - microseconds
 > Pubsub publish time - nanoseconds
 > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
 > Cassandra - milliseconds
 >
 > IMO it is important to be able to treat any of these as a Beam
 > timestamp, even though they aren't all streaming. Who knows when we
 > might be ingesting a streamed changelog, or using them for
 reprocessing
 > an archived stream. I think for this purpose we either should
 > standardize on nanoseconds or make the runner's resolution
 independent
 > of the data representation.
 >
 > I've had some offline conversations about this. I think we can have
 > higher-than-runner precision in the user data, and allow WindowFns
 and
 > DoFns to operate on this higher-than-runner precision data, and still
 > have consistent watermark treatment. Watermarks are just bounds,
 after all.
 >
 > Kenn
 >
 > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise 

Re: What is preferred way to label Jira issues intended for new contributors?

2019-04-17 Thread Kenneth Knowles
The only reference I know of is https://s.apache.org/beam-starter-tasks
which includes even more tags. What is the goal of reducing the list? And
how would you maintain it?

Kenn

On Wed, Apr 17, 2019 at 2:42 PM Valentyn Tymofieiev 
wrote:

> I am seeing at least 4 labels in JIRA that could apply when tagging
> issues for someone getting started on Beam: beginner, easyfix,
> newbie, starter.
>
> Are they materially different? Is it documented somewhere? If not, should
> we perhaps reduce this list?
>
> Thanks,
> Valentyn
>


What is preferred way to label Jira issues intended for new contributors?

2019-04-17 Thread Valentyn Tymofieiev
I am seeing at least 4 labels in JIRA that could apply when tagging
issues for someone getting started on Beam: beginner, easyfix,
newbie, starter.

Are they materially different? Is it documented somewhere? If not, should
we perhaps reduce this list?

Thanks,
Valentyn


Re: Python SDK timestamp precision

2019-04-17 Thread Chamikara Jayalath
+1 for plan B as well. I think it's important to make timestamp precision
consistent now without introducing surprising behaviors for existing users.
But we should move towards higher-granularity timestamp precision in the
long run to support use cases that Beam users might otherwise miss out on
(when running on a runner that supports such precision).

- Cham

On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik  wrote:

> I also like Plan B because in the cross language case, the pipeline would
> not work since every party (Runners & SDKs) would have to be aware of the
> new beam:coder:windowed_value:v2 coder. Plan A has the property where if
> the SDK/Runner wasn't updated then it may start truncating the timestamps
> unexpectedly.
>
> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:
>
>> Kenn, this discussion is about the precision of the timestamp in the user
>> data. As you had mentioned, Runners need not have the same granularity of
>> user data as long as they correctly round the timestamp to guarantee that
>> triggers are executed correctly but the user data should have the same
>> precision across SDKs otherwise user data timestamps will be truncated in
>> cross language scenarios.
>>
>> Based on the systems that were listed, either microsecond or nanosecond
>> would make sense. The issue with changing the precision is that all Beam
>> runners except for possibly Beam Python on Dataflow are using millisecond
>> precision since they are all using the same Java Runner windowing/trigger
>> logic.
>>
>> Plan A: Swap precision to nanosecond
>> 1) Change the Python SDK to only expose millisecond precision timestamps
>> (do now)
>> 2) Change the user data encoding to support nanosecond precision (do now)
>> 3) Swap runner libraries to be nanosecond precision aware updating all
>> window/triggering logic (do later)
>> 4) Swap SDKs to expose nanosecond precision (do later)
>>
>> Plan B:
>> 1) Change the Python SDK to only expose millisecond precision timestamps
>> and keep the data encoding as is (do now)
>> (We could add greater precision later to plan B by creating a new version
>> beam:coder:windowed_value:v2 which would be nanosecond and would require
>> runners to correctly perform internal conversions for
>> windowing/triggering.)
>>
>> I think we should go with Plan B and when users request greater precision
>> we can make that an explicit effort. What do people think?
>>
>>
>>
>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels 
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>>
>>> It would be nice to have a uniform precision for timestamps but, as Kenn
>>> pointed out, timestamps are extracted from systems that have different
>>> precision.
>>>
>>> To add to the list: Flink - milliseconds
>>>
>>> After all, it doesn't matter as long as there is sufficient precision
>>> and conversions are done correctly.
>>>
>>> I think we could improve the situation by at least adding a
>>> "milliseconds" constructor to the Python SDK's Timestamp.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>>> > I am not so sure this is a good idea. Here are some systems and their
>>> > precision:
>>> >
>>> > Arrow - microseconds
>>> > BigQuery - microseconds
>>> > New Java instant - nanoseconds
>>> > Firestore - microseconds
>>> > Protobuf - nanoseconds
>>> > Dataflow backend - microseconds
>>> > Postgresql - microseconds
>>> > Pubsub publish time - nanoseconds
>>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>>> > Cassandra - milliseconds
>>> >
>>> > IMO it is important to be able to treat any of these as a Beam
>>> > timestamp, even though they aren't all streaming. Who knows when we
>>> > might be ingesting a streamed changelog, or using them for
>>> reprocessing
>>> > an archived stream. I think for this purpose we either should
>>> > standardize on nanoseconds or make the runner's resolution independent
>>> > of the data representation.
>>> >
>>> > I've had some offline conversations about this. I think we can have
>>> > higher-than-runner precision in the user data, and allow WindowFns and
>>> > DoFns to operate on this higher-than-runner precision data, and still
>>> > have consistent watermark treatment. Watermarks are just bounds, after
>>> all.
>>> >
>>> > Kenn
>>> >
>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise wrote:
>>> >
>>> > The Python SDK currently uses timestamps in microsecond resolution
>>> > while Java SDK, as most would probably expect, uses milliseconds.
>>> >
>>> > This causes a few difficulties with portability (Python coders need
>>> > to convert to millis for WindowedValue and Timers, which is related
>>> > to a bug I'm looking into:
>>> >
>>> > https://issues.apache.org/jira/browse/BEAM-7035
>>> >
>>> > As Luke pointed out, the issue was previously discussed:
>>> >
>>> > https://issues.apache.org/jira/browse/BEAM-1524
>>> >
>>> > I'm not 

Re: Python SDK timestamp precision

2019-04-17 Thread Lukasz Cwik
I also like Plan B because in the cross-language case, the pipeline would
not work since every party (Runners & SDKs) would have to be aware of the
new beam:coder:windowed_value:v2 coder. Plan A has the property where if
the SDK/Runner wasn't updated then it may start truncating the timestamps
unexpectedly.
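The coder-version concern is easier to see when you look at the wire format. The sketch below is modeled on Beam's Java InstantCoder, which encodes shifted milliseconds so that byte order matches timestamp order; the helper names are illustrative, not the real coder API, and the exact windowed-value layout is richer than this:

```python
import struct

def encode_event_timestamp_millis(millis: int) -> bytes:
    """Sketch of a v1-style timestamp encoding: 8-byte big-endian millis
    since epoch, shifted out of the signed range so that unsigned byte
    comparison matches timestamp comparison. Millisecond precision is
    baked into the wire format itself."""
    return struct.pack(">Q", (millis + 2**63) % 2**64)

def decode_event_timestamp_millis(data: bytes) -> int:
    (shifted,) = struct.unpack(">Q", data)
    return shifted - 2**63

ms = 1_555_520_000_123
assert decode_event_timestamp_millis(encode_event_timestamp_millis(ms)) == ms
```

A hypothetical beam:coder:windowed_value:v2 would widen this field (e.g., to nanoseconds), which is why every runner and SDK has to agree on the coder version before it can be used at all.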

On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik  wrote:

> Kenn, this discussion is about the precision of the timestamp in the user
> data. As you had mentioned, Runners need not have the same granularity of
> user data as long as they correctly round the timestamp to guarantee that
> triggers are executed correctly but the user data should have the same
> precision across SDKs otherwise user data timestamps will be truncated in
> cross language scenarios.
>
> Based on the systems that were listed, either microsecond or nanosecond
> would make sense. The issue with changing the precision is that all Beam
> runners except for possibly Beam Python on Dataflow are using millisecond
> precision since they are all using the same Java Runner windowing/trigger
> logic.
>
> Plan A: Swap precision to nanosecond
> 1) Change the Python SDK to only expose millisecond precision timestamps
> (do now)
> 2) Change the user data encoding to support nanosecond precision (do now)
> 3) Swap runner libraries to be nanosecond precision aware updating all
> window/triggering logic (do later)
> 4) Swap SDKs to expose nanosecond precision (do later)
>
> Plan B:
> 1) Change the Python SDK to only expose millisecond precision timestamps
> and keep the data encoding as is (do now)
> (We could add greater precision later to plan B by creating a new version
> beam:coder:windowed_value:v2 which would be nanosecond and would require
> runners to correctly perform internal conversions for
> windowing/triggering.)
>
> I think we should go with Plan B and when users request greater precision
> we can make that an explicit effort. What do people think?
>
>
>
> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels  wrote:
>
>> Hi,
>>
>> Thanks for taking care of this issue in the Python SDK, Thomas!
>>
>> It would be nice to have a uniform precision for timestamps but, as Kenn
>> pointed out, timestamps are extracted from systems that have different
>> precision.
>>
>> To add to the list: Flink - milliseconds
>>
>> After all, it doesn't matter as long as there is sufficient precision
>> and conversions are done correctly.
>>
>> I think we could improve the situation by at least adding a
>> "milliseconds" constructor to the Python SDK's Timestamp.
>>
>> Cheers,
>> Max
>>
>> On 17.04.19 04:13, Kenneth Knowles wrote:
>> > I am not so sure this is a good idea. Here are some systems and their
>> > precision:
>> >
>> > Arrow - microseconds
>> > BigQuery - microseconds
>> > New Java instant - nanoseconds
>> > Firestore - microseconds
>> > Protobuf - nanoseconds
>> > Dataflow backend - microseconds
>> > Postgresql - microseconds
>> > Pubsub publish time - nanoseconds
>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>> > Cassandra - milliseconds
>> >
>> > IMO it is important to be able to treat any of these as a Beam
>> > timestamp, even though they aren't all streaming. Who knows when we
>> > might be ingesting a streamed changelog, or using them for reprocessing
>> > an archived stream. I think for this purpose we either should
>> > standardize on nanoseconds or make the runner's resolution independent
>> > of the data representation.
>> >
>> > I've had some offline conversations about this. I think we can have
>> > higher-than-runner precision in the user data, and allow WindowFns and
>> > DoFns to operate on this higher-than-runner precision data, and still
>> > have consistent watermark treatment. Watermarks are just bounds, after
>> all.
>> >
>> > Kenn
>> >
>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise wrote:
>> >
>> > The Python SDK currently uses timestamps in microsecond resolution
>> > while Java SDK, as most would probably expect, uses milliseconds.
>> >
>> > This causes a few difficulties with portability (Python coders need
>> > to convert to millis for WindowedValue and Timers, which is related
>> > to a bug I'm looking into:
>> >
>> > https://issues.apache.org/jira/browse/BEAM-7035
>> >
>> > As Luke pointed out, the issue was previously discussed:
>> >
>> > https://issues.apache.org/jira/browse/BEAM-1524
>> >
>> > I'm not privy to the reasons why we decided to go with micros in the
>> > first place, but would it be too big of a change or impractical for
>> > other reasons to switch Python SDK to millis before it gets more
>> users?
>> >
>> > Thanks,
>> > Thomas
>> >
>>
>


Re: Python SDK timestamp precision

2019-04-17 Thread Lukasz Cwik
Kenn, this discussion is about the precision of the timestamp in the user
data. As you mentioned, runners need not have the same granularity as the
user data as long as they correctly round the timestamp to guarantee that
triggers are executed correctly, but the user data should have the same
precision across SDKs; otherwise user data timestamps will be truncated in
cross-language scenarios.

Based on the systems that were listed, either microsecond or nanosecond
would make sense. The issue with changing the precision is that all Beam
runners except for possibly Beam Python on Dataflow are using millisecond
precision since they are all using the same Java Runner windowing/trigger
logic.

Plan A: Swap precision to nanosecond
1) Change the Python SDK to only expose millisecond precision timestamps
(do now)
2) Change the user data encoding to support nanosecond precision (do now)
3) Swap runner libraries to be nanosecond precision aware updating all
window/triggering logic (do later)
4) Swap SDKs to expose nanosecond precision (do later)

Plan B:
1) Change the Python SDK to only expose millisecond precision timestamps
and keep the data encoding as is (do now)
(We could add greater precision later to plan B by creating a new version
beam:coder:windowed_value:v2 which would be nanosecond and would require
runners to correctly perform internal conversions for
windowing/triggering.)

I think we should go with Plan B and when users request greater precision
we can make that an explicit effort. What do people think?
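Plan B's first step (together with the "milliseconds" constructor Max suggested upthread) could look roughly like this sketch. The `of_millis`/`to_millis` names are hypothetical, not the actual Beam API; floor division is used so truncation rounds toward negative infinity even for negative timestamps:

```python
# Illustrative sketch of Plan B step 1: a Timestamp that stores micros
# internally (as the Python SDK does today) but only exposes millisecond
# precision at the API surface.

class Timestamp:
    def __init__(self, seconds: int = 0, micros: int = 0):
        self._micros = seconds * 1_000_000 + micros

    @classmethod
    def of_millis(cls, millis: int) -> "Timestamp":
        # Convenience constructor for millisecond inputs (hypothetical name).
        return cls(micros=millis * 1000)

    def to_millis(self) -> int:
        # Floor division rounds toward negative infinity, so a truncated
        # timestamp never moves past a watermark it was behind.
        return self._micros // 1000

ts = Timestamp(seconds=1, micros=2500)       # 1.0025 s
print(ts.to_millis())                        # 1002
print(Timestamp.of_millis(-1).to_millis())   # -1
```

Keeping micros internally while exposing millis leaves the door open for a later coder-version bump without another SDK-surface change.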



On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels  wrote:

> Hi,
>
> Thanks for taking care of this issue in the Python SDK, Thomas!
>
> It would be nice to have a uniform precision for timestamps but, as Kenn
> pointed out, timestamps are extracted from systems that have different
> precision.
>
> To add to the list: Flink - milliseconds
>
> After all, it doesn't matter as long as there is sufficient precision
> and conversions are done correctly.
>
> I think we could improve the situation by at least adding a
> "milliseconds" constructor to the Python SDK's Timestamp.
>
> Cheers,
> Max
>
> On 17.04.19 04:13, Kenneth Knowles wrote:
> > I am not so sure this is a good idea. Here are some systems and their
> > precision:
> >
> > Arrow - microseconds
> > BigQuery - microseconds
> > New Java instant - nanoseconds
> > Firestore - microseconds
> > Protobuf - nanoseconds
> > Dataflow backend - microseconds
> > Postgresql - microseconds
> > Pubsub publish time - nanoseconds
> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
> > Cassandra - milliseconds
> >
> > IMO it is important to be able to treat any of these as a Beam
> > timestamp, even though they aren't all streaming. Who knows when we
> > might be ingesting a streamed changelog, or using them for reprocessing
> > an archived stream. I think for this purpose we either should
> > standardize on nanoseconds or make the runner's resolution independent
> > of the data representation.
> >
> > I've had some offline conversations about this. I think we can have
> > higher-than-runner precision in the user data, and allow WindowFns and
> > DoFns to operate on this higher-than-runner precision data, and still
> > have consistent watermark treatment. Watermarks are just bounds, after
> all.
> >
> > Kenn
> >
> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise wrote:
> >
> > The Python SDK currently uses timestamps in microsecond resolution
> > while Java SDK, as most would probably expect, uses milliseconds.
> >
> > This causes a few difficulties with portability (Python coders need
> > to convert to millis for WindowedValue and Timers, which is related
> > to a bug I'm looking into:
> >
> > https://issues.apache.org/jira/browse/BEAM-7035
> >
> > As Luke pointed out, the issue was previously discussed:
> >
> > https://issues.apache.org/jira/browse/BEAM-1524
> >
> > I'm not privy to the reasons why we decided to go with micros in the
> > first place, but would it be too big of a change or impractical for
> > other reasons to switch Python SDK to millis before it gets more
> users?
> >
> > Thanks,
> > Thomas
> >
>


Re: New contributor to Beam

2019-04-17 Thread Alan Myrvold
Welcome, Cyrus!

On Wed, Apr 17, 2019 at 12:49 PM Ahmet Altay  wrote:

> Welcome!
>
> On Wed, Apr 17, 2019 at 12:26 PM Rose Nguyen  wrote:
>
>> Welcome, Cyrus!!
>>
>> On Wed, Apr 17, 2019 at 11:58 AM Niklas Hansson <
>> niklas.sven.hans...@gmail.com> wrote:
>>
>>> Welcome :)
>>>
>>> Den ons 17 apr. 2019 kl 20:33 skrev Aizhamal Nurmamat kyzy <
>>> aizha...@google.com>:
>>>
 Welcome Cyrus! We'd love so much to have better docs for Beam. Thank you!


 On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
 joanafil...@google.com> wrote:

> Welcome Cyrus!
>
> On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver 
> wrote:
>
>> Welcome!
>>
>> On Wed, Apr 17, 2019 at 10:32 AM Robert Burke 
>> wrote:
>>
>>> Welcome Cyrus! :D Yay better docs!
>>>
>>> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Welcome Cyrus!!!

 On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin <
 mig...@google.com> wrote:

> Welcome!
>
> --Mikhail
>
> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak <
> meliss...@google.com> wrote:
>
>>
>> Welcome Cyrus!
>>
>>
>> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré <
>> j...@nanthrax.net> wrote:
>>
>>> Welcome !
>>>
>>> Regards
>>> JB
>>>
>>> On 17/04/2019 16:05, Cyrus Maden wrote:
>>> > Hi all!
>>> >
>>> > My name's Cyrus and I'd like to start contributing to Beam.
>>> I'm a
>>> > technical writer so I'm particularly looking forward to
>>> contributing to
>>> > the Beam docs. Could someone add me as a contributor on JIRA
>>> so I can
>>> > create and assign tickets?
>>> >
>>> > My JIRA name is: *cyrusmaden*
>>> > Excited to be a part of this community and to work with ya'll!
>>> >
>>> > Best,
>>> > Cyrus
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>> --
>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com | +1650203
>>
>
>
>>
>> --
>> Rose Thị Nguyễn
>>
>


CassandraIO breakage

2019-04-17 Thread Reuven Lax
Did something break with CassandraIO? It no longer seems to compile.


Re: New contributor to Beam

2019-04-17 Thread Ahmet Altay
Welcome!

On Wed, Apr 17, 2019 at 12:26 PM Rose Nguyen  wrote:

> Welcome, Cyrus!!
>
> On Wed, Apr 17, 2019 at 11:58 AM Niklas Hansson <
> niklas.sven.hans...@gmail.com> wrote:
>
>> Welcome :)
>>
>> Den ons 17 apr. 2019 kl 20:33 skrev Aizhamal Nurmamat kyzy <
>> aizha...@google.com>:
>>
>>> Welcome Cyrus! We'd love so much to have better docs for Beam. Thank you!
>>>
>>>
>>> On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
>>> joanafil...@google.com> wrote:
>>>
 Welcome Cyrus!

 On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver 
 wrote:

> Welcome!
>
> On Wed, Apr 17, 2019 at 10:32 AM Robert Burke 
> wrote:
>
>> Welcome Cyrus! :D Yay better docs!
>>
>> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Welcome Cyrus!!!
>>>
>>> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin <
>>> mig...@google.com> wrote:
>>>
 Welcome!

 --Mikhail

 On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak <
 meliss...@google.com> wrote:

>
> Welcome Cyrus!
>
>
> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
>
>> Welcome !
>>
>> Regards
>> JB
>>
>> On 17/04/2019 16:05, Cyrus Maden wrote:
>> > Hi all!
>> >
>> > My name's Cyrus and I'd like to start contributing to Beam. I'm
>> a
>> > technical writer so I'm particularly looking forward to
>> contributing to
>> > the Beam docs. Could someone add me as a contributor on JIRA so
>> I can
>> > create and assign tickets?
>> >
>> > My JIRA name is: *cyrusmaden*
>> > Excited to be a part of this community and to work with ya'll!
>> >
>> > Best,
>> > Cyrus
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
> --
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>


>
> --
> Rose Thị Nguyễn
>


Re: New contributor to Beam

2019-04-17 Thread Rose Nguyen
Welcome, Cyrus!!

On Wed, Apr 17, 2019 at 11:58 AM Niklas Hansson <
niklas.sven.hans...@gmail.com> wrote:

> Welcome :)
>
> Den ons 17 apr. 2019 kl 20:33 skrev Aizhamal Nurmamat kyzy <
> aizha...@google.com>:
>
>> Welcome Cyrus! We'd love so much to have better docs for Beam. Thank you!
>>
>>
>> On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
>> joanafil...@google.com> wrote:
>>
>>> Welcome Cyrus!
>>>
>>> On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver 
>>> wrote:
>>>
 Welcome!

 On Wed, Apr 17, 2019 at 10:32 AM Robert Burke 
 wrote:

> Welcome Cyrus! :D Yay better docs!
>
> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
> wrote:
>
>> Welcome Cyrus!!!
>>
>> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Welcome!
>>>
>>> --Mikhail
>>>
>>> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak <
>>> meliss...@google.com> wrote:
>>>

 Welcome Cyrus!


 On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré <
 j...@nanthrax.net> wrote:

> Welcome !
>
> Regards
> JB
>
> On 17/04/2019 16:05, Cyrus Maden wrote:
> > Hi all!
> >
> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
> > technical writer so I'm particularly looking forward to
> contributing to
> > the Beam docs. Could someone add me as a contributor on JIRA so
> I can
> > create and assign tickets?
> >
> > My JIRA name is: *cyrusmaden*
> > Excited to be a part of this community and to work with ya'll!
> >
> > Best,
> > Cyrus
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
 --
 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com | +1650203

>>>
>>>

-- 
Rose Thị Nguyễn


Re: New contributor to Beam

2019-04-17 Thread Niklas Hansson
Welcome :)

Den ons 17 apr. 2019 kl 20:33 skrev Aizhamal Nurmamat kyzy <
aizha...@google.com>:

> Welcome Cyrus! We'd love so much to have better docs for Beam. Thank you!
>
>
> On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
> joanafil...@google.com> wrote:
>
>> Welcome Cyrus!
>>
>> On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver  wrote:
>>
>>> Welcome!
>>>
>>> On Wed, Apr 17, 2019 at 10:32 AM Robert Burke 
>>> wrote:
>>>
 Welcome Cyrus! :D Yay better docs!

 On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
 wrote:

> Welcome Cyrus!!!
>
> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
> wrote:
>
>> Welcome!
>>
>> --Mikhail
>>
>> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak <
>> meliss...@google.com> wrote:
>>
>>>
>>> Welcome Cyrus!
>>>
>>>
>>> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>>
 Welcome !

 Regards
 JB

 On 17/04/2019 16:05, Cyrus Maden wrote:
 > Hi all!
 >
 > My name's Cyrus and I'd like to start contributing to Beam. I'm a
 > technical writer so I'm particularly looking forward to
 contributing to
 > the Beam docs. Could someone add me as a contributor on JIRA so I
 can
 > create and assign tickets?
 >
 > My JIRA name is: *cyrusmaden*
 > *
 > *
 > Excited to be a part of this community and to work with ya'll!
 >
 > Best,
 > Cyrus

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>> --
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>> | +1650203
>>>
>>
>>


Re: New contributor to Beam

2019-04-17 Thread Aizhamal Nurmamat kyzy
Welcome Cyrus! We'd love so much to have better docs for Beam. Thank you!


On Wed, Apr 17, 2019 at 11:28 AM Joana Filipa Bernardo Carrasqueira <
joanafil...@google.com> wrote:

> Welcome Cyrus!
>
> On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver  wrote:
>
>> Welcome!
>>
>> On Wed, Apr 17, 2019 at 10:32 AM Robert Burke  wrote:
>>
>>> Welcome Cyrus! :D Yay better docs!
>>>
>>> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
>>> wrote:
>>>
 Welcome Cyrus!!!

 On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
 wrote:

> Welcome!
>
> --Mikhail
>
> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
> wrote:
>
>>
>> Welcome Cyrus!
>>
>>
>> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Welcome !
>>>
>>> Regards
>>> JB
>>>
>>> On 17/04/2019 16:05, Cyrus Maden wrote:
>>> > Hi all!
>>> >
>>> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
>>> > technical writer so I'm particularly looking forward to
>>> contributing to
>>> > the Beam docs. Could someone add me as a contributor on JIRA so I
>>> can
>>> > create and assign tickets?
>>> >
>>> > My JIRA name is: *cyrusmaden*
>>> > *
>>> > *
>>> > Excited to be a part of this community and to work with ya'll!
>>> >
>>> > Best,
>>> > Cyrus
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>> --
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> | +1650203
>>
>
>


Re: New contributor to Beam

2019-04-17 Thread Joana Filipa Bernardo Carrasqueira
Welcome Cyrus!

On Wed, Apr 17, 2019 at 11:18 AM Kyle Weaver  wrote:

> Welcome!
>
> On Wed, Apr 17, 2019 at 10:32 AM Robert Burke  wrote:
>
>> Welcome Cyrus! :D Yay better docs!
>>
>> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
>> wrote:
>>
>>> Welcome Cyrus!!!
>>>
>>> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
>>> wrote:
>>>
 Welcome!

 --Mikhail

 On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
 wrote:

>
> Welcome Cyrus!
>
>
> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
> wrote:
>
>> Welcome !
>>
>> Regards
>> JB
>>
>> On 17/04/2019 16:05, Cyrus Maden wrote:
>> > Hi all!
>> >
>> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
>> > technical writer so I'm particularly looking forward to
>> contributing to
>> > the Beam docs. Could someone add me as a contributor on JIRA so I
>> can
>> > create and assign tickets?
>> >
>> > My JIRA name is: *cyrusmaden*
>> > *
>> > *
>> > Excited to be a part of this community and to work with ya'll!
>> >
>> > Best,
>> > Cyrus
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
> --
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>


Re: New contributor to Beam

2019-04-17 Thread Kyle Weaver
Welcome!

On Wed, Apr 17, 2019 at 10:32 AM Robert Burke  wrote:

> Welcome Cyrus! :D Yay better docs!
>
> On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
> wrote:
>
>> Welcome Cyrus!!!
>>
>> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Welcome!
>>>
>>> --Mikhail
>>>
>>> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
>>> wrote:
>>>

 Welcome Cyrus!


 On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
 wrote:

> Welcome !
>
> Regards
> JB
>
> On 17/04/2019 16:05, Cyrus Maden wrote:
> > Hi all!
> >
> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
> > technical writer so I'm particularly looking forward to contributing
> to
> > the Beam docs. Could someone add me as a contributor on JIRA so I can
> > create and assign tickets?
> >
> > My JIRA name is: *cyrusmaden*
> > *
> > *
> > Excited to be a part of this community and to work with ya'll!
> >
> > Best,
> > Cyrus
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
 --
Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com |
+1650203


Re: New contributor to Beam

2019-04-17 Thread Robert Burke
Welcome Cyrus! :D Yay better docs!

On Wed, 17 Apr 2019 at 10:20, Connell O'Callaghan 
wrote:

> Welcome Cyrus!!!
>
> On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
> wrote:
>
>> Welcome!
>>
>> --Mikhail
>>
>> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
>> wrote:
>>
>>>
>>> Welcome Cyrus!
>>>
>>>
>>> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Welcome !

 Regards
 JB

 On 17/04/2019 16:05, Cyrus Maden wrote:
 > Hi all!
 >
 > My name's Cyrus and I'd like to start contributing to Beam. I'm a
 > technical writer so I'm particularly looking forward to contributing
 to
 > the Beam docs. Could someone add me as a contributor on JIRA so I can
 > create and assign tickets?
 >
 > My JIRA name is: *cyrusmaden*
 > *
 > *
 > Excited to be a part of this community and to work with ya'll!
 >
 > Best,
 > Cyrus

 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com

>>>


Re: New contributor to Beam

2019-04-17 Thread Connell O'Callaghan
Welcome Cyrus!!!

On Wed, Apr 17, 2019 at 10:11 AM Mikhail Gryzykhin 
wrote:

> Welcome!
>
> --Mikhail
>
> On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
> wrote:
>
>>
>> Welcome Cyrus!
>>
>>
>> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Welcome !
>>>
>>> Regards
>>> JB
>>>
>>> On 17/04/2019 16:05, Cyrus Maden wrote:
>>> > Hi all!
>>> >
>>> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
>>> > technical writer so I'm particularly looking forward to contributing to
>>> > the Beam docs. Could someone add me as a contributor on JIRA so I can
>>> > create and assign tickets?
>>> >
>>> > My JIRA name is: *cyrusmaden*
>>> > *
>>> > *
>>> > Excited to be a part of this community and to work with ya'll!
>>> >
>>> > Best,
>>> > Cyrus
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>


Re: New contributor to Beam

2019-04-17 Thread Mikhail Gryzykhin
Welcome!

--Mikhail

On Wed, Apr 17, 2019 at 9:58 AM Melissa Pashniak 
wrote:

>
> Welcome Cyrus!
>
>
> On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
> wrote:
>
>> Welcome !
>>
>> Regards
>> JB
>>
>> On 17/04/2019 16:05, Cyrus Maden wrote:
>> > Hi all!
>> >
>> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
>> > technical writer so I'm particularly looking forward to contributing to
>> > the Beam docs. Could someone add me as a contributor on JIRA so I can
>> > create and assign tickets?
>> >
>> > My JIRA name is: *cyrusmaden*
>> > *
>> > *
>> > Excited to be a part of this community and to work with ya'll!
>> >
>> > Best,
>> > Cyrus
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: New contributor to Beam

2019-04-17 Thread Melissa Pashniak
Welcome Cyrus!


On Wed, Apr 17, 2019 at 7:31 AM Jean-Baptiste Onofré 
wrote:

> Welcome !
>
> Regards
> JB
>
> On 17/04/2019 16:05, Cyrus Maden wrote:
> > Hi all!
> >
> > My name's Cyrus and I'd like to start contributing to Beam. I'm a
> > technical writer so I'm particularly looking forward to contributing to
> > the Beam docs. Could someone add me as a contributor on JIRA so I can
> > create and assign tickets?
> >
> > My JIRA name is: *cyrusmaden*
> > *
> > *
> > Excited to be a part of this community and to work with ya'll!
> >
> > Best,
> > Cyrus
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Wait on JdbcIO write completion

2019-04-17 Thread Jean-Baptiste Onofré
I second Alexey (and thanks Alexey ;)).

I also started similar improvements in other IOs (PRs will come soon).

Regards
JB

On 17/04/2019 17:31, Alexey Romanenko wrote:
> Hi Jonathan,
> 
> I just wanted to let you know that this feature [1] was implemented and,
> finally, merged into master, so it should be included in the next Beam
> 2.13 release.
> 
> In a few words, a new method called “/Write.withResults()/” was added,
> which returns a /WriteVoid/ transform that provides a “/PCollection<Void>/”
> as an output and can be used together with "/Wait.on()/". So, a simple
> example of writing into two different databases can look like this:
> 
> /PCollection<Void> firstWriteResults = data.apply(JdbcIO.write()
>     .withDataSourceConfiguration(CONF_DB_1).withResults());
> data.apply(Wait.on(firstWriteResults))
>     .apply(JdbcIO.write().withDataSourceConfiguration(CONF_DB_2));/
> 
> [1] https://issues.apache.org/jira/browse/BEAM-6732
> 
>> On 22 Feb 2019, at 16:52, Alexey Romanenko wrote:
>>
>> I have created new Jira issue for this feature:
>> https://issues.apache.org/jira/browse/BEAM-6732
>>
>> Jonathan, feel free to assign it to yourself if you want to
>> contribute, it is always welcomed =)
>>
>>> On 21 Feb 2019, at 10:23, Jonathan Perron wrote:
>>>
>>> Thank you Eugene for your answer.
>>>
>>> According to your explanation, I think I will go with your 3rd
>>> solution, as this seems the most robust and friendly way to act.
>>>
>>> Jonathan
>>>
>>> On 21/02/2019 02:22, Eugene Kirpichov wrote:
 Hi Jonathan,

 Wait.on() requires a PCollection - it is not possible to change it
 to wait on PDone because all PDone's in the pipeline are the same so
 it's not clear what exactly you'd be waiting on.

 To use the Wait transform with JdbcIO.write(), you would need to
 change 
 https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L761-L762
  to
 simply "return input.apply(ParDo.of(...))" and propagate that into
 the type signature. Then you'd get a waitable PCollection.

 This is a very simple, but backwards-incompatible change. Up to the
 Beam community whether/when people would want to make it.

 It's also possible to make a slightly larger but compatible change,
 where JdbcIO.write() would stay as is, but you could write e.g.
 "JdbcIO.write().withResults()" which would be a new transform that
 *does* return results and is waitable. A similar approach is taken
 in TextIO.write().withOutputFilenames().

 On Wed, Feb 20, 2019 at 4:58 AM Jonathan Perron wrote:

 Hello folks,

 I am meeting a special case where I need to wait for a
 JdbcIO.write()
 operation to be complete to start a second one.

 In the details, I have a PCollection> which
 is used
 to fill two different SQL statements. It is used in a first
 JdbcIO.write() operation to store anonymized users in a table
 (userId
 with an associated userUuid generated with UUID.randomUUID()).
 These two
 parameters have a unique constraint, meaning that a userId
 cannot have
 multiple userUuid. Unfortunately, on several runs of my
 pipeline, the
 UUID will be different, meaning that I need to query this table
 at some
 point, or to use what I describe in the following.

 I am planning to fill a second table with this userUuid with a
 couple of
 other pieces of information, such as the time of first visit. To limit I/O
 and as
 I got a lot of information in my PCollection, I want to use it
 once more
 with a different SQL statement, where the userUuid is read from the
 first table using a SELECT statement. This cannot work if the first
 JdbcIO.write() operation is not complete.

 I saw that the Java SDK proposes a Wait.on() PTransform, but it is
 unfortunately only compatible with PCollection, and not a PDone
 such as
 the one output from the JdbcIO operation. Could my issue be
 solved by
 expanding the Wait.On() or should I go with another solution? If so,
 how could I implement it ?

 Many thanks for your input !

 Jonathan

>>
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Wait on JdbcIO write completion

2019-04-17 Thread Alexey Romanenko
Hi Jonathan,

I just wanted to let you know that this feature [1] was implemented and,
finally, merged into master, so it should be included in the next Beam 2.13
release.

In a few words, a new method called “Write.withResults()” was added, which
returns a WriteVoid transform that provides a “PCollection<Void>” as an output
and can be used together with "Wait.on()". So, a simple example of writing into
two different databases can look like this:

PCollection<Void> firstWriteResults = data.apply(JdbcIO.write()
.withDataSourceConfiguration(CONF_DB_1).withResults());
data.apply(Wait.on(firstWriteResults))
.apply(JdbcIO.write().withDataSourceConfiguration(CONF_DB_2));

[1] https://issues.apache.org/jira/browse/BEAM-6732 
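As a loose plain-Java analogy (not Beam code; the class and method names below are invented for illustration), the effect of Wait.on(firstWriteResults) is like chaining the second write onto a Void-typed completion signal from the first, which is exactly the signal PDone cannot provide:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class WaitOnAnalogy {
    // Sketch: the first "write" completes with a Void-typed signal, and the
    // second "write" starts only after that signal fires -- analogous to
    // Wait.on(firstWriteResults) gating the second JdbcIO.write().
    static List<String> runSequenced() {
        List<String> log = new ArrayList<>();
        CompletableFuture<Void> firstWrite =
                CompletableFuture.runAsync(() -> log.add("write to DB1"));
        CompletableFuture<Void> secondWrite =
                firstWrite.thenRun(() -> log.add("write to DB2"));
        secondWrite.join(); // wait for the whole chain to finish
        return log;
    }

    public static void main(String[] args) {
        System.out.println(runSequenced()); // [write to DB1, write to DB2]
    }
}
```

The analogy is deliberately rough: in Beam the dependency is expressed per window over a PCollection, not as a single future.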


> On 22 Feb 2019, at 16:52, Alexey Romanenko  wrote:
> 
> I have created new Jira issue for this feature:
> https://issues.apache.org/jira/browse/BEAM-6732 
> 
> 
> Jonathan, feel free to assign it to yourself if you want to contribute, it is 
> always welcomed =)
> 
>> On 21 Feb 2019, at 10:23, Jonathan Perron wrote:
>> 
>> Thank you Eugene for your answer.
>> 
>> According to your explanation, I think I will go with your 3rd solution, as 
>> this seems the most robust and friendly way to act.
>> 
>> Jonathan
>> On 21/02/2019 02:22, Eugene Kirpichov wrote:
>>> Hi Jonathan,
>>> 
>>> Wait.on() requires a PCollection - it is not possible to change it to wait 
>>> on PDone because all PDone's in the pipeline are the same so it's not clear 
>>> what exactly you'd be waiting on.
>>> 
>>> To use the Wait transform with JdbcIO.write(), you would need to change 
>>> https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L761-L762
>>>  to simply "return input.apply(ParDo.of(...))" and propagate that into the 
>>> type signature. Then you'd get a waitable PCollection.
>>> 
>>> This is a very simple, but backwards-incompatible change. Up to the Beam 
>>> community whether/when people would want to make it.
>>> 
>>> It's also possible to make a slightly larger but compatible change, where 
>>> JdbcIO.write() would stay as is, but you could write e.g. 
>>> "JdbcIO.write().withResults()" which would be a new transform that *does* 
>>> return results and is waitable. A similar approach is taken in 
>>> TextIO.write().withOutputFilenames().
>>> 
>>> On Wed, Feb 20, 2019 at 4:58 AM Jonathan Perron wrote:
>>> Hello folks,
>>> 
>>> I am meeting a special case where I need to wait for a JdbcIO.write() 
>>> operation to be complete to start a second one.
>>> 
>>> In the details, I have a PCollection> which is used 
>>> to fill two different SQL statements. It is used in a first 
>>> JdbcIO.write() operation to store anonymized users in a table (userId 
>>> with an associated userUuid generated with UUID.randomUUID()). These two 
>>> parameters have a unique constraint, meaning that a userId cannot have 
>>> multiple userUuid. Unfortunately, on several runs of my pipeline, the 
>>> UUID will be different, meaning that I need to query this table at some 
>>> point, or to use what I describe in the following.
>>> 
>>> I am planning to fill a second table with this userUuid with a couple of 
>>> other pieces of information, such as the time of first visit. To limit I/O and as 
>>> I got a lot of information in my PCollection, I want to use it once more 
>>> with a different SQL statement, where the userUuid is read from the 
>>> first table using a SELECT statement. This cannot work if the first 
>>> JdbcIO.write() operation is not complete.
>>> 
>>> I saw that the Java SDK proposes a Wait.on() PTransform, but it is 
>>> unfortunately only compatible with PCollection, and not a PDone such as 
>>> the one output from the JdbcIO operation. Could my issue be solved by 
>>> expanding the Wait.On() or should I go with another solution? If so, 
>>> how could I implement it ?
>>> 
>>> Many thanks for your input !
>>> 
>>> Jonathan
>>> 
> 



Re: [DISCUSS] Adding GroupByKeyAndSort

2019-04-17 Thread Viliam Durina
> Combine.perKey ... certainly is standardized / well-defined

Is there any document where it's defined?

Viliam

On Tue, 16 Apr 2019 at 18:27, Kenneth Knowles  wrote:

> On Tue, Apr 16, 2019 at 9:18 AM Reuven Lax  wrote:
>
>> A common request (especially in streaming) is to support sorting values
>> by timestamp, not by the full value.
>>
>
> On this point, I think an explicit secondary key probably addresses the
> need. Naively implemented, the "sort by values" use case would have a lot
> of data duplication so we might have some payload on the transform to
> configure that, or a couple of related transforms.
>
> Kenn
>
>
>>
>> Reuven
>>
>> On Tue, Apr 16, 2019 at 9:08 AM Kenneth Knowles  wrote:
>>
>>> 1. This is clearly useful, and extensively used. Agree with all that. I
>>> think it can work for batch and streaming equally well if sorting is
>>> required only per "pane", though I might be overlooking something.
>>>
>>> 2. A transform need not be primitive to be well-defined and executed in
>>> a special way by most runners. For example, Combine.perKey is not a
>>> "primitive", where primitive means "axiomatic, lacking an expansion to
>>> other transforms". It has a composite definition in terms of other
>>> transforms. However, it certainly is standardized / well-defined and
>>> executed in a custom way by all runners, with the possible exception of
>>> direct runners (I didn't double check this). To make something a
>>> standardized well-defined transform it just needs a URN and an explicitly
>>> documented payload that goes along with the URN (which might be empty).
>>> Apologies if this is going into details you already know; I just want to
>>> emphasize that this is a key aspect of Beam's design, avoiding
>>> proliferation of primitives while allowing runners to optimize execution.
>>>
>>> In order for GroupByKeyAndSortValues* to have a status analogous to
>>> Combine.perKey it needs a URN (say, "beam:transforms:gbk-and-sort-values")
>>> and a code location where it can have a fallback composite definition. I
>>> would suggest piloting the idea of making experimental features opt-in
>>> includes with "experimental" in the artifact id, so something like artifact
>>> id "org.apache.beam:beam-sdks-java-experimental-gbk-and-sort-values" (very
>>> long, open to improvement). Another idea would be
>>> "org.apache.beam.experiments" as a group id.
>>>
>>> Kenn
>>>
>>> *Note that BatchViewOverrides.GroupByKeyAndSortValuesOnly is actually an
>>> even lower-level primitive, the "Only" part indicates that it is windowing
>>> and event time unaware.
>>>
>>> On Tue, Apr 16, 2019 at 7:42 AM Gleb Kanterov  wrote:
>>>
 At the moment, portability has a GroupByKey transform. In most data
 processing frameworks, such as Hadoop MR and Apache Spark, there is a
 concept of secondary sorting during the shuffle phase. Dataflow worker code
 has it under the name BatchViewOverrides.GroupByKeyAndSortValuesOnly [1];
 it's a PTransform<PCollection<KV<K1, KV<K2, V>>>,
 PCollection<KV<K1, Iterable<KV<K2, V>>>>>. It does sharding by K1 and
 sorting by K2 within each shard.

 I see a lot of value in adding GroupByKeyAndSort to the list of
 built-in transforms so that runners can efficiently override it. It's
 possible to define GroupByKeyAndSort as GroupByKey+SortValues [2]; however,
 having it as a primitive will open the possibility for a more efficient
 implementation. What could be the potential drawbacks? I didn't think much
 about how it could work for non-batch pipelines.

 Gleb

 [1]:
 https://github.com/spotify/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/BatchViewOverrides.java#L1246
 [2]:
 https://github.com/apache/beam/blob/master/sdks/java/extensions/sorter/src/main/java/org/apache/beam/sdk/extensions/sorter/SortValues.java
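For readers following along, the semantics described above can be sketched outside Beam in plain Java (illustrative names only; no windowing): shard (K1, (K2, V)) pairs by the primary key K1, then sort each shard's values by the secondary key K2.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SecondarySortSketch {
    // Group (K1, (K2, V)) pairs by K1, then sort each group's values by K2 --
    // the same shape as GroupByKeyAndSortValuesOnly, minus windowing.
    static Map<String, List<Map.Entry<Integer, String>>> groupAndSort(
            List<Map.Entry<String, Map.Entry<Integer, String>>> input) {
        Map<String, List<Map.Entry<Integer, String>>> grouped = input.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new, // deterministic shard order for printing
                        Collectors.mapping(Map.Entry::getValue,
                                Collectors.toList())));
        // Secondary sort: order the values within each shard by K2.
        grouped.values().forEach(values -> values.sort(Map.Entry.comparingByKey()));
        return grouped;
    }

    public static void main(String[] args) {
        var input = List.of(
                Map.entry("u1", Map.entry(3, "c")),
                Map.entry("u1", Map.entry(1, "a")),
                Map.entry("u2", Map.entry(2, "b")),
                Map.entry("u1", Map.entry(2, "b")));
        System.out.println(groupAndSort(input)); // {u1=[1=a, 2=b, 3=c], u2=[2=b]}
    }
}
```

A real runner would implement this inside the shuffle itself rather than sorting each key's values in memory, which is the efficiency argument for making it overridable.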




Re: New contributor to Beam

2019-04-17 Thread Jean-Baptiste Onofré
Welcome !

Regards
JB

On 17/04/2019 16:05, Cyrus Maden wrote:
> Hi all!
> 
> My name's Cyrus and I'd like to start contributing to Beam. I'm a
> technical writer so I'm particularly looking forward to contributing to
> the Beam docs. Could someone add me as a contributor on JIRA so I can
> create and assign tickets?
> 
> My JIRA name is: *cyrusmaden*
> *
> *
> Excited to be a part of this community and to work with ya'll!
> 
> Best,
> Cyrus

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


New contributor to Beam

2019-04-17 Thread Cyrus Maden
Hi all!

My name's Cyrus and I'd like to start contributing to Beam. I'm a technical
writer so I'm particularly looking forward to contributing to the Beam
docs. Could someone add me as a contributor on JIRA so I can create and
assign tickets?

My JIRA name is: *cyrusmaden*

Excited to be a part of this community and to work with y'all!

Best,
Cyrus


Re: Python SDK timestamp precision

2019-04-17 Thread Maximilian Michels

Hi,

Thanks for taking care of this issue in the Python SDK, Thomas!

It would be nice to have a uniform precision for timestamps but, as Kenn 
pointed out, timestamps are extracted from systems that have different 
precision.


To add to the list: Flink - milliseconds

After all, it doesn't matter as long as there is sufficient precision 
and conversions are done correctly.


I think we could improve the situation by at least adding a 
"milliseconds" constructor to the Python SDK's Timestamp.
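To make the precision loss concrete, here is a small plain-Java sketch (helper names invented for illustration) of the floor conversion between microsecond and millisecond timestamps; the sub-millisecond digits are simply dropped on the way down:

```java
public class TimestampPrecision {
    // Truncate a microsecond timestamp to milliseconds using floor division,
    // so negative (pre-epoch) timestamps still round toward minus infinity.
    static long microsToMillis(long micros) {
        return Math.floorDiv(micros, 1000L);
    }

    // Widen a millisecond timestamp back to microseconds (exact, overflow-checked).
    static long millisToMicros(long millis) {
        return Math.multiplyExact(millis, 1000L);
    }

    public static void main(String[] args) {
        long micros = 1_555_500_000_123_456L; // microsecond-precision timestamp
        long millis = microsToMillis(micros); // 1_555_500_000_123
        // The round trip is lossy: the trailing 456 microseconds are gone.
        System.out.println(micros - millisToMicros(millis)); // prints 456
    }
}
```

The direction of rounding matters when timestamps are compared against watermarks, which is part of why consistent conversion rules across SDKs are needed.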


Cheers,
Max

On 17.04.19 04:13, Kenneth Knowles wrote:
I am not so sure this is a good idea. Here are some systems and their 
precision:


Arrow - microseconds
BigQuery - microseconds
New Java instant - nanoseconds
Firestore - microseconds
Protobuf - nanoseconds
Dataflow backend - microseconds
Postgresql - microseconds
Pubsub publish time - nanoseconds
MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
Cassandra - milliseconds

IMO it is important to be able to treat any of these as a Beam 
timestamp, even though they aren't all streaming. Who knows when we 
might be ingesting a streamed changelog, or using them for reprocessing 
an archived stream. I think for this purpose we either should 
standardize on nanoseconds or make the runner's resolution independent 
of the data representation.


I've had some offline conversations about this. I think we can have 
higher-than-runner precision in the user data, and allow WindowFns and 
DoFns to operate on this higher-than-runner precision data, and still 
have consistent watermark treatment. Watermarks are just bounds, after all.


Kenn

On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise wrote:


The Python SDK currently uses timestamps in microsecond resolution
while Java SDK, as most would probably expect, uses milliseconds.

This causes a few difficulties with portability (Python coders need
to convert to millis for WindowedValue and Timers, which is related
to a bug I'm looking into:

https://issues.apache.org/jira/browse/BEAM-7035

As Luke pointed out, the issue was previously discussed:

https://issues.apache.org/jira/browse/BEAM-1524

I'm not privy to the reasons why we decided to go with micros in
first place, but would it be too big of a change or impractical for
other reasons to switch Python SDK to millis before it gets more users?

Thanks,
Thomas



Re: [VOTE] Release 2.12.0, release candidate #4

2019-04-17 Thread Jean-Baptiste Onofré
+1 (binding)

Quickly checked with beam-samples.

Regards
JB

On 16/04/2019 00:50, Andrew Pilloud wrote:
> Hi everyone,
> 
> Please review and vote on the release candidate #4 for the version
> 2.12.0, as follows:
> 
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
>  [2], which is signed with the key with
> fingerprint 9E7CEC0661EFD610B632C610AE8FE17F9F8AE3D4 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.12.0-RC4" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Java artifacts were built with Gradle/5.2.1 and OpenJDK/Oracle JDK
> 1.8.0_181.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org  [2].
> * Validation sheet with a tab for 2.12.0 release to help with validation
> [9].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> Thanks,
> Andrew
> 
> [1]
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12344944
> [2] https://dist.apache.org/repos/dist/dev/beam/2.12.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1068/
> [5] https://github.com/apache/beam/tree/v2.12.0-RC4 
> [6] https://github.com/apache/beam/pull/8215
> [7] https://github.com/apache/beam-site/pull/588
> [8] https://github.com/apache/beam/pull/8314
> [9] 
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1007316984

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com