Re: Jira components for cross-language transforms

2020-05-28 Thread Heejong Lee
If we use one meta component tag for all xlang-related issues, I would
prefer just "xlang". Then we could combine the "xlang" tag not only with
the language-specific sdk tags but also with other tags such as runner or
IO tags, e.g. ['xlang', 'io-java-kafka'], ['xlang', 'runner-dataflow'].

On Thu, May 28, 2020 at 7:49 PM Robert Burke  wrote:

> +1 to new component not split. The language concerns can be represented
> and filtered with the existing sdk tags. I know I'm interested in all
> sdk-go issues, and would prefer not to have to union tags when searching
> for Go related issues.
>
> On Thu, 28 May 2020 at 15:48, Ismaël Mejía  wrote:
>
>> +1 to new component not split
>>
>> Other use case is using libraries not available in your language e.g.
>> using some python transform that relies in a python only API in the middle
>> of a Java pipeline.
>>
>>
>> On Thu, May 28, 2020 at 11:12 PM Chamikara Jayalath 
>> wrote:
>>
>>> I proposed three components since the audience might be different. Also
>>> we can use the same component to track issues related to all cross-language
>>> wrappers available in a given SDK. If this is too much a single component
>>> is fine as well.
>>>
>>> Ashwin, as others pointed out, the cross-language transforms framework
>>> is primarily for giving SDKs access to transforms that are not
>>> available natively. But there are other potential use-cases as well (for
>>> example, using two different Python environments within the same
>>> pipeline).
>>> Exact performance will depend on the runner implementation as well as
>>> the additional cost involved due to serializing/deserializing data across
>>> environment boundaries. But we haven't done enough analysis/benchmarking to
>>> provide more details on this.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:
>>>
 > What are some of the benefits / drawbacks of using cross-language
 transforms? Would a native Python transform perform better than a
 cross-language transform written in Java that is then used in a Python
 pipeline?

 As Rui says, the main advantage is code reuse. See
 https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
 information.

 On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:

> +1 on dedicated components for cross-language transform. It might be
> easy to manage to have one component (one tag for all SDK) rather than
> multiple ones.
>
>
> Re Ashwin,
>
> Cham knows more than me. AFAIK, cross-language transforms will
> maximize code reuse for newly developed SDK (e.g. IO transforms for Go
> SDK). Of course, a SDK can develop its own IOs, but it's lots of work.
>
>
> -Rui
>
> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami <
> aramaswa...@gmail.com> wrote:
>
>> What are some of the benefits / drawbacks of using cross-language
>> transforms? Would a native Python transform perform better than a
>> cross-language transform written in Java that is then used in a Python
>> pipeline?
>>
>> Ashwin Ramaswami
>> Student
>> *Find me on my:* LinkedIn  |
>> Website  | GitHub
>> 
>>
>>
>> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver 
>> wrote:
>>
>>> SGTM. Though I'm not sure it's necessary to split by language. It
>>> might be easier to use a single cross-language tag, rather than having 
>>> to
>>> tag lots of issues as both sdks-python-xlang and sdks-java-xlang.
>>>
>>> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Hi All,

 I think it's good if we can have new Jira components to easily
 track various issues related to cross-language transforms.

 What do you think about adding the following Jira components ?

 sdks-python-xlang
 sdks-java-xlang
 sdks-go-xlang

 Jira component sdks-foo-xlang is for tracking issues related to
 cross-language transforms for SDK Foo. For example,
 * Issues related cross-language transforms wrappers written in SDK
 Foo
 * Issues related to transforms implemented in SDK Foo that are
 offered as cross-language transforms to other SDKs
 * Issues related to cross-language transform expansion service
 implemented for SDK Foo

 Thanks,
 Cham

>>>


Re: [Discuss] Build Kafka read transform on top of SplittableDoFn

2020-05-28 Thread Reuven Lax
This is per-partition, right? In that case I assume it will match the
current Kafka watermark.

On Thu, May 28, 2020 at 9:03 PM Boyuan Zhang  wrote:

> Hi Reuven,
>
> I'm going to use MonotonicallyIncreasing
> 
>  by
> default and in the future, we may want to support custom kind if there is a
> request.
>
> On Thu, May 28, 2020 at 8:54 PM Reuven Lax  wrote:
>
>> Which WatermarkEstimator do you think should be used?
>>
>> On Thu, May 28, 2020 at 7:17 PM Boyuan Zhang  wrote:
>>
>>> Hi team,
>>>
>>> I'm Boyuan, currently working on building a Kafka read PTransform on top
>>> of SplittableDoFn[1][2][3]. There are two questions about Kafka usage I
>>> want to discuss with you:
>>>
>>> 1.  Compared to the KafkaIO.Read
>>> ,
>>> the SplittableDoFn Kafka version allows taking TopicPartition and
>>> startReadTime as elements and processing them during execution time,
>>> instead of configuring topics at pipeline construction time. I'm wondering
>>> whether there are other configurations we also want to populate during
>>> pipeline execution time instead of construction time. Taking these
>>> configurations as elements would make value when they could be different
>>> for different TopicPartition. For a list of configurations we have now,
>>> please refer to KafkaIO.Read
>>> 
>>> .
>>>
>>> 2. I also want to offer a simple way for KafkaIO.Read to expand with the
>>> SDF version PTransform. Almost all configurations can be translated easily
>>> from KafkaIO.Read to the SDF version read except custom
>>> TimestampPolicyFactory (It's easy to translate build-in default types such
>>> as withProcessingTime
>>> ,
>>> withCreateTime
>>> 
>>> and withLogAppendTime
>>> .).
>>> With SplittableDoFn, we have WatermarkEstimator
>>> 
>>> to track watermark per TopicPartition. Thus, instead of
>>> TimestampPolicyFactory
>>> 
>>>  ,
>>> we need the user to provide a function which can extract output timestamp
>>> from a KafkaRecord(like withTimestampFn
>>> ).
>>> My question here is, are the default types enough for current Kafka.Read
>>> users? If the custom TimestampPolicy is really in common? Is it okay to use
>>> current API withTimestampFn
>>> 
>>>  in
>>> KafkaIO.Read to accept the custom function and populate it to the SDF read
>>> transform?
>>>
>>> Thanks for your help!
>>>
>>> [1] https://beam.apache.org/blog/splittable-do-fn/
>>> [2] https://s.apache.org/splittable-do-fn
>>> [3] My prototype PR https://github.com/apache/beam/pull/11749
>>>
>>


Re: [Discuss] Build Kafka read transform on top of SplittableDoFn

2020-05-28 Thread Boyuan Zhang
Hi Reuven,

I'm going to use MonotonicallyIncreasing

by
default and, in the future, we may want to support a custom kind if there is a
request.

On Thu, May 28, 2020 at 8:54 PM Reuven Lax  wrote:

> Which WatermarkEstimator do you think should be used?
>
> On Thu, May 28, 2020 at 7:17 PM Boyuan Zhang  wrote:
>
>> Hi team,
>>
>> I'm Boyuan, currently working on building a Kafka read PTransform on top
>> of SplittableDoFn[1][2][3]. There are two questions about Kafka usage I
>> want to discuss with you:
>>
>> 1.  Compared to the KafkaIO.Read
>> ,
>> the SplittableDoFn Kafka version allows taking TopicPartition and
>> startReadTime as elements and processing them during execution time,
>> instead of configuring topics at pipeline construction time. I'm wondering
>> whether there are other configurations we also want to populate during
>> pipeline execution time instead of construction time. Taking these
>> configurations as elements would make value when they could be different
>> for different TopicPartition. For a list of configurations we have now,
>> please refer to KafkaIO.Read
>> 
>> .
>>
>> 2. I also want to offer a simple way for KafkaIO.Read to expand with the
>> SDF version PTransform. Almost all configurations can be translated easily
>> from KafkaIO.Read to the SDF version read except custom
>> TimestampPolicyFactory (It's easy to translate build-in default types such
>> as withProcessingTime
>> ,
>> withCreateTime
>> 
>> and withLogAppendTime
>> .).
>> With SplittableDoFn, we have WatermarkEstimator
>> 
>> to track watermark per TopicPartition. Thus, instead of
>> TimestampPolicyFactory
>> 
>>  ,
>> we need the user to provide a function which can extract output timestamp
>> from a KafkaRecord(like withTimestampFn
>> ).
>> My question here is, are the default types enough for current Kafka.Read
>> users? If the custom TimestampPolicy is really in common? Is it okay to use
>> current API withTimestampFn
>> 
>>  in
>> KafkaIO.Read to accept the custom function and populate it to the SDF read
>> transform?
>>
>> Thanks for your help!
>>
>> [1] https://beam.apache.org/blog/splittable-do-fn/
>> [2] https://s.apache.org/splittable-do-fn
>> [3] My prototype PR https://github.com/apache/beam/pull/11749
>>
>


Re: [Discuss] Build Kafka read transform on top of SplittableDoFn

2020-05-28 Thread Reuven Lax
Which WatermarkEstimator do you think should be used?

On Thu, May 28, 2020 at 7:17 PM Boyuan Zhang  wrote:

> Hi team,
>
> I'm Boyuan, currently working on building a Kafka read PTransform on top
> of SplittableDoFn[1][2][3]. There are two questions about Kafka usage I
> want to discuss with you:
>
> 1.  Compared to the KafkaIO.Read
> ,
> the SplittableDoFn Kafka version allows taking TopicPartition and
> startReadTime as elements and processing them during execution time,
> instead of configuring topics at pipeline construction time. I'm wondering
> whether there are other configurations we also want to populate during
> pipeline execution time instead of construction time. Taking these
> configurations as elements would make value when they could be different
> for different TopicPartition. For a list of configurations we have now,
> please refer to KafkaIO.Read
> 
> .
>
> 2. I also want to offer a simple way for KafkaIO.Read to expand with the
> SDF version PTransform. Almost all configurations can be translated easily
> from KafkaIO.Read to the SDF version read except custom
> TimestampPolicyFactory (It's easy to translate build-in default types such
> as withProcessingTime
> ,
> withCreateTime
> 
> and withLogAppendTime
> .).
> With SplittableDoFn, we have WatermarkEstimator
> 
> to track watermark per TopicPartition. Thus, instead of
> TimestampPolicyFactory
> 
>  ,
> we need the user to provide a function which can extract output timestamp
> from a KafkaRecord(like withTimestampFn
> ).
> My question here is, are the default types enough for current Kafka.Read
> users? If the custom TimestampPolicy is really in common? Is it okay to use
> current API withTimestampFn
> 
>  in
> KafkaIO.Read to accept the custom function and populate it to the SDF read
> transform?
>
> Thanks for your help!
>
> [1] https://beam.apache.org/blog/splittable-do-fn/
> [2] https://s.apache.org/splittable-do-fn
> [3] My prototype PR https://github.com/apache/beam/pull/11749
>


Re: Contributor permission for beam jira tickets

2020-05-28 Thread Robert Burke
Welcome! I think we've interacted on Slack, but please feel free to tag me
if you have questions or would like PRs reviewed and merged. I'm @lostluck
both on the beam-go Slack and on GitHub.

On Wed, 27 May 2020 at 14:44, Gris Cuevas  wrote:

> Welcome!
>
> On 2020/05/27 09:12:52, Aaron Tillekeratne 
> wrote:
> > Hi,
> >
> > I'm Aaron, a newbie Go and Beam developer, but I'm interested in getting
> > involved and helping out with the Beam Go SDK.
> >
> > I want to start with just the starter tasks and slowly tackle bigger
> > problems. My Jira id is codeBehindMe, and I'd like to be able to assign
> > tasks to myself as a contributor.
> >
> > Cheers,
> > Aaron
> >
> >
> > --
> > Aaron Tillekeratne
> > Software Engineer / Data Scientist
> > BEng, GDip (Data Science)
> >
>


Re: Jira components for cross-language transforms

2020-05-28 Thread Robert Burke
+1 to new component not split. The language concerns can be represented and
filtered with the existing sdk tags. I know I'm interested in all sdk-go
issues, and would prefer not to have to union tags when searching for Go
related issues.

On Thu, 28 May 2020 at 15:48, Ismaël Mejía  wrote:

> +1 to new component not split
>
> Other use case is using libraries not available in your language e.g.
> using some python transform that relies in a python only API in the middle
> of a Java pipeline.
>
>
> On Thu, May 28, 2020 at 11:12 PM Chamikara Jayalath 
> wrote:
>
>> I proposed three components since the audience might be different. Also
>> we can use the same component to track issues related to all cross-language
>> wrappers available in a given SDK. If this is too much a single component
>> is fine as well.
>>
>> Ashwin, as others pointed out, the cross-language transforms framework is
>> primarily for giving SDKs access to transforms that are not
>> available natively. But there are other potential use-cases as well (for
>> example, using two different Python environments within the same
>> pipeline).
>> Exact performance will depend on the runner implementation as well as the
>> additional cost involved due to serializing/deserializing data across
>> environment boundaries. But we haven't done enough analysis/benchmarking to
>> provide more details on this.
>>
>> Thanks,
>> Cham
>>
>> On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:
>>
>>> > What are some of the benefits / drawbacks of using cross-language
>>> transforms? Would a native Python transform perform better than a
>>> cross-language transform written in Java that is then used in a Python
>>> pipeline?
>>>
>>> As Rui says, the main advantage is code reuse. See
>>> https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
>>> information.
>>>
>>> On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:
>>>
 +1 on dedicated components for cross-language transform. It might be
 easy to manage to have one component (one tag for all SDK) rather than
 multiple ones.


 Re Ashwin,

 Cham knows more than me. AFAIK, cross-language transforms will maximize
 code reuse for newly developed SDK (e.g. IO transforms for Go SDK). Of
 course, a SDK can develop its own IOs, but it's lots of work.


 -Rui

 On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
 wrote:

> What are some of the benefits / drawbacks of using cross-language
> transforms? Would a native Python transform perform better than a
> cross-language transform written in Java that is then used in a Python
> pipeline?
>
> Ashwin Ramaswami
> Student
> *Find me on my:* LinkedIn  |
> Website  | GitHub
> 
>
>
> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver 
> wrote:
>
>> SGTM. Though I'm not sure it's necessary to split by language. It
>> might be easier to use a single cross-language tag, rather than having to
>> tag lots of issues as both sdks-python-xlang and sdks-java-xlang.
>>
>> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> Hi All,
>>>
>>> I think it's good if we can have new Jira components to easily track
>>> various issues related to cross-language transforms.
>>>
>>> What do you think about adding the following Jira components ?
>>>
>>> sdks-python-xlang
>>> sdks-java-xlang
>>> sdks-go-xlang
>>>
>>> Jira component sdks-foo-xlang is for tracking issues related to
>>> cross-language transforms for SDK Foo. For example,
>>> * Issues related cross-language transforms wrappers written in SDK
>>> Foo
>>> * Issues related to transforms implemented in SDK Foo that are
>>> offered as cross-language transforms to other SDKs
>>> * Issues related to cross-language transform expansion service
>>> implemented for SDK Foo
>>>
>>> Thanks,
>>> Cham
>>>
>>


[Discuss] Build Kafka read transform on top of SplittableDoFn

2020-05-28 Thread Boyuan Zhang
Hi team,

I'm Boyuan, currently working on building a Kafka read PTransform on top of
SplittableDoFn[1][2][3]. There are two questions about Kafka usage I want
to discuss with you:

1.  Compared to the KafkaIO.Read
,
the SplittableDoFn Kafka version allows taking TopicPartition and
startReadTime as elements and processing them during execution time,
instead of configuring topics at pipeline construction time. I'm wondering
whether there are other configurations we also want to populate during
pipeline execution time instead of construction time. Taking these
configurations as elements would add value when they could be different
for different TopicPartitions. For a list of configurations we have now,
please refer to KafkaIO.Read

.

2. I also want to offer a simple way for KafkaIO.Read to expand with the
SDF version PTransform. Almost all configurations can be translated easily
from KafkaIO.Read to the SDF version read except custom
TimestampPolicyFactory (It's easy to translate built-in default types such
as withProcessingTime
,
withCreateTime

and withLogAppendTime
.).
With SplittableDoFn, we have WatermarkEstimator

to track watermark per TopicPartition. Thus, instead of
TimestampPolicyFactory

,
we need the user to provide a function which can extract output timestamp
from a KafkaRecord (like withTimestampFn
).
My question here is: are the default types enough for current KafkaIO.Read
users? Is a custom TimestampPolicy really common? Is it okay to use
current API withTimestampFn

in
KafkaIO.Read to accept the custom function and populate it to the SDF read
transform?

Thanks for your help!

[1] https://beam.apache.org/blog/splittable-do-fn/
[2] https://s.apache.org/splittable-do-fn
[3] My prototype PR https://github.com/apache/beam/pull/11749
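
For concreteness, a minimal sketch of how the pieces in question 2 could fit
together: a per-TopicPartition SDF that takes a user-provided timestamp
function (standing in for a custom TimestampPolicyFactory) and wires it into
a MonotonicallyIncreasing watermark estimator. Names such as
ReadFromKafkaDoFn and extractTimestampFn are illustrative only, not the API
in the prototype PR, and the actual Kafka polling is elided.

import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.splittabledofn.WatermarkEstimators;
import org.apache.kafka.common.TopicPartition;
import org.joda.time.Instant;

@DoFn.UnboundedPerElement
class ReadFromKafkaDoFn extends DoFn<TopicPartition, KafkaRecord<byte[], byte[]>> {

  // User-supplied replacement for a custom TimestampPolicyFactory: maps each
  // record to its output timestamp (create time, log-append time, or a field).
  private final SerializableFunction<KafkaRecord<byte[], byte[]>, Instant> extractTimestampFn;

  ReadFromKafkaDoFn(
      SerializableFunction<KafkaRecord<byte[], byte[]>, Instant> extractTimestampFn) {
    this.extractTimestampFn = extractTimestampFn;
  }

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element TopicPartition partition) {
    // Unbounded offset range; resolving the start offset (e.g. from
    // startReadTime) is omitted in this sketch.
    return new OffsetRange(0L, Long.MAX_VALUE);
  }

  @GetInitialWatermarkEstimatorState
  public Instant initialWatermarkState() {
    return Instant.EPOCH; // assumption: start low until records are observed
  }

  @NewWatermarkEstimator
  public WatermarkEstimators.MonotonicallyIncreasing newWatermarkEstimator(
      @WatermarkEstimatorState Instant state) {
    // Observes the timestamps of emitted records and reports the largest one
    // seen so far as the watermark for this partition.
    return new WatermarkEstimators.MonotonicallyIncreasing(state);
  }

  @ProcessElement
  public ProcessContinuation processElement(
      @Element TopicPartition partition,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<KafkaRecord<byte[], byte[]>> receiver) {
    // Kafka polling elided. For each fetched record:
    //   if (!tracker.tryClaim(record.getOffset())) { return ProcessContinuation.stop(); }
    //   receiver.outputWithTimestamp(record, extractTimestampFn.apply(record));
    // so the MonotonicallyIncreasing estimator observes the output timestamps.
    return ProcessContinuation.resume();
  }
}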


Re: Jira components for cross-language transforms

2020-05-28 Thread Ismaël Mejía
+1 to new component not split

Another use case is using libraries not available in your language, e.g.
using some Python transform that relies on a Python-only API in the middle
of a Java pipeline.


On Thu, May 28, 2020 at 11:12 PM Chamikara Jayalath 
wrote:

> I proposed three components since the audience might be different. Also we
> can use the same component to track issues related to all cross-language
> wrappers available in a given SDK. If this is too much a single component
> is fine as well.
>
> Ashwin, as others pointed out, the cross-language transforms framework is
> primarily for giving SDKs access to transforms that are not
> available natively. But there are other potential use-cases as well (for
> example, using two different Python environments within the same
> pipeline).
> Exact performance will depend on the runner implementation as well as the
> additional cost involved due to serializing/deserializing data across
> environment boundaries. But we haven't done enough analysis/benchmarking to
> provide more details on this.
>
> Thanks,
> Cham
>
> On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:
>
>> > What are some of the benefits / drawbacks of using cross-language
>> transforms? Would a native Python transform perform better than a
>> cross-language transform written in Java that is then used in a Python
>> pipeline?
>>
>> As Rui says, the main advantage is code reuse. See
>> https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
>> information.
>>
>> On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:
>>
>>> +1 on dedicated components for cross-language transform. It might be
>>> easy to manage to have one component (one tag for all SDK) rather than
>>> multiple ones.
>>>
>>>
>>> Re Ashwin,
>>>
>>> Cham knows more than me. AFAIK, cross-language transforms will maximize
>>> code reuse for newly developed SDK (e.g. IO transforms for Go SDK). Of
>>> course, a SDK can develop its own IOs, but it's lots of work.
>>>
>>>
>>> -Rui
>>>
>>> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
>>> wrote:
>>>
 What are some of the benefits / drawbacks of using cross-language
 transforms? Would a native Python transform perform better than a
 cross-language transform written in Java that is then used in a Python
 pipeline?

 Ashwin Ramaswami
 Student
 *Find me on my:* LinkedIn  |
 Website  | GitHub
 


 On Thu, May 28, 2020 at 4:44 PM Kyle Weaver 
 wrote:

> SGTM. Though I'm not sure it's necessary to split by language. It
> might be easier to use a single cross-language tag, rather than having to
> tag lots of issues as both sdks-python-xlang and sdks-java-xlang.
>
> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Hi All,
>>
>> I think it's good if we can have new Jira components to easily track
>> various issues related to cross-language transforms.
>>
>> What do you think about adding the following Jira components ?
>>
>> sdks-python-xlang
>> sdks-java-xlang
>> sdks-go-xlang
>>
>> Jira component sdks-foo-xlang is for tracking issues related to
>> cross-language transforms for SDK Foo. For example,
>> * Issues related cross-language transforms wrappers written in SDK Foo
>> * Issues related to transforms implemented in SDK Foo that are
>> offered as cross-language transforms to other SDKs
>> * Issues related to cross-language transform expansion service
>> implemented for SDK Foo
>>
>> Thanks,
>> Cham
>>
>


Re: Jira components for cross-language transforms

2020-05-28 Thread Chamikara Jayalath
I proposed three components since the audience might be different. Also we
can use the same component to track issues related to all cross-language
wrappers available in a given SDK. If this is too much, a single component
is fine as well.

Ashwin, as others pointed out, the cross-language transforms framework is
primarily for giving SDKs access to transforms that are not
available natively. But there are other potential use-cases as well (for
example, using two different Python environments within the same
pipeline).
Exact performance will depend on the runner implementation as well as the
additional cost involved due to serializing/deserializing data across
environment boundaries. But we haven't done enough analysis/benchmarking to
provide more details on this.

Thanks,
Cham

On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:

> > What are some of the benefits / drawbacks of using cross-language
> transforms? Would a native Python transform perform better than a
> cross-language transform written in Java that is then used in a Python
> pipeline?
>
> As Rui says, the main advantage is code reuse. See
> https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
> information.
>
> On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:
>
>> +1 on dedicated components for cross-language transform. It might be easy
>> to manage to have one component (one tag for all SDK) rather than
>> multiple ones.
>>
>>
>> Re Ashwin,
>>
>> Cham knows more than me. AFAIK, cross-language transforms will maximize
>> code reuse for newly developed SDK (e.g. IO transforms for Go SDK). Of
>> course, a SDK can develop its own IOs, but it's lots of work.
>>
>>
>> -Rui
>>
>> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
>> wrote:
>>
>>> What are some of the benefits / drawbacks of using cross-language
>>> transforms? Would a native Python transform perform better than a
>>> cross-language transform written in Java that is then used in a Python
>>> pipeline?
>>>
>>> Ashwin Ramaswami
>>> Student
>>> *Find me on my:* LinkedIn  |
>>> Website  | GitHub
>>> 
>>>
>>>
>>> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver  wrote:
>>>
 SGTM. Though I'm not sure it's necessary to split by language. It might
 be easier to use a single cross-language tag, rather than having to tag
 lots of issues as both sdks-python-xlang and sdks-java-xlang.

 On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Hi All,
>
> I think it's good if we can have new Jira components to easily track
> various issues related to cross-language transforms.
>
> What do you think about adding the following Jira components ?
>
> sdks-python-xlang
> sdks-java-xlang
> sdks-go-xlang
>
> Jira component sdks-foo-xlang is for tracking issues related to
> cross-language transforms for SDK Foo. For example,
> * Issues related cross-language transforms wrappers written in SDK Foo
> * Issues related to transforms implemented in SDK Foo that are
> offered as cross-language transforms to other SDKs
> * Issues related to cross-language transform expansion service
> implemented for SDK Foo
>
> Thanks,
> Cham
>



Re: Jira components for cross-language transforms

2020-05-28 Thread Robert Bradshaw
+1 to a new component. I would not split things by language.

On Thu, May 28, 2020 at 1:55 PM Kyle Weaver  wrote:

> > What are some of the benefits / drawbacks of using cross-language
> transforms? Would a native Python transform perform better than a
> cross-language transform written in Java that is then used in a Python
> pipeline?
>
> As Rui says, the main advantage is code reuse. See
> https://beam.apache.org/roadmap/connectors-multi-sdk/ for more
> information.
>
> On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:
>
>> +1 on dedicated components for cross-language transform. It might be easy
>> to manage to have one component (one tag for all SDK) rather than
>> multiple ones.
>>
>>
>> Re Ashwin,
>>
>> Cham knows more than me. AFAIK, cross-language transforms will maximize
>> code reuse for newly developed SDK (e.g. IO transforms for Go SDK). Of
>> course, a SDK can develop its own IOs, but it's lots of work.
>>
>>
>> -Rui
>>
>> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
>> wrote:
>>
>>> What are some of the benefits / drawbacks of using cross-language
>>> transforms? Would a native Python transform perform better than a
>>> cross-language transform written in Java that is then used in a Python
>>> pipeline?
>>>
>>> Ashwin Ramaswami
>>> Student
>>> *Find me on my:* LinkedIn  |
>>> Website  | GitHub
>>> 
>>>
>>>
>>> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver  wrote:
>>>
 SGTM. Though I'm not sure it's necessary to split by language. It might
 be easier to use a single cross-language tag, rather than having to tag
 lots of issues as both sdks-python-xlang and sdks-java-xlang.

 On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath <
 chamik...@google.com> wrote:

> Hi All,
>
> I think it's good if we can have new Jira components to easily track
> various issues related to cross-language transforms.
>
> What do you think about adding the following Jira components ?
>
> sdks-python-xlang
> sdks-java-xlang
> sdks-go-xlang
>
> Jira component sdks-foo-xlang is for tracking issues related to
> cross-language transforms for SDK Foo. For example,
> * Issues related cross-language transforms wrappers written in SDK Foo
> * Issues related to transforms implemented in SDK Foo that are
> offered as cross-language transforms to other SDKs
> * Issues related to cross-language transform expansion service
> implemented for SDK Foo
>
> Thanks,
> Cham
>



Re: Jira components for cross-language transforms

2020-05-28 Thread Kyle Weaver
> What are some of the benefits / drawbacks of using cross-language
transforms? Would a native Python transform perform better than a
cross-language transform written in Java that is then used in a Python
pipeline?

As Rui says, the main advantage is code reuse. See
https://beam.apache.org/roadmap/connectors-multi-sdk/ for more information.

On Thu, May 28, 2020 at 4:53 PM Rui Wang  wrote:

> +1 on dedicated components for cross-language transform. It might be easy
> to manage to have one component (one tag for all SDK) rather than
> multiple ones.
>
>
> Re Ashwin,
>
> Cham knows more than me. AFAIK, cross-language transforms will maximize
> code reuse for newly developed SDK (e.g. IO transforms for Go SDK). Of
> course, a SDK can develop its own IOs, but it's lots of work.
>
>
> -Rui
>
> On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
> wrote:
>
>> What are some of the benefits / drawbacks of using cross-language
>> transforms? Would a native Python transform perform better than a
>> cross-language transform written in Java that is then used in a Python
>> pipeline?
>>
>> Ashwin Ramaswami
>> Student
>> *Find me on my:* LinkedIn  |
>> Website  | GitHub
>> 
>>
>>
>> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver  wrote:
>>
>>> SGTM. Though I'm not sure it's necessary to split by language. It might
>>> be easier to use a single cross-language tag, rather than having to tag
>>> lots of issues as both sdks-python-xlang and sdks-java-xlang.
>>>
>>> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath 
>>> wrote:
>>>
 Hi All,

 I think it's good if we can have new Jira components to easily track
 various issues related to cross-language transforms.

 What do you think about adding the following Jira components ?

 sdks-python-xlang
 sdks-java-xlang
 sdks-go-xlang

 Jira component sdks-foo-xlang is for tracking issues related to
 cross-language transforms for SDK Foo. For example,
 * Issues related cross-language transforms wrappers written in SDK Foo
 * Issues related to transforms implemented in SDK Foo that are
 offered as cross-language transforms to other SDKs
 * Issues related to cross-language transform expansion service
 implemented for SDK Foo

 Thanks,
 Cham

>>>


Re: Jira components for cross-language transforms

2020-05-28 Thread Rui Wang
+1 on dedicated components for cross-language transforms. It might be
easier to manage one component (one tag for all SDKs) rather than
multiple ones.


Re Ashwin,

Cham knows more than me. AFAIK, cross-language transforms will maximize
code reuse for newly developed SDKs (e.g. IO transforms for the Go SDK). Of
course, an SDK can develop its own IOs, but it's a lot of work.


-Rui

On Thu, May 28, 2020 at 1:50 PM Ashwin Ramaswami 
wrote:

> What are some of the benefits / drawbacks of using cross-language
> transforms? Would a native Python transform perform better than a
> cross-language transform written in Java that is then used in a Python
> pipeline?
>
> Ashwin Ramaswami
> Student
> *Find me on my:* LinkedIn  | Website
>  | GitHub 
>
>
> On Thu, May 28, 2020 at 4:44 PM Kyle Weaver  wrote:
>
>> SGTM. Though I'm not sure it's necessary to split by language. It might
>> be easier to use a single cross-language tag, rather than having to tag
>> lots of issues as both sdks-python-xlang and sdks-java-xlang.
>>
>> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath 
>> wrote:
>>
>>> Hi All,
>>>
>>> I think it's good if we can have new Jira components to easily track
>>> various issues related to cross-language transforms.
>>>
>>> What do you think about adding the following Jira components ?
>>>
>>> sdks-python-xlang
>>> sdks-java-xlang
>>> sdks-go-xlang
>>>
>>> Jira component sdks-foo-xlang is for tracking issues related to
>>> cross-language transforms for SDK Foo. For example,
>>> * Issues related cross-language transforms wrappers written in SDK Foo
>>> * Issues related to transforms implemented in SDK Foo that are
>>> offered as cross-language transforms to other SDKs
>>> * Issues related to cross-language transform expansion service
>>> implemented for SDK Foo
>>>
>>> Thanks,
>>> Cham
>>>
>>


Re: Jira components for cross-language transforms

2020-05-28 Thread Ashwin Ramaswami
What are some of the benefits / drawbacks of using cross-language
transforms? Would a native Python transform perform better than a
cross-language transform written in Java that is then used in a Python
pipeline?

Ashwin Ramaswami
Student
*Find me on my:* LinkedIn  | Website
 | GitHub 


On Thu, May 28, 2020 at 4:44 PM Kyle Weaver  wrote:

> SGTM. Though I'm not sure it's necessary to split by language. It might be
> easier to use a single cross-language tag, rather than having to tag lots
> of issues as both sdks-python-xlang and sdks-java-xlang.
>
> On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath 
> wrote:
>
>> Hi All,
>>
>> I think it's good if we can have new Jira components to easily track
>> various issues related to cross-language transforms.
>>
>> What do you think about adding the following Jira components ?
>>
>> sdks-python-xlang
>> sdks-java-xlang
>> sdks-go-xlang
>>
>> Jira component sdks-foo-xlang is for tracking issues related to
>> cross-language transforms for SDK Foo. For example,
>> * Issues related cross-language transforms wrappers written in SDK Foo
>> * Issues related to transforms implemented in SDK Foo that are offered as
>> cross-language transforms to other SDKs
>> * Issues related to cross-language transform expansion service
>> implemented for SDK Foo
>>
>> Thanks,
>> Cham
>>
>


Re: Jira components for cross-language transforms

2020-05-28 Thread Kyle Weaver
SGTM. Though I'm not sure it's necessary to split by language. It might be
easier to use a single cross-language tag, rather than having to tag lots
of issues as both sdks-python-xlang and sdks-java-xlang.

On Thu, May 28, 2020 at 4:29 PM Chamikara Jayalath 
wrote:

> Hi All,
>
> I think it's good if we can have new Jira components to easily track
> various issues related to cross-language transforms.
>
> What do you think about adding the following Jira components ?
>
> sdks-python-xlang
> sdks-java-xlang
> sdks-go-xlang
>
> Jira component sdks-foo-xlang is for tracking issues related to
> cross-language transforms for SDK Foo. For example,
> * Issues related cross-language transforms wrappers written in SDK Foo
> * Issues related to transforms implemented in SDK Foo that are offered as
> cross-language transforms to other SDKs
> * Issues related to cross-language transform expansion service implemented
> for SDK Foo
>
> Thanks,
> Cham
>


Jira components for cross-language transforms

2020-05-28 Thread Chamikara Jayalath
Hi All,

I think it's good if we can have new Jira components to easily track
various issues related to cross-language transforms.

What do you think about adding the following Jira components?

sdks-python-xlang
sdks-java-xlang
sdks-go-xlang

Jira component sdks-foo-xlang is for tracking issues related to
cross-language transforms for SDK Foo. For example,
* Issues related to cross-language transform wrappers written in SDK Foo
* Issues related to transforms implemented in SDK Foo that are offered as
cross-language transforms to other SDKs
* Issues related to cross-language transform expansion service implemented
for SDK Foo

Thanks,
Cham


Re: writing new IO with Maven dependencies

2020-05-28 Thread Luke Cwik
+dev 

On Thu, May 28, 2020 at 11:55 AM Ken Barr  wrote:

> I am currently developing an IO that I would like to eventually submit to
> Apache Beam project.  The IO itself is Apache2.0 licensed.
> Does every chained dependency I use need to be opensource?
>

The transitive dependency tree must have licenses from the ASF's approved
license list. See https://www.apache.org/legal/resolved.html for all the
details.


> If yes, how is this usually proven?
>

Typically the reviewer will ask you to provide the dependency tree and the
licenses of those dependencies if the reviewer doesn't do this themselves
or recognize the dependency itself. The reviewer will validate any
information that you provide.


> Is it enough that only Maven dependencies are used?
>
No.


Re: Kotlin Type Inference Issue for Primitives in DoFn

2020-05-28 Thread Reuven Lax
This means that the TypeDescriptors don't match. It could be something
weird with the Int type, or it could be Kotlin not propagating the generic
type parameters of the DoFn.
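
For reference, a minimal sketch of the ProcessContext-based workaround Rion
mentions below, which sidesteps @Element parameter matching entirely. It is
illustrative only (using the same pipeline setup as the snippet quoted
further down), not a fix for the underlying type-descriptor mismatch:

pipeline
    .apply(Create.of(1, 2, 3))
    .apply(ParDo.of(object : DoFn<Int, Int>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            // Reading the element via the context avoids the @Element
            // TypeDescriptor matching that fails above.
            val element: Int = context.element()
            context.output(element)
        }
    }))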

On Thu, May 28, 2020 at 8:03 AM Rion Williams  wrote:

> Hi Reuvan,
>
> Here's the complete stack trace:
>
> Exception in thread "main" java.lang.IllegalArgumentException: Type of
> @Element must match the DoFn typeCreate.Values/Read(CreateSource).out
> [PCollection]
> at
> org.apache.beam.sdk.transforms.ParDo.getDoFnSchemaInformation(ParDo.java:601)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:190)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:128)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:225)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:689)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:704)
> at
> org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:269)
> at
> org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:282)
> at
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:665)
> at
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
> at
> org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
> at
> org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:317)
> at
> org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:251)
> at
> org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:460)
> at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:260)
> at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
> at
> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:170)
> at
> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
>
> As mentioned earlier, I don't know if this is what should be expected (or
> if it's something worth addressing) within Beam or if the preferred
> approach would be to simply always rely on the use of the ProcessContext if
> you aren't natively writing your Beam applications in Kotlin.
>
> On 2020/05/27 20:30:43, Reuven Lax  wrote:
> > It could also be that Kotlin is defeating Beam's type analysis, if it
> > changes type-parameter ordering for example. It may also be that the
> > TypeToken framework we use for analyzing Java types isn't working
> properly
> > on these Kotlin types.
> >
> > On Wed, May 27, 2020 at 1:27 PM Reuven Lax  wrote:
> >
> > > Do you have the full stack trace from that exception?
> > >
> > > On Wed, May 27, 2020 at 1:13 PM Rion Williams 
> > > wrote:
> > >
> > >> Correct, Kotlin uses an Int type as opposed to Java’s integer,
> however in
> > >> this case I had assumed that since the PCollection being constructed
> and
> > >> used by the DoFn both use the same Kotlin Int type that it would be
> able to
> > >> bind properly (even when explicitly typing the Create to use the
> Kotlin
> > >> type).
> > >>
> > >> When doing the same thing with Kotlin Strings, the @Element attribute
> > >> works as expected, so I don’t know if this is an issue purely related
> to
> > >> underlying type conversions with numeric Kotlin types and what’s the
> best
> > >> way to handle this? I know using the ProcessContext works just as
> you’d
> > >> expect, however for simple transforms the @Element approach can be a
> bit
> > >> easier to grok.
> > >>
> > >> On May 27, 2020, at 3:01 PM, Reuven Lax  wrote:
> > >>
> > >> 
> > >> I'm assuming that Kotlin has its own type for Int, which is not the
> same
> > >> as Java's Integer type.
> > >>
> > >> On Fri, May 22, 2020 at 8:19 AM Rion Williams 
> > >> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> I was writing a very simple transform in Kotlin as follows that
> takes in
> > >>> a series of integers and applies a simply DoFn against them:
> > >>>
> > >>> pipeline
> > >>> .apply(Create.of(1, 2, 3))
> > >>> .apply(ParDo.of(object: DoFn(){
> > >>> @ProcessElement
> > >>> fun processElement(@Element element: Int){
> > >>> // Omitted for brevity
> > >>> }
> > >>> })
> > >>> )
> > >>>
> > >>> The issue seems to arise when we use the `@Element` attribute on the
> > >>> 

Re: Semantic versioning

2020-05-28 Thread Luke Cwik
Updating our documentation makes sense.

The backwards compat discussion is an interesting read. One of the points
that they mention is that they like Spark users to be on the latest Spark.
I can say that this is also true for Dataflow where we want users to be on
the latest version of Beam. In Beam, I have seen that backwards
compatibility is hard because the APIs that users use to construct their
pipelines, and what their functions use while the pipeline is executing,
reach into the internals of Beam and/or the runners. I was wondering whether
Spark is hitting the same issues in this regard.

With portability and the no-knobs philosophy, I can see that we should be
able to decouple the runner version from the Beam version much more, so we
might want to go in a different direction than what was proposed in the
Spark thread, since we may be able to achieve a greater level of decoupling.


On Thu, May 28, 2020 at 9:18 AM Ismaël Mejía  wrote:

> I am surprised that we are claiming in the Beam website to use semantic
> versioning (semver) [1] in Beam [2]. We have NEVER really followed semantic
> versioning and we have broken multiple times both internal and external
> APIs (at
> least for Java) as you can find in this analysis of source and binary
> compatibility between beam versions that I did for ‘sdks/java/core’ two
> months
> ago in the following link:
>
>
> https://cloudflare-ipfs.com/ipfs/QmQSkWYmzerpUjT7fhE9CF7M9hm2uvJXNpXi58mS8RKcNi/
>
> This report was produced by running the following script that excludes both
> @Experimental and @Internal annotations as well as many internal packages
> like
> ‘sdk/util/’, ‘transforms/reflect/’ and ‘sdk/testing/’ among others, for
> more
> details on the exclusions refer to this script code:
>
> https://gist.github.com/iemejia/5277fc269c63c4e49f1bb065454a895e
>
> Respecting semantic versioning is REALLY HARD and a strong compromise that
> may
> bring both positive and negative impact to the project, as usual it is all
> about
> trade-offs. Semver requires tooling that we do not have yet in place to
> find
> regressions before releases to fix them (or to augment major versions to
> respect
> the semver contract). We as a polyglot project need these tools for every
> supported language, and since all our languages live in the same
> repository and
> are released simultaneously an incompatible change in one language may
> trigger a
> full new major version number for the whole project which does not look
> like a
> desirable outcome.
>
> For these reasons I think we should soften the claim of using semantic
> versioning claim and producing our own Beam semantic versioning policy
> that is
> consistent with our reality where we can also highlight the lack of
> guarantees
> for code marked as @Internal and @Experimental as well as for some modules
> where
> we may be interested on still having the freedom of not guaranteeing
> stability
> like runners/core* or any class in the different runners that is not a
> PipelineOptions one.
>
> In general whatever we decide we should probably not be as strict but
> consider
> in detail the tradeoffs of the policy. There is an ongoing discussion on
> versioning in the Apache Spark community that is really worth the read and
> proposes an analysis between Costs to break and API vs costs to maintain
> an API
> [3]. I think we can use it as an inspiration for an initial version.
>
> WDYT?
>
> [1] https://semver.org/
> [2] https://beam.apache.org/get-started/downloads/
> [3]
> https://lists.apache.org/thread.html/r82f99ad8c2798629eed66d65f2cddc1ed196dddf82e8e9370f3b7d32%40%3Cdev.spark.apache.org%3E
>
>
> On Thu, May 28, 2020 at 4:36 PM Reuven Lax  wrote:
>
>> Most of those items are either in APIs marked @Experimental (the
>> definition of Experimental in Beam is that we can make breaking changes to
>> the API) or are changes in a specific runner - not the Beam API.
>>
>> Reuven
>>
>> On Thu, May 28, 2020 at 7:19 AM Ashwin Ramaswami 
>> wrote:
>>
>>> There's a "Breaking Changes" section on this blogpost:
>>> https://beam.apache.org/blog/beam-2.21.0/ (and really, for earlier
>>> minor versions too)
>>>
>>> Ashwin Ramaswami
>>> Student
>>> *Find me on my:* LinkedIn  |
>>> Website  | GitHub
>>> 
>>>
>>>
>>> On Thu, May 28, 2020 at 10:01 AM Reuven Lax  wrote:
>>>
 What did we break?

 On Thu, May 28, 2020, 6:31 AM Ashwin Ramaswami 
 wrote:

> Do we really use semantic versioning? It appears we introduced
> breaking changes from 2.20.0 -> 2.21.0. If not, we should update the
> documentation under "API Stability" on this page:
> https://beam.apache.org/get-started/downloads/
>
> What would be a better way to word the way in which we decide version
> numbering?
>



Re: SQL Windowing

2020-05-28 Thread Maximilian Michels
Thanks for the quick reply Brian! I've filed a JIRA for option (a):
https://jira.apache.org/jira/browse/BEAM-10143

Makes sense to define DATETIME as a logical type. I'll check out your
PR. We could work around this for now by doing a cast, e.g.:

  TUMBLE(CAST(f_timestamp AS DATETIME), INTERVAL '30' MINUTE)
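
Applied to the query from the original post, the workaround would look
roughly like this (a sketch; as noted below, the exact cast needed for the
Python micros timestamp is still open):

  SELECT field, COUNT(field)
  FROM PCOLLECTION
  WHERE ...
  GROUP BY field,
           TUMBLE(CAST(f_timestamp AS DATETIME), INTERVAL '30' MINUTE)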

Note that we may have to do a more sophisticated cast to convert the
Python micros into a DATETIME.

-Max

On 28.05.20 19:18, Brian Hulette wrote:
> Hey Max,
> Thanks for kicking the tires on SqlTransform in Python :)
> 
> We don't have any tests of windowing and Sql in Python yet, so I'm not
> that surprised you're running into issues here. Portable schemas don't
> support the DATETIME type, because we decided not to define it as one of
> the atomic types [1] and hope to add support via a logical type instead
> (see BEAM-7554 [2]). This was the motivation for the MillisInstant PR I
> put up, and the ongoing discussion [3].
> Regardless, that should only be an obstacle for option (b), where you'd
> need to have a DATETIME in the input and/or output PCollection of the
> SqlTransform. In theory option (a) should be possible, so I'd consider
> that a bug - can you file a jira for it?
> 
> Brian
> 
> [1] 
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto#L58
> [2] https://issues.apache.org/jira/browse/BEAM-7554
> [3] 
> https://lists.apache.org/thread.html/r2e05355b74fb5b8149af78ade1e3539ec08371a9a4b2b9e45737e6be%40%3Cdev.beam.apache.org%3E
> 
> On Thu, May 28, 2020 at 9:45 AM Maximilian Michels  > wrote:
> 
> Hi,
> 
> I'm using the SqlTransform as an external transform from within a Python
> pipeline. The SQL docs [1] mention that you can either (a) window the
> input or (b) window in the SQL query.
> 
> Option (a):
> 
>   input
>       | "Window >> beam.WindowInto(window.FixedWindows(30))
>       | "Aggregate" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field
>                    """)
> 
> This results in an exception:
> 
>   Caused by: java.lang.ClassCastException:
>   org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast
>   to org.apache.beam.sdk.transforms.windowing.GlobalWindow
> 
> => Is this a bug?
> 
> 
> Let's try Option (b):
> 
>   input
>       | "Aggregate & Window" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field,
>                                TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>                    """)
> 
> The issue that I'm facing here is that the timestamp is already assigned
> to my values but is not exposed as a field. So I need to use a DoFn to
> extract the timestamp as a new field:
> 
>   class GetTimestamp(beam.DoFn):
>     def process(self, event, timestamp=beam.DoFn.TimestampParam):
>       yield TimestampedRow(..., timestamp)
> 
>   input
>       | "Extract timestamp" >>
>       beam.ParDo(GetTimestamp())
>       | "Aggregate & Window" >>
>       SqlTransform("""Select field, count(field) from PCOLLECTION
>                       WHERE ...
>                       GROUP BY field,
>                                TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>                    """)
> 
> => It would be very convenient if there was a reserved field name which
> would point to the timestamp of an element. Maybe there is?
> 
> 
> -Max
> 
> 
> [1]
> 
> https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/
> 


Re: Python Cross-language wrappers for Java IOs

2020-05-28 Thread Piotr Szuberski



On 2020/05/28 16:54:47, Piotr Szuberski  wrote: 
> I added to Jira task of creating cross-language wrappers for Java IOs. It 
> will soon be in progress.
> https://issues.apache.org/jira/browse/BEAM-10134


Re: Python Cross-language wrappers for Java IOs

2020-05-28 Thread Chamikara Jayalath
Great. Thanks for working on this. Can you please add these tasks and JIRAs
to the cross-language transforms roadmap under "Connector/transform
support".
https://beam.apache.org/roadmap/connectors-multi-sdk/

Happy to help if you run into any issues during this task.

Thanks,
Cham

On Thu, May 28, 2020 at 9:59 AM Piotr Szuberski 
wrote:

> I added to Jira task of creating cross-language wrappers for Java IOs. It
> will soon be in progress.
>


Re: [ANNOUNCE] Beam 2.21.0 Released

2020-05-28 Thread Udi Meiri
Woohoo!

On Thu, May 28, 2020 at 4:16 AM Kyle Weaver  wrote:

> The Apache Beam team is pleased to announce the release of version 2.21.0.
>
> Apache Beam is an open source unified programming model to define and
> execute data processing pipelines, including ETL, batch and stream
> (continuous) processing. See https://beam.apache.org
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on
> the Beam blog: https://beam.apache.org/blog/beam-2.21.0/
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.21.0.
> -- Kyle Weaver, on behalf of The Apache Beam team
>
>


Re: SQL Windowing

2020-05-28 Thread Brian Hulette
Hey Max,
Thanks for kicking the tires on SqlTransform in Python :)

We don't have any tests of windowing and SQL in Python yet, so I'm not that
surprised you're running into issues here. Portable schemas don't support
the DATETIME type, because we decided not to define it as one of the atomic
types [1] and hope to add support via a logical type instead (see BEAM-7554
[2]). This was the motivation for the MillisInstant PR I put up, and the
ongoing discussion [3].
Regardless, that should only be an obstacle for option (b), where you'd
need to have a DATETIME in the input and/or output PCollection of the
SqlTransform. In theory option (a) should be possible, so I'd consider that
a bug - can you file a jira for it?

Brian

[1]
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto#L58
[2] https://issues.apache.org/jira/browse/BEAM-7554
[3]
https://lists.apache.org/thread.html/r2e05355b74fb5b8149af78ade1e3539ec08371a9a4b2b9e45737e6be%40%3Cdev.beam.apache.org%3E

On Thu, May 28, 2020 at 9:45 AM Maximilian Michels  wrote:

> Hi,
>
> I'm using the SqlTransform as an external transform from within a Python
> pipeline. The SQL docs [1] mention that you can either (a) window the
> input or (b) window in the SQL query.
>
> Option (a):
>
>   input
>   | "Window >> beam.WindowInto(window.FixedWindows(30))
>   | "Aggregate" >>
>   SqlTransform("""Select field, count(field) from PCOLLECTION
>   WHERE ...
>   GROUP BY field
>""")
>
> This results in an exception:
>
>   Caused by: java.lang.ClassCastException:
>   org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast
>   to org.apache.beam.sdk.transforms.windowing.GlobalWindow
>
> => Is this a bug?
>
>
> Let's try Option (b):
>
>   input
>   | "Aggregate & Window" >>
>   SqlTransform("""Select field, count(field) from PCOLLECTION
>   WHERE ...
>   GROUP BY field,
>TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>""")
>
> The issue that I'm facing here is that the timestamp is already assigned
> to my values but is not exposed as a field. So I need to use a DoFn to
> extract the timestamp as a new field:
>
>   class GetTimestamp(beam.DoFn):
> def process(self, event, timestamp=beam.DoFn.TimestampParam):
>   yield TimestampedRow(..., timestamp)
>
>   input
>   | "Extract timestamp" >>
>   beam.ParDo(GetTimestamp())
>   | "Aggregate & Window" >>
>   SqlTransform("""Select field, count(field) from PCOLLECTION
>   WHERE ...
>   GROUP BY field,
>TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
>""")
>
> => It would be very convenient if there was a reserved field name which
> would point to the timestamp of an element. Maybe there is?
>
>
> -Max
>
>
> [1]
>
> https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/
>


Python Cross-language wrappers for Java IOs

2020-05-28 Thread Piotr Szuberski
I added a task to Jira for creating cross-language wrappers for Java IOs. It
will soon be in progress.


Re: Proposal for reading from / writing to archive files

2020-05-28 Thread Robert Bradshaw
On Thu, May 28, 2020 at 9:34 AM Chamikara Jayalath 
wrote:

> Thanks for the contribution. This sounds very interesting. Few comments.
>
> * | fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() |
> fileio.MatchAll()
>
> We usually either do 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
> 'fileio.MatchAll()'. Former to read a specific glob and latter to read a
> PCollection of glob. We also have support for reading compressed files. We
> should add to that API instead of using both.
>
> * ArchiveSystem with list() and extract().
>
> Is this something we can add to the existing FileSystems abstraction
> instead of introducing a new abstraction ?
>

+1

In particular, something like
zip://hdfs://path/to/zip:glob/within/zip/*.txt could be a new zipfile
filesystem that can support parallel reads and delegate to any other
filesystem. One could then write

  p | fileio.MatchFiles('hdfs://path/to/*.zip')  # produces a PCollection
of zip file paths
| fileio.ExtractMatches()  # produces a PCollection of zip file
entries, using a zipfile filesystem
    | fileio.ReadMatches()  # actually reads the files. One could do a text
read, or whatever, here as well.
| ...

Note that tar files do not support random access (or even listing without
reading the entire contents), so are poorly suited for this.
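
To illustrate the random-access point, a minimal sketch of reading a single
member out of a zip archive via Beam's FileSystems plus the standard zipfile
module, which is roughly the capability a zipfile filesystem would expose.
This assumes the opened file object is seekable for the scheme in use;
archive_path and member_name are placeholders:

import zipfile

from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems


def read_member(archive_path, member_name):
    # FileSystems.open works for any registered scheme (local, HDFS, GCS, ...).
    # CompressionTypes.UNCOMPRESSED keeps Beam from trying to decompress the
    # stream; zipfile then seeks within the archive to read just one member.
    with FileSystems.open(
        archive_path, compression_type=CompressionTypes.UNCOMPRESSED) as f:
        with zipfile.ZipFile(f) as zf:
            return zf.read(member_name)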


> *
> fileio.CompressMatches
> fileio.WriteToArchive
>
> Is this scalable for a distributed system ? Usually we write a file per
> bundle.
>
> I suggest writing a doc with some background research related to how other
> data processing systems achieve this functionality so that we can try to
> determine if the functionality can be added to the existing API somehow.
>

Yeah, zip files are not writable in parallel. One /could/ do the
compression in parallel, and then have a final "writer" that just does
concat (with the appropriate headers) to the final zipfile(s).

On Wed, May 27, 2020 at 9:10 AM Ashwin Ramaswami 
> wrote:
>
>> I have a requirement where I need to read from / write to archive files
>> (such as .tar, .zip). Essentially, I'd like to treat the entire .zip file I
>> read from as a filesystem, so that I can only get the files I need that are
>> within the archive. This is useful, because some archive formats such as
>> .zip allow random access (so one does not need to read the entire zip file
>> in order to just read a single file from it).
>>
>> I've made an issue outlining how this might be designed -- would
>> appreciate any feedback or thoughts about how this might work!
>> https://issues.apache.org/jira/browse/BEAM-10111
>>
>


SQL Windowing

2020-05-28 Thread Maximilian Michels
Hi,

I'm using the SqlTransform as an external transform from within a Python
pipeline. The SQL docs [1] mention that you can either (a) window the
input or (b) window in the SQL query.

Option (a):

  input
  | "Window >> beam.WindowInto(window.FixedWindows(30))
  | "Aggregate" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field
   """)

This results in an exception:

  Caused by: java.lang.ClassCastException:
  org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast
  to org.apache.beam.sdk.transforms.windowing.GlobalWindow

=> Is this a bug?


Let's try Option (b):

  input
  | "Aggregate & Window" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field,
   TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
   """)

The issue that I'm facing here is that the timestamp is already assigned
to my values but is not exposed as a field. So I need to use a DoFn to
extract the timestamp as a new field:

  class GetTimestamp(beam.DoFn):
def process(self, event, timestamp=beam.DoFn.TimestampParam):
  yield TimestampedRow(..., timestamp)

  input
  | "Extract timestamp" >>
  beam.ParDo(GetTimestamp())
  | "Aggregate & Window" >>
  SqlTransform("""Select field, count(field) from PCOLLECTION
  WHERE ...
  GROUP BY field,
   TUMBLE(f_timestamp, INTERVAL '30' MINUTE)
   """)

=> It would be very convenient if there was a reserved field name which
would point to the timestamp of an element. Maybe there is?


-Max


[1]
https://beam.apache.org/documentation/dsls/sql/extensions/windowing-and-triggering/


Re: Proposal for reading from / writing to archive files

2020-05-28 Thread Chamikara Jayalath
Thanks for the contribution. This sounds very interesting. Few comments.

* | fileio.MatchFiles('hdfs://path/to/*.zip') | fileio.ExtractMatches() |
fileio.MatchAll()

We usually either do 'fileio.MatchFiles('hdfs://path/to/*.zip')' or
'fileio.MatchAll()'. The former reads a specific glob and the latter reads
a PCollection of globs. We also have support for reading compressed files.
We should add to that API instead of using both.

* ArchiveSystem with list() and extract().

Is this something we can add to the existing FileSystems abstraction
instead of introducing a new abstraction?

*
fileio.CompressMatches
fileio.WriteToArchive

Is this scalable for a distributed system? Usually we write a file per
bundle.

I suggest writing a doc with some background research related to how other
data processing systems achieve this functionality so that we can try to
determine if the functionality can be added to the existing API somehow.

Thanks,
Cham






On Wed, May 27, 2020 at 9:10 AM Ashwin Ramaswami 
wrote:

> I have a requirement where I need to read from / write to archive files
> (such as .tar, .zip). Essentially, I'd like to treat the entire .zip file I
> read from as a filesystem, so that I can only get the files I need that are
> within the archive. This is useful, because some archive formats such as
> .zip allow random access (so one does not need to read the entire zip file
> in order to just read a single file from it).
>
> I've made an issue outlining how this might be designed -- would
> appreciate any feedback or thoughts about how this might work!
> https://issues.apache.org/jira/browse/BEAM-10111
>


Re: Semantic versioning

2020-05-28 Thread Ismaël Mejía
I am surprised that we claim on the Beam website [2] to use semantic
versioning (semver) [1]. We have NEVER really followed semantic versioning,
and we have broken both internal and external APIs multiple times (at least
for Java), as you can see in this analysis of source and binary
compatibility between Beam versions that I did for ‘sdks/java/core’ two
months ago:

https://cloudflare-ipfs.com/ipfs/QmQSkWYmzerpUjT7fhE9CF7M9hm2uvJXNpXi58mS8RKcNi/

This report was produced by running the following script, which excludes
both @Experimental and @Internal annotations as well as many internal
packages like ‘sdk/util/’, ‘transforms/reflect/’ and ‘sdk/testing/’. For
more details on the exclusions, refer to the script itself:

https://gist.github.com/iemejia/5277fc269c63c4e49f1bb065454a895e

Respecting semantic versioning is REALLY HARD and a strong commitment that
may bring both positive and negative impact to the project; as usual, it is
all about trade-offs. Semver requires tooling that we do not yet have in
place to find regressions before releases so that we can fix them (or bump
the major version to respect the semver contract). As a polyglot project we
need these tools for every supported language, and since all our languages
live in the same repository and are released simultaneously, an
incompatible change in one language may trigger a new major version for the
whole project, which does not look like a desirable outcome.

For these reasons I think we should soften the claim of using semantic
versioning and produce our own Beam versioning policy that is consistent
with our reality, one that also highlights the lack of guarantees for code
marked as @Internal and @Experimental as well as for modules where we may
still want the freedom of not guaranteeing stability, such as runners/core*
or any class in the different runners that is not a PipelineOptions one.

In general, whatever we decide, we should probably not be as strict as
semver but consider the trade-offs of the policy in detail. There is an
ongoing discussion on versioning in the Apache Spark community that is
really worth reading and weighs the cost of breaking an API against the
cost of maintaining it [3]. I think we can use it as inspiration for an
initial version.

WDYT?

[1] https://semver.org/
[2] https://beam.apache.org/get-started/downloads/
[3]
https://lists.apache.org/thread.html/r82f99ad8c2798629eed66d65f2cddc1ed196dddf82e8e9370f3b7d32%40%3Cdev.spark.apache.org%3E


On Thu, May 28, 2020 at 4:36 PM Reuven Lax  wrote:

> Most of those items are either in APIs marked @Experimental (the
> definition of Experimental in Beam is that we can make breaking changes to
> the API) or are changes in a specific runner - not the Beam API.
>
> Reuven
>
> On Thu, May 28, 2020 at 7:19 AM Ashwin Ramaswami 
> wrote:
>
>> There's a "Breaking Changes" section on this blogpost:
>> https://beam.apache.org/blog/beam-2.21.0/ (and really, for earlier minor
>> versions too)
>>
>> Ashwin Ramaswami
>> Student
>> *Find me on my:* LinkedIn  |
>> Website  | GitHub
>> 
>>
>>
>> On Thu, May 28, 2020 at 10:01 AM Reuven Lax  wrote:
>>
>>> What did we break?
>>>
>>> On Thu, May 28, 2020, 6:31 AM Ashwin Ramaswami 
>>> wrote:
>>>
 Do we really use semantic versioning? It appears we introduced breaking
 changes from 2.20.0 -> 2.21.0. If not, we should update the documentation
 under "API Stability" on this page:
 https://beam.apache.org/get-started/downloads/

 What would be a better way to word the way in which we decide version
 numbering?

>>>


Re: Kotlin Type Inference Issue for Primitives in DoFn

2020-05-28 Thread Rion Williams
Hi Reuven,

Here's the complete stack trace:

Exception in thread "main" java.lang.IllegalArgumentException: Type of @Element 
must match the DoFn typeCreate.Values/Read(CreateSource).out [PCollection]
at 
org.apache.beam.sdk.transforms.ParDo.getDoFnSchemaInformation(ParDo.java:601)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.translateParDo(ParDoTranslation.java:190)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation$ParDoTranslator.translate(ParDoTranslation.java:128)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformTranslation.toProto(PTransformTranslation.java:225)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.getParDoPayload(ParDoTranslation.java:689)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.ParDoTranslation.isSplittable(ParDoTranslation.java:704)
at 
org.apache.beam.repackaged.direct_java.runners.core.construction.PTransformMatchers$6.matches(PTransformMatchers.java:269)
at 
org.apache.beam.sdk.Pipeline$2.visitPrimitiveTransform(Pipeline.java:282)
at 
org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:665)
at 
org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at 
org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:657)
at 
org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:317)
at 
org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:460)
at org.apache.beam.sdk.Pipeline.replace(Pipeline.java:260)
at org.apache.beam.sdk.Pipeline.replaceAll(Pipeline.java:210)
at 
org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:170)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)

As mentioned earlier, I don't know if this is what should be expected (or if 
it's something worth addressing) within Beam or if the preferred approach would 
be to simply always rely on the use of the ProcessContext if you aren't 
natively writing your Beam applications in Kotlin.

On 2020/05/27 20:30:43, Reuven Lax  wrote: 
> It could also be that Kotlin is defeating Beam's type analysis, if it
> changes type-parameter ordering for example. It may also be that the
> TypeToken framework we use for analyzing Java types isn't working properly
> on these Kotlin types.
> 
> On Wed, May 27, 2020 at 1:27 PM Reuven Lax  wrote:
> 
> > Do you have the full stack trace from that exception?
> >
> > On Wed, May 27, 2020 at 1:13 PM Rion Williams 
> > wrote:
> >
> >> Correct, Kotlin uses an Int type as opposed to Java’s integer, however in
> >> this case I had assumed that since the PCollection being constructed and
> >> used by the DoFn both use the same Kotlin Int type that it would be able to
> >> bind properly (even when explicitly typing the Create to use the Kotlin
> >> type).
> >>
> >> When doing the same thing with Kotlin Strings, the @Element attribute
> >> works as expected, so I don’t know if this is an issue purely related to
> >> underlying type conversions with numeric Kotlin types and what’s the best
> >> way to handle this? I know using the ProcessContext works just as you’d
> >> expect, however for simple transforms the @Element approach can be a bit
> >> easier to grok.
> >>
> >> On May 27, 2020, at 3:01 PM, Reuven Lax  wrote:
> >>
> >> 
> >> I'm assuming that Kotlin has its own type for Int, which is not the same
> >> as Java's Integer type.
> >>
> >> On Fri, May 22, 2020 at 8:19 AM Rion Williams 
> >> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I was writing a very simple transform in Kotlin as follows that takes in
> >>> a series of integers and applies a simply DoFn against them:
> >>>
> >>> pipeline
> >>>     .apply(Create.of(1, 2, 3))
> >>>     .apply(ParDo.of(object : DoFn<Int, Int>() {  // output type chosen arbitrarily for the example
> >>>         @ProcessElement
> >>>         fun processElement(@Element element: Int) {
> >>>             // Omitted for brevity
> >>>         }
> >>>     })
> >>>     )
> >>>
> >>> The issue seems to arise when we use the `@Element` attribute on the
> >>> element which fails with the following error:
> >>>
> >>> Exception in thread "main" java.lang.IllegalArgumentException: Type of
> >>> @Element must match the DoFn typeCreate.Values/Read(CreateSource).out
> >>> [PCollection]
> >>>
> >>> Basically, it seems that the use of the `@Element` attribute isn't able
> >>> to properly decode or recognize the Kotlin `Int`, however if we adjust the
> >>> DoFn to instead use the 

dealing with late data output timestamps

2020-05-28 Thread David Morávek
Hi,

I've come across "unexpected" model behaviour when dealing with late data
and custom timestamp combiners. Let's take the following pipeline as an
example:

final PCollection input = ...;
input.apply(
    "GlobalWindows",
    Window.into(new GlobalWindows())
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(10))))
        .withTimestampCombiner(TimestampCombiner.LATEST)
        .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_IF_NON_EMPTY)
        .accumulatingFiredPanes())
    .apply("Aggregate", Count.perElement());

The above pipeline emits updates with the latest input timestamp it has
seen so far (from non-late elements). We write the output to Kafka with
this timestamp and read it from another pipeline.
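
For readers more familiar with the Python SDK, roughly the same window
configuration could be written as follows (a sketch only; Python's
WindowInto does not expose an equivalent of OnTimeBehavior, so that part is
omitted):

  import apache_beam as beam
  from apache_beam.transforms import trigger, window

  # input_pc stands for the same input PCollection as above.
  windowed = input_pc | beam.WindowInto(
      window.GlobalWindows(),
      trigger=trigger.AfterWatermark(
          early=trigger.AfterProcessingTime(10)),
      accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
      timestamp_combiner=window.TimestampCombiner.OUTPUT_AT_LATEST)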

The problem comes when we need to handle late elements behind the output
watermark. In this case Beam cannot use the combined timestamp and uses the
EOW timestamp instead. Unfortunately this results in the downstream
pipeline progressing its input watermark to the end of the global window.
Also, if we used fixed windows after this aggregation, it would yield
unexpected results.

There is no reasoning about this behaviour in the last section of the
lateness design doc [1], so I'd like to open a discussion about what the
expected result should be.

My personal opinion is that the correct approach would be to emit late
elements with the currentOutputWatermark rather than EOW in the case of the
EARLIEST and LATEST timestamp combiners.

I've prepared a failing test case for ReduceFnRunner [3], if anyone wants
to play around with the issue.

I also think that BEAM-2262 [2] may be related to this discussion.

[1] https://s.apache.org/beam-lateness
[2] https://issues.apache.org/jira/browse/BEAM-2262
[3]
https://github.com/dmvk/beam/commit/c93cd26681aa6fbc83c15d2a7a8146287f1e850b

Looking forward to hearing your thoughts.

Thanks,
D.


Re: Semantic versioning

2020-05-28 Thread Reuven Lax
Most of those items are either in APIs marked @Experimental (the definition
of Experimental in Beam is that we can make breaking changes to the API) or
are changes in a specific runner - not the Beam API.

Reuven

On Thu, May 28, 2020 at 7:19 AM Ashwin Ramaswami 
wrote:

> There's a "Breaking Changes" section on this blogpost:
> https://beam.apache.org/blog/beam-2.21.0/ (and really, for earlier minor
> versions too)
>
> Ashwin Ramaswami
> Student
> *Find me on my:* LinkedIn  | Website
>  | GitHub 
>
>
> On Thu, May 28, 2020 at 10:01 AM Reuven Lax  wrote:
>
>> What did we break?
>>
>> On Thu, May 28, 2020, 6:31 AM Ashwin Ramaswami 
>> wrote:
>>
>>> Do we really use semantic versioning? It appears we introduced breaking
>>> changes from 2.20.0 -> 2.21.0. If not, we should update the documentation
>>> under "API Stability" on this page:
>>> https://beam.apache.org/get-started/downloads/
>>>
>>> What would be a better way to word the way in which we decide version
>>> numbering?
>>>
>>


Re: Semantic versioning

2020-05-28 Thread Ashwin Ramaswami
There's a "Breaking Changes" section on this blogpost:
https://beam.apache.org/blog/beam-2.21.0/ (and really, for earlier minor
versions too)

Ashwin Ramaswami
Student
*Find me on my:* LinkedIn  | Website
 | GitHub 


On Thu, May 28, 2020 at 10:01 AM Reuven Lax  wrote:

> What did we break?
>
> On Thu, May 28, 2020, 6:31 AM Ashwin Ramaswami 
> wrote:
>
>> Do we really use semantic versioning? It appears we introduced breaking
>> changes from 2.20.0 -> 2.21.0. If not, we should update the documentation
>> under "API Stability" on this page:
>> https://beam.apache.org/get-started/downloads/
>>
>> What would be a better way to word the way in which we decide version
>> numbering?
>>
>


Re: Semantic versioning

2020-05-28 Thread Reuven Lax
What did we break?

On Thu, May 28, 2020, 6:31 AM Ashwin Ramaswami 
wrote:

> Do we really use semantic versioning? It appears we introduced breaking
> changes from 2.20.0 -> 2.21.0. If not, we should update the documentation
> under "API Stability" on this page:
> https://beam.apache.org/get-started/downloads/
>
> What would be a better way to word the way in which we decide version
> numbering?
>


Semantic versioning

2020-05-28 Thread Ashwin Ramaswami
Do we really use semantic versioning? It appears we introduced breaking changes 
from 2.20.0 -> 2.21.0. If not, we should update the documentation under "API 
Stability" on this page: https://beam.apache.org/get-started/downloads/

What would be a better way to word the way in which we decide version numbering?


[ANNOUNCE] Beam 2.21.0 Released

2020-05-28 Thread Kyle Weaver
The Apache Beam team is pleased to announce the release of version 2.21.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on
the Beam blog: https://beam.apache.org/blog/beam-2.21.0/

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.21.0.
-- Kyle Weaver, on behalf of The Apache Beam team


Re: What's the purpose of version=2.20.0-RC2 in gradle.properties?

2020-05-28 Thread Maximilian Michels
> I would expect the release branch to have the next -SNAPSHOT version (not the 
> case currently):

Why would the release branch have the next version? It is created for
the sole purpose of releasing the current version. For example, the
release branch for 2.21.0 would have the version 2.21.0-SNAPSHOT. If we
were to release 2.21.1 or 2.22.0, we would create a new branch where the
same logic applies.

The release branch having a -SNAPSHOT version makes perfect sense
because it is a snapshot of what is going to be released (still subject
to changes). Contrary to what I said before, I don't think we should
remove the snapshot suffix from the release branch.

However, as pointed out, the source release and its tag should have a
non-snapshot version.

-Max

On 27.05.20 05:02, Thomas Weise wrote:
> 
> 
> I think the "set_version.sh" script could be called in the release
> scripts to remove the -SNAPSHOT suffix on the release branch.
> 
> 
> I would expect the release branch to have the next -SNAPSHOT version
> (not the case currently):
> 
> https://github.com/apache/beam/blob/release-2.20.0/gradle.properties#L26
> 
> Release tag and the source archive should have the actually released
> version (not -RC):
> 
> https://github.com/apache/beam/blob/v2.20.0/gradle.properties#L26
> 
> 
>  
> 
> Btw, in case you haven't seen it, here is our release guide:
> https://beam.apache.org/contribute/release-guide/
> 
> -Max
> 
> On 26.05.20 19:02, Jacek Laskowski wrote:
> > Hi Max,
> >
> >> I think you bring up a good point, for the sake of release build
> > reproducibility, we may want to remove the snapshot suffix for the
> > source release.
> >
> > Wish I could be as clear as yourself with this. Yes, that's what I've
> > been bothered about. Is there a JIRA issue for this already? I've
> never
> > been good at releases but certainly could help a bit here and there
> > since I'm interested in having reproducible builds (from the tags).
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > "The Internals Of" Online Books 
> > Follow me on https://twitter.com/jaceklaskowski
> >
> > 
> >
> >
> > On Tue, May 26, 2020 at 5:37 PM Maximilian Michels wrote:
> >
> >     If you really want to work with the source code, I'd recommend
> using the
> >     released source code:
> >     https://beam.apache.org/get-started/downloads/#releases
> >
> >     Even there the version in gradle.properties says
> x.y.z-SNAPSHOT. You may
> >     want to remove the -SNAPSHOT suffix. I understand that this is
> confusing
> >     but that's how our release tooling currently works; it removes the
> >     snapshot suffix during publishing the artifacts.
> >
> >     I think you bring up a good point, for the sake of release build
> >     reproducibility, we may want to remove the snapshot suffix for the
> >     source release.
> >
> >     Best,
> >     Max
> >
> >     On 26.05.20 17:20, Kyle Weaver wrote:
> >     >> When we release the version, the RC suffix is dropped.
> >     >
> >     > I think this might not actually be true, at least for the
> git tag,
> >     since
> >     > we just copy the tag from the accepted RC without changing
> anything.
> >     > However, it might not matter because RC2 artifacts should be
> identical
> >     > to the final release artifacts.
> >     >
> >     >> In other words, how to check out the sources of Beam 2.20.0
> and build
> >     > them to get the released artifacts?
> >     >
> >     > As Max said, we build and publish artifacts (Jars, Docker
> containers,
> >     > Python wheels, etc.) for each release, so it usually isn't
> >     necessary to
> >     > build them oneself unless you are testing on head or other
> >     unreleased code.
> >     >
> >     > On Tue, May 26, 2020 at 6:02 AM Jacek Laskowski wrote:
> >     >
> >     >     Hi Max,
> >     >
> >     >     > You probably want to work with the release artifacts,
> instead of
> >     >     cloning
> >     >     > the development branch.
> >     >
> >     >     I'm not sure I understand.
> >     >
> >     >     I did the following to work with the sources of v2.20.0. Am
> >     >     I missing something?
> >     >
> >     >     git fetch --all --tags --prune
> >     >     git checkout -b v2.20.0 v2.20.0
> >     >