Re: [YAML] Aggregations

2023-10-18 Thread Robert Burke
MongoDB has its own concept of aggregation pipelines as well.

https://www.mongodb.com/docs/manual/core/aggregation-pipeline/#std-label-aggregation-pipeline


On Wed, Oct 18, 2023, 6:07 PM Robert Bradshaw via dev 
wrote:

> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis  wrote:
> >
> > Is it worth taking a look at similar prior art in the space?
>
> +1. Pointers welcome.
>
> > The first one that comes to mind is Transform, but with the dbt labs
> acquisition that spec is a lot harder to find. Rill is pretty similar
> though.
>
> Rill seems to be very SQL-based.
>
> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in favor of more schema'd
> >> data (which has some precedent in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >> total_cost:
> >>   fn: sum
> >>   value: cost
> >> max_cost:
> >>   fn: max
> >>   value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non-aggregating)
> >> Grouping operation; this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does). This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >> n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterization of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
>


Re: [YAML] Aggregations

2023-10-18 Thread Robert Bradshaw via dev
On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis  wrote:
>
> Is it worth taking a look at similar prior art in the space?

+1. Pointers welcome.

> The first one that comes to mind is Transform, but with the dbt labs 
> acquisition that spec is a lot harder to find. Rill is pretty similar though.

Rill seems to be very SQL-based.

> On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev  
> wrote:
>>
>> Beam Yaml has good support for IOs and mappings, but one key missing
>> feature for even writing a WordCount is the ability to do Aggregations
>> [1]. While the traditional Beam primitive is GroupByKey (and
>> CombineValues), we're eschewing KVs in favor of more schema'd
>> data (which has some precedent in our other languages, see the links
>> below). The key components the user needs to specify are (1) the key
>> fields on which the grouping will take place, (2) the fields
>> (expressions?) involved in the aggregation, and (3) what aggregating
>> fn to use.
>>
>> A straw-man example could be something like
>>
>> type: Aggregating
>> config:
>>   key: [field1, field2]
>>   aggregating:
>> total_cost:
>>   fn: sum
>>   value: cost
>> max_cost:
>>   fn: max
>>   value: cost
>>
>> This would basically correspond to the SQL expression
>>
>> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
>> from table GROUP BY field1, field2"
>>
>> (though I'm not requiring that we use this as an implementation
>> strategy). I do not think we need a separate (non-aggregating)
>> Grouping operation; this can be accomplished by having a concat-style
>> combiner.
>>
>> There are still some open questions here, notably around how to
>> specify the aggregation fns themselves. We could of course provide a
>> number of built-ins (like SQL does). This gets into the question of
>> how and where to document this complete set, but some basics should
>> take us pretty far. Many aggregators, however, are parameterized (e.g.
>> quantiles); where do we put the parameters? We could go with something
>> like
>>
>> fn:
>>   type: ApproximateQuantiles
>>   config:
>> n: 10
>>
>> but others are even configured by functions themselves (e.g. LargestN
>> that wants a comparator Fn). Maybe we decide not to support these
>> (yet?)
>>
>> One thing I think we should support, however, is referencing custom
>> CombineFns. We have some precedent for this with our Fns from
>> MapToFields, where we accept things like inline lambdas and external
>> references. Again the topic of how to configure them comes up, as
>> these custom Fns are more likely to be parameterized than Map Fns
>> (though, to be clear, perhaps it'd be good to allow parameterization of
>> MapFns as well). Maybe we allow
>>
>> language: python  # like MapToFields (and here it'd be harder to mix
>> and match per Fn)
>> fn:
>>   type: ???
>>   # should these be nested as config?
>>   name: fully.qualified.name
>>   path: /path/to/defining/file
>>   args: [...]
>>   kwargs: {...}
>>
>> which would invoke the constructor.
>>
>> I'm also open to other ways of naming/structuring these essential
>> parameters if it makes things more clear.
>>
>> - Robert
>>
>>
>> Java: 
>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
>> Python: 
>> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
>> Typescript: 
>> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>>
>> [1] One can of course use SqlTransform for this, but I'm leaning
>> towards offering something more native.


Re: [YAML] Aggregations

2023-10-18 Thread Byron Ellis via dev
Is it worth taking a look at similar prior art in the space? The first one
that comes to mind is Transform, but with the dbt labs acquisition that
spec is a lot harder to find. Rill is pretty similar though.

On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev 
wrote:

> Beam Yaml has good support for IOs and mappings, but one key missing
> feature for even writing a WordCount is the ability to do Aggregations
> [1]. While the traditional Beam primitive is GroupByKey (and
> CombineValues), we're eschewing KVs in favor of more schema'd
> data (which has some precedent in our other languages, see the links
> below). The key components the user needs to specify are (1) the key
> fields on which the grouping will take place, (2) the fields
> (expressions?) involved in the aggregation, and (3) what aggregating
> fn to use.
>
> A straw-man example could be something like
>
> type: Aggregating
> config:
>   key: [field1, field2]
>   aggregating:
> total_cost:
>   fn: sum
>   value: cost
> max_cost:
>   fn: max
>   value: cost
>
> This would basically correspond to the SQL expression
>
> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> from table GROUP BY field1, field2"
>
> (though I'm not requiring that we use this as an implementation
> strategy). I do not think we need a separate (non-aggregating)
> Grouping operation; this can be accomplished by having a concat-style
> combiner.
>
> There are still some open questions here, notably around how to
> specify the aggregation fns themselves. We could of course provide a
> number of built-ins (like SQL does). This gets into the question of
> how and where to document this complete set, but some basics should
> take us pretty far. Many aggregators, however, are parameterized (e.g.
> quantiles); where do we put the parameters? We could go with something
> like
>
> fn:
>   type: ApproximateQuantiles
>   config:
> n: 10
>
> but others are even configured by functions themselves (e.g. LargestN
> that wants a comparator Fn). Maybe we decide not to support these
> (yet?)
>
> One thing I think we should support, however, is referencing custom
> CombineFns. We have some precedent for this with our Fns from
> MapToFields, where we accept things like inline lambdas and external
> references. Again the topic of how to configure them comes up, as
> these custom Fns are more likely to be parameterized than Map Fns
> (though, to be clear, perhaps it'd be good to allow parameterization of
> MapFns as well). Maybe we allow
>
> language: python  # like MapToFields (and here it'd be harder to mix
> and match per Fn)
> fn:
>   type: ???
>   # should these be nested as config?
>   name: fully.qualified.name
>   path: /path/to/defining/file
>   args: [...]
>   kwargs: {...}
>
> which would invoke the constructor.
>
> I'm also open to other ways of naming/structuring these essential
> parameters if it makes things more clear.
>
> - Robert
>
>
> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>
> [1] One can of course use SqlTransform for this, but I'm leaning
> towards offering something more native.
>


Re: [NOTICE] Deprecation Avro classes in "core" and use "extensions/avro" instead for Java SDK

2023-10-18 Thread Byron Ellis via dev
Awesome!

On Wed, Oct 18, 2023 at 1:14 PM Alexey Romanenko 
wrote:

> Heads up!
>
> Finally, all Avro-related code and the Avro dependency that were deprecated
> before (see the message above) have been removed from the Beam Java SDK “core”
> module [1]. We believe that a sufficient number of Beam releases (six!)
> have passed since this code was deprecated, and users have had an
> opportunity to switch to the new Avro extension as recommended before.
>
> We did our best to make this transition as smooth as possible but, please,
> let me know if you find any failing tests or any other strange behavior
> because of this change.
>
> Thanks,
> Alexey
>
>
> [1] https://github.com/apache/beam/pull/27851/
>
>
> On 22 Feb 2023, at 20:21, Robert Bradshaw via dev 
> wrote:
>
> Thanks for pushing this through!
>
> On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko
>  wrote:
>
>
> Hi all,
>
> As part of migrating the Avro-related classes from the Java SDK “core”
> module to a dedicated extension [1] (as discussed here [2] and here
> [3]), two important PRs have been merged [4][5]. As a result, the old
> Avro-related classes are deprecated in “core” (still possible to use but not
> recommended) and all other Beam modules that depended on them have switched
> to use “extensions/avro” instead.
>
> We did our best to make this change smooth, compatible, and non-breaking
> but, since this was one of the oldest parts of “core”, we may,
> unfortunately, have missed something despite all our efforts. With that in
> mind, I’d like to ask the community to run any tests or pipelines that use,
> for example, AvroCoder or AvroUtils or any other related Avro classes, and
> check that the new changes don’t break anything and everything works as
> expected.
>
> —
> Alexey
>
> [1] https://github.com/apache/beam/issues/24292
> [2] https://lists.apache.org/thread/mz8hvz8dwhd0tzmv2lyobhlz7gtg4gq7
> [3] https://lists.apache.org/thread/47oz1mlwj0orvo1458v5pw5c20bwt08q
> [4] https://github.com/apache/beam/pull/24992
> [5] https://github.com/apache/beam/pull/25534
>
>
>
>


Re: [NOTICE] Deprecation Avro classes in "core" and use "extensions/avro" instead for Java SDK

2023-10-18 Thread Alexey Romanenko
Heads up!

Finally, all Avro-related code and the Avro dependency that were deprecated 
before (see the message above) have been removed from the Beam Java SDK “core” 
module [1]. We believe that a sufficient number of Beam releases (six!) have 
passed since this code was deprecated, and users have had an opportunity to 
switch to the new Avro extension as recommended before.

We did our best to make this transition as smooth as possible but, please, let 
me know if you find any failing tests or any other strange behavior because of 
this change.

Thanks,
Alexey


[1] https://github.com/apache/beam/pull/27851/


> On 22 Feb 2023, at 20:21, Robert Bradshaw via dev  wrote:
> 
> Thanks for pushing this through!
> 
> On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko
>  wrote:
>> 
>> Hi all,
>> 
>> As part of migrating the Avro-related classes from the Java SDK “core” module 
>> to a dedicated extension [1] (as discussed here [2] and here [3]), 
>> two important PRs have been merged [4][5]. As a result, the old Avro-related 
>> classes are deprecated in “core” (still possible to use but not 
>> recommended) and all other Beam modules that depended on them have switched to 
>> use “extensions/avro” instead.
>> 
>> We did our best to make this change smooth, compatible, and non-breaking but, 
>> since this was one of the oldest parts of “core”, we may, unfortunately, have 
>> missed something despite all our efforts. With that in mind, I’d like to ask 
>> the community to run any tests or pipelines that use, for example, AvroCoder 
>> or AvroUtils or any other related Avro classes, and check that the new changes 
>> don’t break anything and everything works as expected.
>> 
>> —
>> Alexey
>> 
>> [1] https://github.com/apache/beam/issues/24292
>> [2] https://lists.apache.org/thread/mz8hvz8dwhd0tzmv2lyobhlz7gtg4gq7
>> [3] https://lists.apache.org/thread/47oz1mlwj0orvo1458v5pw5c20bwt08q
>> [4] https://github.com/apache/beam/pull/24992
>> [5] https://github.com/apache/beam/pull/25534
>> 
>> 



[YAML] Aggregations

2023-10-18 Thread Robert Bradshaw via dev
Beam Yaml has good support for IOs and mappings, but one key missing
feature for even writing a WordCount is the ability to do Aggregations
[1]. While the traditional Beam primitive is GroupByKey (and
CombineValues), we're eschewing KVs in favor of more schema'd
data (which has some precedent in our other languages, see the links
below). The key components the user needs to specify are (1) the key
fields on which the grouping will take place, (2) the fields
(expressions?) involved in the aggregation, and (3) what aggregating
fn to use.

A straw-man example could be something like

type: Aggregating
config:
  key: [field1, field2]
  aggregating:
total_cost:
  fn: sum
  value: cost
max_cost:
  fn: max
  value: cost

This would basically correspond to the SQL expression

"SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
from table GROUP BY field1, field2"

(though I'm not requiring that we use this as an implementation
strategy). I do not think we need a separate (non-aggregating)
Grouping operation; this can be accomplished by having a concat-style
combiner.
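To make the intended semantics concrete, here is a minimal pure-Python sketch of what the straw-man `Aggregating` config above would compute over a batch of rows. This is illustrative only — the function and registry names are hypothetical, and a real Beam implementation would use incremental, parallelizable CombineFns rather than materializing each group:

```python
from collections import defaultdict

# Hypothetical built-in aggregation fns, keyed by the names used in the config.
# A concat-style combiner ("concat") doubles as a plain, non-aggregating Grouping.
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "concat": list,
}

def aggregate(rows, key, aggregating):
    """Simulate the straw-man Aggregating transform on a list of dict rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    out = []
    for key_values, group in groups.items():
        result = dict(zip(key, key_values))
        for out_field, spec in aggregating.items():
            values = [row[spec["value"]] for row in group]
            result[out_field] = AGGREGATORS[spec["fn"]](values)
        out.append(result)
    return out

rows = [
    {"field1": "a", "field2": 1, "cost": 2.0},
    {"field1": "a", "field2": 1, "cost": 3.0},
    {"field1": "b", "field2": 2, "cost": 5.0},
]
result = aggregate(rows, key=["field1", "field2"], aggregating={
    "total_cost": {"fn": "sum", "value": "cost"},
    "max_cost": {"fn": "max", "value": "cost"},
})
print(result)  # one output row per distinct (field1, field2) pair
```

This mirrors the SQL above: one output row per distinct (field1, field2), carrying the key fields plus the named aggregates.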

There are still some open questions here, notably around how to
specify the aggregation fns themselves. We could of course provide a
number of built-ins (like SQL does). This gets into the question of
how and where to document this complete set, but some basics should
take us pretty far. Many aggregators, however, are parameterized (e.g.
quantiles); where do we put the parameters? We could go with something
like

fn:
  type: ApproximateQuantiles
  config:
n: 10

but others are even configured by functions themselves (e.g. LargestN
that wants a comparator Fn). Maybe we decide not to support these
(yet?)
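One way the parameterized case could be resolved is a registry of factories, where a bare string names a zero-config built-in and a mapping supplies the config. This is a sketch under assumed names — and `ApproximateQuantiles` here is an exact stand-in; the real thing would use a bounded-memory sketching algorithm:

```python
# Hypothetical factory registry: each entry takes config kwargs and
# returns a callable that reduces a list of values to a result.
FACTORIES = {
    "sum": lambda **cfg: sum,
    "max": lambda **cfg: max,
    # Exact quantiles as an illustrative stand-in for ApproximateQuantiles.
    "ApproximateQuantiles": lambda n: (
        lambda values: [sorted(values)[i * (len(values) - 1) // (n - 1)]
                        for i in range(n)]
    ),
}

def resolve_fn(spec):
    """Accept either a bare name ('sum') or {'type': ..., 'config': {...}}."""
    if isinstance(spec, str):
        return FACTORIES[spec]()
    return FACTORIES[spec["type"]](**spec.get("config", {}))

quantiles = resolve_fn({"type": "ApproximateQuantiles", "config": {"n": 3}})
print(quantiles([5, 1, 9, 3, 7]))  # → [1, 5, 9]  (min, median, max)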

One thing I think we should support, however, is referencing custom
CombineFns. We have some precedent for this with our Fns from
MapToFields, where we accept things like inline lambdas and external
references. Again the topic of how to configure them comes up, as
these custom Fns are more likely to be parameterized than Map Fns
(though, to be clear, perhaps it'd be good to allow parameterization of
MapFns as well). Maybe we allow

language: python  # like MapToFields (and here it'd be harder to mix
and match per Fn)
fn:
  type: ???
  # should these be nested as config?
  name: fully.qualified.name
  path: /path/to/defining/file
  args: [...]
  kwargs: {...}

which would invoke the constructor.
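For the external-reference case, one plausible resolution mechanism is importlib on the fully qualified name, passing args/kwargs to the constructor. A sketch only — the surrounding `type`/`path` questions above remain open, and the `path` (load-from-file) variant is omitted:

```python
import importlib

def resolve_custom_fn(name, args=(), kwargs=None):
    """Import 'pkg.module.ClassName' and invoke its constructor."""
    module_name, _, attr = name.rpartition(".")
    cls = getattr(importlib.import_module(module_name), attr)
    return cls(*args, **(kwargs or {}))

# Demonstrated with a stdlib class; a custom CombineFn reference would
# resolve the same way.
counter = resolve_custom_fn("collections.Counter", args=("aggregating",))
print(counter.most_common(1))  # → [('g', 4)]
```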

I'm also open to other ways of naming/structuring these essential
parameters if it makes things more clear.

- Robert


Java: 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
Python: 
https://beam.apache.org/documentation/transforms/python/aggregation/groupby
Typescript: 
https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html

[1] One can of course use SqlTransform for this, but I'm leaning
towards offering something more native.


Re: [Discuss] Idea to increase RC voting participation

2023-10-18 Thread Robert Bradshaw via dev
+1 That's a great idea. They have incentive to make sure the issue was
resolved for them, plus we get to ensure there were no other regressions.

On Wed, Oct 18, 2023 at 11:30 AM Johanna Öjeling via dev <
dev@beam.apache.org> wrote:

> When I have contributed to Apache Airflow, they have tagged all
> contributors concerned in a GitHub issue when the RC is available and asked
> us to validate it. Example: #29424.
>
> I found that to be an effective way to notify contributors of the RC and
> nudge them to help out. In the issue description there is a reference to
> the guidelines on how to test the RC and a note that people are encouraged
> to vote on the mailing list (which could admittedly be more highlighted
> because I did not pay attention to it until now and was unaware that
> contributors had a vote).
>
> It might be an idea to consider something similar here to increase the
> participation?
>
> On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> I'm +1 on helping explain what we mean by "validate the RC" since we're
>> really just asking users to see if their existing use cases work along with
>> our typical slate of tests. I don't know if offloading that work to our
>> active validators is the right approach though, documentation/screen share
>> of their specific workflow is definitely less useful than having a more
>> general outline of how to install the RC and things to look out for when
>> testing.
>>
>> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett 
>> wrote:
>>
>>> Great effort.  I'm also interested in streamlining releases -- so if
>>> there are a lot of manual tests that could be automated, it would be great
>>> to discover them and then look to address that.
>>>
>>> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1

 I would also strongly suggest that people try out the release against
 their own codebases. This has the benefit of ensuring the release won't
 break your own code when they go out, and stress-tests the new code against
 real-world pipelines. (Ideally our own tests are all passing, and this
 validation is automated as much as possible (though ensuring it matches our
 documentation and works in a clean environment still has value), but
 there's a lot of code and uses out there that we don't have access to
 during normal Beam development.)

 On Tue, Oct 17, 2023 at 8:21 AM Svetak Sundhar via dev <
 dev@beam.apache.org> wrote:

> Hi all,
>
> I’ve participated in RC testing for a few releases and have observed a
> bit of a knowledge gap in how releases can be tested. Given that Beam
> encourages contributors to vote on RC’s regardless of tenure, and that
> voting on an RC is a relatively low-effort, high leverage way to influence
> the release of the library, I propose the following:
>
> During the vote for the next release, voters can document the process
> they followed on a separate document, and add the link on column G here.
> One step further, could be a screencast of running the test, and attaching
> a link of that.
>
> We can keep repeating this through releases until we have
> documentation for many of the different tests. We can then add these docs
> into the repo.
>
> I’m proposing this because I’ve gathered the following feedback from
> colleagues that are tangentially involved with Beam: They are interested 
> in
> participating in release validation, but don’t know how to get started.
> Happy to hear other suggestions too, if there are any to address the
> above.
>
> Thanks,
>
>
> Svetak Sundhar
>
>   Data Engineer
> s vetaksund...@google.com
>
>


Re: [Discuss] Idea to increase RC voting participation

2023-10-18 Thread Johanna Öjeling via dev
When I have contributed to Apache Airflow, they have tagged all
contributors concerned in a GitHub issue when the RC is available and asked
us to validate it. Example: #29424.

I found that to be an effective way to notify contributors of the RC and
nudge them to help out. In the issue description there is a reference to
the guidelines on how to test the RC and a note that people are encouraged
to vote on the mailing list (which could admittedly be more highlighted
because I did not pay attention to it until now and was unaware that
contributors had a vote).

It might be an idea to consider something similar here to increase the
participation?

On Tue, Oct 17, 2023 at 7:01 PM Jack McCluskey via dev 
wrote:

> I'm +1 on helping explain what we mean by "validate the RC" since we're
> really just asking users to see if their existing use cases work along with
> our typical slate of tests. I don't know if offloading that work to our
> active validators is the right approach though, documentation/screen share
> of their specific workflow is definitely less useful than having a more
> general outline of how to install the RC and things to look out for when
> testing.
>
> On Tue, Oct 17, 2023 at 12:55 PM Austin Bennett  wrote:
>
>> Great effort.  I'm also interested in streamlining releases -- so if
>> there are a lot of manual tests that could be automated, it would be great
>> to discover them and then look to address that.
>>
>> On Tue, Oct 17, 2023 at 8:47 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1
>>>
>>> I would also strongly suggest that people try out the release against
>>> their own codebases. This has the benefit of ensuring the release won't
>>> break your own code when they go out, and stress-tests the new code against
>>> real-world pipelines. (Ideally our own tests are all passing, and this
>>> validation is automated as much as possible (though ensuring it matches our
>>> documentation and works in a clean environment still has value), but
>>> there's a lot of code and uses out there that we don't have access to
>>> during normal Beam development.)
>>>
>>> On Tue, Oct 17, 2023 at 8:21 AM Svetak Sundhar via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Hi all,

 I’ve participated in RC testing for a few releases and have observed a
 bit of a knowledge gap in how releases can be tested. Given that Beam
 encourages contributors to vote on RC’s regardless of tenure, and that
 voting on an RC is a relatively low-effort, high leverage way to influence
 the release of the library, I propose the following:

 During the vote for the next release, voters can document the process
 they followed on a separate document, and add the link on column G here.
 One step further, could be a screencast of running the test, and attaching
 a link of that.

 We can keep repeating this through releases until we have documentation
 for many of the different tests. We can then add these docs into the repo.

 I’m proposing this because I’ve gathered the following feedback from
 colleagues that are tangentially involved with Beam: They are interested in
 participating in release validation, but don’t know how to get started.
 Happy to hear other suggestions too, if there are any to address the
 above.

 Thanks,


 Svetak Sundhar

   Data Engineer
 s vetaksund...@google.com




Re: [PR] Publish docs for 2.51.0 release [beam-site]

2023-10-18 Thread via GitHub


kennknowles merged PR #649:
URL: https://github.com/apache/beam-site/pull/649


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[ANNOUNCE] Apache Beam 2.51.0 Released

2023-10-18 Thread Kenneth Knowles
The Apache Beam Team is pleased to announce the release of version 2.51.0.

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed
on the Beam Blog: https://beam.apache.org/blog/beam-2.51.0/ and the
Github release page
https://github.com/apache/beam/releases/tag/v2.51.0

Thanks to everyone who contributed to this release, and we hope you
enjoy using Beam 2.51.0.

Kenn, on behalf of the Apache Beam Team.


Beam 2.52.0 Release

2023-10-18 Thread Danny McCormick via dev
Hey everyone, the next release (2.52.0) branch cut is scheduled for Nov 1,
2023, 2 weeks from today, according to the release calendar [1]. I'd like
to perform this release; I will cut the branch on that date, and cherrypick
release-blocking fixes afterwards, if any.


Please help with the release by:


- Making sure that any unresolved release blocking issues have their
"Milestone" marked as "2.52.0 Release" as soon as possible.

- Reviewing the current release blockers [2] and remove the Milestone if
they don't meet the criteria at [3]. There are currently 12 release
blockers.


Let me know if you have any comments/objections/questions.


Thanks,

Danny


[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com

[2] https://github.com/apache/beam/milestone/16

[3] https://beam.apache.org/contribute/release-blocking/


Beam High Priority Issue Report (42)

2023-10-18 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github 
actions tests are failing due to update of pip 
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28703 [Failing Test]: Building a wheel 
for integration tests sometimes times out
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26981 [Bug]: Getting an error related to 
SchemaCoder after upgrading to 2.48
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers

Re: KafkaIO does not make use of Kafka Consumer Groups [kafka] [java] [io]

2023-10-18 Thread Jan Lukavský

Hi,

my two cents on this. While it would be perfectly possible to use consumer 
groups in KafkaIO, doing so has its own issues. The most visible is that 
using subscriptions might introduce unnecessary duplicates in downstream 
processing. The reason is that a consumer in a consumer group might be 
reassigned partitions and/or reset to a different offset based on 
conditions that are out of the consumer's control. This can lead to an 
inability to successfully commit the offset of a bundle after it has been 
sent downstream, while the processed-but-not-yet-committed input element 
might be reprocessed by a different worker due to a partition rebalance. 
This adds unnecessary complexity with questionable benefits (observability 
of lag in a consumer group and possible automatic discovery of new 
partitions in a topic).
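The failure mode described above can be sketched with a toy model (plain Python; this is not KafkaIO or Kafka client code, and all names are invented for illustration). Worker A emits an element downstream, the partition is revoked before A commits, and the new assignee resumes from the last committed offset, so the element is emitted twice:

```python
# Toy model of rebalance-induced duplicates (not Kafka client code).
def run_workers(partition_log, rebalance_before_commit):
    """Process one partition with two workers; the partition moves from
    worker A to worker B after A emits its first element downstream."""
    emitted = []           # elements sent downstream (possibly duplicated)
    committed_offset = 0   # last offset committed on behalf of the "group"

    # Worker A reads offset 0 and sends it downstream...
    emitted.append(partition_log[0])
    if not rebalance_before_commit:
        committed_offset = 1
    # ...if the partition was revoked before A committed, the group's
    # offset is still 0, even though element 0 was already emitted.

    # Worker B takes over the partition and resumes from the committed
    # offset, reprocessing anything uncommitted.
    for offset in range(committed_offset, len(partition_log)):
        emitted.append(partition_log[offset])
    return emitted

print(run_workers(["a", "b"], rebalance_before_commit=False))  # ['a', 'b']
print(run_workers(["a", "b"], rebalance_before_commit=True))   # ['a', 'a', 'b']
```

This is exactly the at-least-once behavior a runner would have to deduplicate if it did not control partition assignment itself.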


For these reasons I'd say that it would be possible to introduce a 
different IO (e.g. KafkaConsumerGroupIO), which could be added to Beam 
itself or (perhaps) some extension, but it makes little sense to 
introduce this into KafkaIO itself.


Hope this helps,

 Jan

On 10/18/23 05:49, Shaojie Wu wrote:

Can't agree with Shahar Frank more

On 2023/04/19 18:17:15 Shahar Frank wrote:

Hi Daniel,

I think I've already answered these in a previous email but let me answer
them again.

I was specifically responding to quoted points from your last email. I
really don't understand why you, as a user, care if the implementation of
the framework is using consumer groups or not as long as it has the
throughput you need and is correct. If there is something specific this
would be useful for, like monitoring or metrics, it seems a reasonable
feature request to me to ask to reflect the progress state in a kafka
consumer group, but not to use the underlying assignment mechanism for the
reasons stated above.


I do care for a couple of reasons:
1) Introducing risk with a technology that no one in the company knows vs.
a technology people know and trust (i.e. Kafka Consumer Groups)
2) A multitude of alerting, monitoring and other observability tools that
use consumer groups would not be usable, and new solutions would be
required
3) Expert know-how on managing (and sometimes fixing) issues with Kafka in
the company would become useless - and this in turn introduces risk to
projects

If you want to run in a single machine application mode, you can try
setting the `flinkMaster` parameter to `[local]`, which should launch an
inline flink runner just for your pipeline. If you want to have a scaling
out cluster per-application, you can launch a repeatable flink cluster with
kubernetes on a per-application basis pretty easily.


I do agree that a Flink cluster is a great solution, and I have maintained
a few in my time.
Sadly, in our use case I have to consider constraints set by security and
platform teams, and that will take time.
By the time these follow through, it is very likely that the decision to
use Beam will have fallen in favor of other solutions (e.g. Kafka Streams,
Camel and others), and that would be a shame in my view. It is very
unlikely that, once taken, this decision would be reversed for a long time.

Given that a Flink cluster is not an option for me at this point, I have
been trying to push a solution where instances of a Beam pipeline are run
"disconnected" using a DirectRunner or (better yet) a FlinkRunner in local
standalone mode (as you're suggesting) - and, like you suggested, we are
running those using a K8s deployment, which allows us to scale up as
required.
The issue is that if more than one pod attempts to run the pipeline, they
will not split the partitions between them; rather, each will consume ALL
partitions, and the output will include as many duplicates as the number
of pods. This solution therefore cannot scale horizontally.
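The scaling problem above can be illustrated with a small sketch (plain Python; not Kafka client code, names invented for illustration): N independent pods that each explicitly assign all partitions all read everything, whereas a consumer group gives each partition to exactly one member:

```python
# Toy contrast between independent pods assigned all partitions and a
# consumer group splitting partitions among members (not Kafka client code).
def assign_all(partitions, num_pods):
    # Each pod runs its own pipeline and independently reads every
    # partition, so every element is consumed num_pods times.
    return [list(partitions) for _ in range(num_pods)]

def consumer_group(partitions, num_pods):
    # A group coordinator gives each partition to exactly one member
    # (simple round-robin here, purely for illustration).
    assignment = [[] for _ in range(num_pods)]
    for i, partition in enumerate(partitions):
        assignment[i % num_pods].append(partition)
    return assignment

partitions = [0, 1, 2, 3]
print(assign_all(partitions, 2))      # [[0, 1, 2, 3], [0, 1, 2, 3]]  duplicates
print(consumer_group(partitions, 2))  # [[0, 2], [1, 3]]  split, no duplicates
```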

That is exactly why I'm trying to suggest using consumer groups.
In the attempt I created - here - I've already shown that it is possible
(albeit, I admit, with limitations such as you described) to use consumer
groups and effectively allow our use case to run on a scaled-up K8s
deployment of DirectRunners.

And finally, again, my question is: why should Kafka be treated differently
from other messaging systems like SQS and PubSub, for which it seems Beam
does not attempt to manage the distribution strategy or the mechanism for
tracking processed (committed) messages?

If Beam is able to perform well with those systems managing these concerns,
couldn't the same be applied to Kafka?

Cheers,
Shahar.

--

Shahar Frank

srf...@gmail.com

+447799561438


On Wed, 19 Apr 2023 at 13:19, Daniel Collins  wrote:


Hello,

I was specifically responding to quoted points from your last email. I
really don't understand why you, as a user, care if the implementation of
the framework is using consumer groups or no