date:20240208

Re: Playground: File Explorer?

2024-02-08 Thread Robert Burke

I think in principle we could update the release process to do that, but it
would require adjusting how we build the staging version of the playground
to accommodate how each SDK handles RCs.

At present it's very geared towards building from released versions.

On Thu, Feb 8, 2024, 9:18 AM Joey Tran  wrote:

> Ah that makes sense. Does the new version of Playground get staged for
> release validation?
>
> On Thu, Feb 8, 2024 at 12:08 PM Robert Burke  wrote:
>
>> We redeploy the playground along with the release, so once 2.54.0 RC2 has
>> been validated and voted on, I'll be redeploying it with 2.54.0.
>>
>> On Thu, Feb 8, 2024, 7:18 AM Joey Tran  wrote:
>>
>>> Here's two:
>>>
>>> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
>>> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>>>
>>> Also, how often does playground get redeployed? I put up a PR[1] that's
>>> been merged for try to reduce the amount of logging these examples produce
>>> and I'm not sure if it's not working or if playground just hasn't been
>>> redeployed in the last month or so
>>>
>>> [1] https://github.com/apache/beam/pull/29948
>>>
>>>
>>> On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev 
>>> wrote:
>>>
 Can you provide which example you are referring to? I checked a few
 examples and usually we use beam.Map(print) to display some output values.

 On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
 wrote:

> Hey all,
>
> I've been really trying to use Playground for educating new Beam users
> but it feels like there's something missing. A lot of examples (e.g.
> Multiple ParDo Outputs) for at least the python API don't seem to do
> anything observable. For example, the Multiple ParDo Outputs example 
> writes
> to a file but is there any way to actually look at written out files? I
> feel like maybe I'm missing something.
>
> Best,
> Joey
>

Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-08 Thread Svetak Sundhar via dev

+1 (Non-Binding)

Tested with Python SDK on DirectRunner and Dataflow Runner


Svetak Sundhar

  Data Engineer
s vetaksund...@google.com



On Thu, Feb 8, 2024 at 12:45 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> +1 (binding)
>
> Tried out Java/Python multi-lang jobs and upgrading BQ/Kafka transforms
> from 2.53.0 to 2.54.0 using the Transform Service.
>
> Thanks,
> Cham
>
> On Wed, Feb 7, 2024 at 5:52 PM XQ Hu via dev  wrote:
>
>> +1 (non-binding)
>>
>> Validated with a simple RunInference Python pipeline:
>> https://github.com/google/dataflow-ml-starter/actions/runs/7821639833/job/21339032997
>>
>> On Wed, Feb 7, 2024 at 7:10 PM Yi Hu via dev  wrote:
>>
>>> +1 (non-binding)
>>>
>>> Validated with Dataflow Template:
>>> https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1317
>>>
>>> Regards,
>>>
>>> On Wed, Feb 7, 2024 at 11:18 AM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 (non-binding)

 Ran a few batch and streaming examples for Python SDK on Dataflow Runner

 Thanks!

 On Wed, Feb 7, 2024 at 4:08 AM Jan Lukavský  wrote:

> +1 (binding)
>
> Validated Java SDK with Flink runner.
>
>  Jan
> On 2/7/24 06:23, Robert Burke via dev wrote:
>
> Hi everyone,
> Please review and vote on the release candidate #2 for the version
> 2.54.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if
> no issues are found. Only PMC member votes will count towards the final
> vote, but votes from all
> community members is encouraged and helpful for finding regressions;
> you
> can either test your own
> use cases [13] or use cases from the validation sheet [10].
>
> The complete staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2],
> which is signed with the key with fingerprint D20316F712213422 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.54.0-RC2" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> * Go artifacts and documentation are available at pkg.go.dev [9]
> * Validation sheet with a tab for 2.54.0 release to help with
> validation
> [10].
> * Docker images published to Docker Hub [11].
> * PR to run tests against release branch [12].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> For guidelines on how to try the release in your projects, check out
> our RC
> testing guide [13].
>
> Thanks,
> Robert Burke
> Beam 2.54.0 Release Manager
>
> [1] https://github.com/apache/beam/milestone/18?closed=1
> [2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1368/
> [5] https://github.com/apache/beam/tree/v2.54.0-RC2
> [6] https://github.com/apache/beam/pull/30201
> [7] https://github.com/apache/beam-site/pull/659
> [8] https://pypi.org/project/apache-beam/2.54.0rc2/
> [9]
>
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.54.0-RC2/go/pkg/beam
> [10]
>
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=28763708
> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
> [12] https://github.com/apache/beam/pull/30104
> [13]
>
> https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md
>
>

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-02-08 Thread Robert Burke

Was that only October? Wow.

Option 2 SGTM, with the adjustment to making the core of the URN
"redistribute_allowing_duplicates" instead of building from the unspecified
Reshuffle semantics.

Transforms getting updated to use the new transform can have their
@RequiresStableInputs annotation added  accordingly if they need that
property per previous discussions.



On Thu, Feb 8, 2024, 10:31 AM Kenneth Knowles  wrote:

>
>
> On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:
>
>> OK, so my stance is a configurable Reshuffle might be interesting, so my
>> vote is +1, along the following lines.
>>
>> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
>> ReshufflePayload to it.
>>
>
> Ah, I see there's more than one variation of the "new URN" approach.
> Namely, you have a new version of an existing URN prefix, while I had in
> mind that it was a totally new base URN. In other words the open question I
> meant to pose is between these options:
>
> 1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
> 2. beam:transform:reshuffle_allowing_duplicates:v1 {}
>
> The most compelling argument in favor of option 2 is that it could have a
> distinct payload type associated with the different URN (maybe parameters
> around tweaking how much duplication? I don't know... I actually expect
> neither payload to evolve much if at all).
>
> There were also two comments in favor of option 2 on the design doc.
>
>   -> Unknown "urns for composite transforms" already default to the
>> subtransform graph implementation for most (all?) runners.
>>   -> Having a payload to toggle this behavior then can have whatever
>> desired behavior we like. It also allows for additional configurations
>> added in later on. This is preferable to a plethora of one-off urns IMHO.
>> We can have SDKs gate configuration combinations as needed if additional
>> ones appear.
>>
>> 2. It's very cheap to add but also ignore, as the default is "Do what
>> we're already doing without change", and not all SDKs need to add it right
>> away. It's more important that the portable way is defined at least, so
>> it's easy for other SDKs to add and handle it.
>>
>> I would prefer we have a clear starting point on what Reshuffle does
>> though. I remain a fan of "The Reshuffle (v2) Transform is a user
>> designated hint to a runner for a change in parallelism. By default, it
>> produces an output PCollection that has the same elements as the input
>> PCollection".
>>
>
> +1 this is a better phrasing of the spec I propose in
> https://s.apache.org/beam-redistribute but let's not get into it here if
> we can, and just evaluate the delta from that design to
> https://s.apache.org/beam-reshuffle-allowing-duplicates
>
> Kenn
>
>
>> It remains an open question about what that means for
>> checkpointing/durability behavior, but that's largely been runner dependent
>> anyway. I admit the above definition is biased by the uses of Reshuffle I'm
>> aware of, which largely are to incur a fusion break in the execution graph.
>>
>> Robert Burke
>> Beam Go Busybody
>>
>> On 2024/01/31 16:01:33 Kenneth Knowles wrote:
>> > On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:
>> >
>> > > Hi,
>> > >
>> > > if I understand this proposal correctly, the motivation is actually
>> > > reducing latency by bypassing bundle atomic guarantees, bundles after
>> "at
>> > > least once" Reshuffle would be reconstructed independently of the
>> > > pre-shuffle bundling. Provided this is correct, it seems that the
>> behavior
>> > > is slightly more general than for the case of Reshuffle. We have
>> already
>> > > some transforms that manipulate a specific property of a PCollection
>> - if
>> > > it may or might not contain duplicates. That is manipulated in two
>> ways -
>> > > explicitly removing duplicates based on IDs on sources that generate
>> > > duplicates and using @RequiresStableInput, mostly in sinks. These
>> > > techniques modify an inherent property of a PCollection, that is if it
>> > > contains or does not contain possible duplicates originating from the
>> same
>> > > input element.
>> > >
>> > > There are two types of duplicates - duplicate elements in _different
>> > > bundles_ (typically from at-least-once sources) and duplicates
>> arising due
>> > > to bundle reprocessing (affecting only transforms with side-effects,
>> that
>> > > is what we solve by @RequiresStableInput). The point I'm trying to
>> get to -
>> > > should we add these properties to PCollections (contains cross-bundle
>> > > duplicates vs. does not) and PTransforms ("outputs deduplicated
>> elements"
>> > > and "requires stable input")? That would allow us to analyze the
>> Pipeline
>> > > DAG and provide appropriate implementation for Reshuffle
>> automatically, so
>> > > that a new URN or flag would not be needed. Moreover, this might be
>> useful
>> > > for a broader range of optimizations.
>> > >
>> > > WDYT?
>> > >
>> > These are interesting ideas that could be useful. I think

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-02-08 Thread Kenneth Knowles

On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:

> OK, so my stance is a configurable Reshuffle might be interesting, so my
> vote is +1, along the following lines.
>
> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
> ReshufflePayload to it.
>

Ah, I see there's more than one variation of the "new URN" approach.
Namely, you have a new version of an existing URN prefix, while I had in
mind that it was a totally new base URN. In other words the open question I
meant to pose is between these options:

1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
2. beam:transform:reshuffle_allowing_duplicates:v1 {}

The most compelling argument in favor of option 2 is that it could have a
distinct payload type associated with the different URN (maybe parameters
around tweaking how much duplication? I don't know... I actually expect
neither payload to evolve much if at all).

There were also two comments in favor of option 2 on the design doc.

  -> Unknown "urns for composite transforms" already default to the
> subtransform graph implementation for most (all?) runners.
>   -> Having a payload to toggle this behavior then can have whatever
> desired behavior we like. It also allows for additional configurations
> added in later on. This is preferable to a plethora of one-off urns IMHO.
> We can have SDKs gate configuration combinations as needed if additional
> ones appear.
>
> 2. It's very cheap to add but also ignore, as the default is "Do what
> we're already doing without change", and not all SDKs need to add it right
> away. It's more important that the portable way is defined at least, so
> it's easy for other SDKs to add and handle it.
>
> I would prefer we have a clear starting point on what Reshuffle does
> though. I remain a fan of "The Reshuffle (v2) Transform is a user
> designated hint to a runner for a change in parallelism. By default, it
> produces an output PCollection that has the same elements as the input
> PCollection".
>

+1 this is a better phrasing of the spec I propose in
https://s.apache.org/beam-redistribute but let's not get into it here if we
can, and just evaluate the delta from that design to
https://s.apache.org/beam-reshuffle-allowing-duplicates

Kenn


> It remains an open question about what that means for
> checkpointing/durability behavior, but that's largely been runner dependent
> anyway. I admit the above definition is biased by the uses of Reshuffle I'm
> aware of, which largely are to incur a fusion break in the execution graph.
>
> Robert Burke
> Beam Go Busybody
>
> On 2024/01/31 16:01:33 Kenneth Knowles wrote:
> > On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:
> >
> > > Hi,
> > >
> > > if I understand this proposal correctly, the motivation is actually
> > > reducing latency by bypassing bundle atomic guarantees, bundles after
> "at
> > > least once" Reshuffle would be reconstructed independently of the
> > > pre-shuffle bundling. Provided this is correct, it seems that the
> behavior
> > > is slightly more general than for the case of Reshuffle. We have
> already
> > > some transforms that manipulate a specific property of a PCollection -
> if
> > > it may or might not contain duplicates. That is manipulated in two
> ways -
> > > explicitly removing duplicates based on IDs on sources that generate
> > > duplicates and using @RequiresStableInput, mostly in sinks. These
> > > techniques modify an inherent property of a PCollection, that is if it
> > > contains or does not contain possible duplicates originating from the
> same
> > > input element.
> > >
> > > There are two types of duplicates - duplicate elements in _different
> > > bundles_ (typically from at-least-once sources) and duplicates arising
> due
> > > to bundle reprocessing (affecting only transforms with side-effects,
> that
> > > is what we solve by @RequiresStableInput). The point I'm trying to get
> to -
> > > should we add these properties to PCollections (contains cross-bundle
> > > duplicates vs. does not) and PTransforms ("outputs deduplicated
> elements"
> > > and "requires stable input")? That would allow us to analyze the
> Pipeline
> > > DAG and provide appropriate implementation for Reshuffle
> automatically, so
> > > that a new URN or flag would not be needed. Moreover, this might be
> useful
> > > for a broader range of optimizations.
> > >
> > > WDYT?
> > >
> > These are interesting ideas that could be useful. I think they achieve a
> > different goal in my case. I actually want to explicitly allow
> > Reshuffle.allowingDuplicates() to skip expensive parts of its
> > implementation that are used to prevent duplicates.
> >
> > The property that would make it possible to automate this in the case of
> > combiners, or at least validate that the pipeline still gives 100%
> accurate
> > answers, would be something like @InsensitiveToDuplicateElements which is
> > longer and less esoteric than @Idempotent. For situations where there is
> a
> > source or sink that

Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-08 Thread Chamikara Jayalath via dev

+1 (binding)

Tried out Java/Python multi-lang jobs and upgrading BQ/Kafka transforms
from 2.53.0 to 2.54.0 using the Transform Service.

Thanks,
Cham

On Wed, Feb 7, 2024 at 5:52 PM XQ Hu via dev  wrote:

> +1 (non-binding)
>
> Validated with a simple RunInference Python pipeline:
> https://github.com/google/dataflow-ml-starter/actions/runs/7821639833/job/21339032997
>
> On Wed, Feb 7, 2024 at 7:10 PM Yi Hu via dev  wrote:
>
>> +1 (non-binding)
>>
>> Validated with Dataflow Template:
>> https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1317
>>
>> Regards,
>>
>> On Wed, Feb 7, 2024 at 11:18 AM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Ran a few batch and streaming examples for Python SDK on Dataflow Runner
>>>
>>> Thanks!
>>>
>>> On Wed, Feb 7, 2024 at 4:08 AM Jan Lukavský  wrote:
>>>
 +1 (binding)

 Validated Java SDK with Flink runner.

  Jan
 On 2/7/24 06:23, Robert Burke via dev wrote:

 Hi everyone,
 Please review and vote on the release candidate #2 for the version
 2.54.0,
 as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 Reviewers are encouraged to test their own use cases with the release
 candidate, and vote +1 if
 no issues are found. Only PMC member votes will count towards the final
 vote, but votes from all
 community members is encouraged and helpful for finding regressions; you
 can either test your own
 use cases [13] or use cases from the validation sheet [10].

 The complete staging area is available for your review, which includes:
 * GitHub Release notes [1],
 * the official Apache source release to be deployed to dist.apache.org
 [2],
 which is signed with the key with fingerprint D20316F712213422 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.54.0-RC2" [5],
 * website pull request listing the release [6], the blog post [6], and
 publishing the API reference manual [7].
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2] and PyPI[8].
 * Go artifacts and documentation are available at pkg.go.dev [9]
 * Validation sheet with a tab for 2.54.0 release to help with validation
 [10].
 * Docker images published to Docker Hub [11].
 * PR to run tests against release branch [12].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 For guidelines on how to try the release in your projects, check out
 our RC
 testing guide [13].

 Thanks,
 Robert Burke
 Beam 2.54.0 Release Manager

 [1] https://github.com/apache/beam/milestone/18?closed=1
 [2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1368/
 [5] https://github.com/apache/beam/tree/v2.54.0-RC2
 [6] https://github.com/apache/beam/pull/30201
 [7] https://github.com/apache/beam-site/pull/659
 [8] https://pypi.org/project/apache-beam/2.54.0rc2/
 [9]

 https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.54.0-RC2/go/pkg/beam
 [10]

 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=28763708
 [11] https://hub.docker.com/search?q=apache%2Fbeam=image
 [12] https://github.com/apache/beam/pull/30104
 [13]

 https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran

Ah that makes sense. Does the new version of Playground get staged for
release validation?

On Thu, Feb 8, 2024 at 12:08 PM Robert Burke  wrote:

> We redeploy the playground along with the release, so once 2.54.0 RC2 has
> been validated and voted on, I'll be redeploying it with 2.54.0.
>
> On Thu, Feb 8, 2024, 7:18 AM Joey Tran  wrote:
>
>> Here's two:
>>
>> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
>> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>>
>> Also, how often does playground get redeployed? I put up a PR[1] that's
>> been merged for try to reduce the amount of logging these examples produce
>> and I'm not sure if it's not working or if playground just hasn't been
>> redeployed in the last month or so
>>
>> [1] https://github.com/apache/beam/pull/29948
>>
>>
>> On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev 
>> wrote:
>>
>>> Can you provide which example you are referring to? I checked a few
>>> examples and usually we use beam.Map(print) to display some output values.
>>>
>>> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
>>> wrote:
>>>
 Hey all,

 I've been really trying to use Playground for educating new Beam users
 but it feels like there's something missing. A lot of examples (e.g.
 Multiple ParDo Outputs) for at least the python API don't seem to do
 anything observable. For example, the Multiple ParDo Outputs example writes
 to a file but is there any way to actually look at written out files? I
 feel like maybe I'm missing something.

 Best,
 Joey

>>>

Re: Playground: File Explorer?

2024-02-08 Thread Robert Burke

We redeploy the playground along with the release, so once 2.54.0 RC2 has
been validated and voted on, I'll be redeploying it with 2.54.0.

On Thu, Feb 8, 2024, 7:18 AM Joey Tran  wrote:

> Here's two:
>
> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>
> Also, how often does playground get redeployed? I put up a PR[1] that's
> been merged for try to reduce the amount of logging these examples produce
> and I'm not sure if it's not working or if playground just hasn't been
> redeployed in the last month or so
>
> [1] https://github.com/apache/beam/pull/29948
>
>
> On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev  wrote:
>
>> Can you provide which example you are referring to? I checked a few
>> examples and usually we use beam.Map(print) to display some output values.
>>
>> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
>> wrote:
>>
>>> Hey all,
>>>
>>> I've been really trying to use Playground for educating new Beam users
>>> but it feels like there's something missing. A lot of examples (e.g.
>>> Multiple ParDo Outputs) for at least the python API don't seem to do
>>> anything observable. For example, the Multiple ParDo Outputs example writes
>>> to a file but is there any way to actually look at written out files? I
>>> feel like maybe I'm missing something.
>>>
>>> Best,
>>> Joey
>>>
>>

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran

Here's two:
https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python

Also, how often does playground get redeployed? I put up a PR[1] that's
been merged for try to reduce the amount of logging these examples produce
and I'm not sure if it's not working or if playground just hasn't been
redeployed in the last month or so

[1] https://github.com/apache/beam/pull/29948


On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev  wrote:

> Can you provide which example you are referring to? I checked a few
> examples and usually we use beam.Map(print) to display some output values.
>
> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
> wrote:
>
>> Hey all,
>>
>> I've been really trying to use Playground for educating new Beam users
>> but it feels like there's something missing. A lot of examples (e.g.
>> Multiple ParDo Outputs) for at least the python API don't seem to do
>> anything observable. For example, the Multiple ParDo Outputs example writes
>> to a file but is there any way to actually look at written out files? I
>> feel like maybe I'm missing something.
>>
>> Best,
>> Joey
>>
>

Re: Playground: File Explorer?

2024-02-08 Thread XQ Hu via dev

Can you provide which example you are referring to? I checked a few
examples and usually we use beam.Map(print) to display some output values.

On Wed, Feb 7, 2024 at 8:55 PM Joey Tran  wrote:

> Hey all,
>
> I've been really trying to use Playground for educating new Beam users but
> it feels like there's something missing. A lot of examples (e.g. Multiple
> ParDo Outputs) for at least the python API don't seem to do anything
> observable. For example, the Multiple ParDo Outputs example writes to a
> file but is there any way to actually look at written out files? I feel
> like maybe I'm missing something.
>
> Best,
> Joey
>

Beam High Priority Issue Report (39)

2024-02-08 Thread beamactions

This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29971 [Bug]: FixedWindows not working for 
large Kafka topic
https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may 
cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29902 [Bug]: Messages are not ACK on 
Pubsub starting Beam 2.52.0 on Flink Runner in detached mode
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/29515 [Bug]: WriteToFiles in python leave 
few records in

Re: Playground: File Explorer?

Re: [VOTE] Release 2.54.0, release candidate #2

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

Re: [VOTE] Release 2.54.0, release candidate #2

Re: Playground: File Explorer?

Re: Playground: File Explorer?

Re: Playground: File Explorer?

Re: Playground: File Explorer?

Beam High Priority Issue Report (39)

10 matches

Site Navigation

Mail list logo

Footer information