Fwd: Community over Code EU 2024 Travel Assistance Applications now open!
FYI. -- Forwarded message -

The Travel Assistance Committee (TAC) is pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! TAC will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024.

TAC exists to help those who would like to attend Community over Code events but are unable to do so for financial reasons. For more info on this year's applications and qualifying criteria, please visit the TAC website at < https://tac.apache.org/ >. Applications are already open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting applications from those who are able to attend the full event.

Important: Applications close on Friday, March 1st, 2024. Applicants have until the closing date above to submit their applications (which should contain as much supporting material as required to efficiently and accurately process their request); this will enable TAC to announce successful applications shortly afterwards.

As usual, TAC expects to deal with applications from a diverse range of backgrounds; therefore, we encourage (as always) anyone thinking about sending in an application to do so ASAP.

When replying, please reply to travel-assista...@apache.org
Fwd: [Important] GSoC 2024 Project Ideas
Hi all! GSoC is a good opportunity to get new community members, and a good opportunity for you to mentor junior folks. If you have any ideas, please record them (https://s.apache.org/gsoc2024ideas) *by 3rd Feb*.

As for ideas, these are good opportunities for: (i) some low-hanging but low-priority projects: usability, infra, or website improvements, some new IOs or IO-related features; (ii) prototyping things we have not yet spent much time on: Ray on Beam; (iii) long-awaited projects that are well known but not prioritized: fixing known runner issues, running some benchmarks.

Feel free to share this information with other relevant folks.

Ahmet

-- Forwarded message -
From: Priya Sharma
Date: Thu, Jan 25, 2024 at 4:21 AM
Subject: [Important] GSoC 2024 Project Ideas
To: , Swapnil M Mane , Sanyam Goel , Maxim Solodovnik

Hello PMCs,

Google Summer of Code is the ideal opportunity for you to attract new contributors to your projects, and GSoC 2024 is here. The ASF will be applying as a participating organization for GSoC 2024. As a part of the application, we need you all to *mandatorily* start recording your ideas now [1], by 3rd Feb at the latest.

There is a slight change in the rules this year, just reiterating here:

- For the 2024 program, there will be three options for project scope: medium at ~175 hours, large at ~350 hours, and a new size: small at ~90 hours. Please add the "*full-time*" label to the JIRA for a 350-hour project, the "*part-time*" label for a 175-hour project, and the "*small*" label for a 90-hour project.

Note: They are looking to bring more open source projects in the AI/ML field into GSoC 2024, so we encourage more projects from this domain to participate.

If you are a new mentor or your project is participating for the first time, please read [2][3].

On behalf of the GSoC 2024 admins, please feel free to reach out to us in case of queries or concerns.

[1] https://s.apache.org/gsoc2024ideas
[2] https://community.apache.org/gsoc.html
[3] https://community.apache.org/guide-to-being-a-mentor.html
Re: [python] Why CombinePerKey(lambda vs: None)?
Ah! I understand now. Both GroupByKey _and_ CombineValues are joined together and both run locally on the worker first. I forgot that GroupByKey is also 'lifted'. So we group by key locally and drop the extraneous Nones locally, so we don't need to unnecessarily communicate them back to the full GroupByKey. Thanks for the clarification!

On Fri, Jan 26, 2024 at 12:03 PM Robert Bradshaw wrote:

> On Fri, Jan 26, 2024 at 8:43 AM Joey Tran wrote:
> >
> > Hmm, I think I might still be missing something. CombinePerKey is made up of "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks like:
> >
> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >   """Produces a PCollection containing distinct elements of a PCollection."""
> >   return (
> >       pcoll
> >       | 'ToPairs' >> Map(lambda v: (v, None))
> >       | 'Group' >> GroupByKey()
> >       | 'CombineValues' >> CombineValues(lambda vs: None)
> >       | 'Distinct' >> Keys())
> >
> > Does the combiner lifting somehow make the GroupByKey operation more efficient despite coming after it? My intuition would suggest that we could just remove the `CombineValues` altogether.
>
> The key property of CombineFns is that they are commutative and associative, which permits an optimization called combiner lifting. Specifically, the operation
>
>   GroupByKey() | CombineValues(C)
>
> is rewritten into
>
>   PartialCombineUsingLocalBufferMap(C) | GroupByKey() | FinalCombine(C)
>
> which pretty much every runner supports (going back to the days of the original MapReduce), and which is what can make this so much more efficient.
>
> https://github.com/apache/beam/blob/release-2.21.0/sdks/python/apache_beam/runners/portability/fn_api_runner/translations.py#L669
>
> I am, unfortunately, coming up short in finding good documentation on this (Apache Beam specific or otherwise).
>
> > On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >>
> >> This is because it allows us to do some of the deduplication before shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one worker and [B, B, B, B, C, C] on another. Rather than passing all that data through the GroupByKey (which involves (relatively) expensive materialization and cross-machine traffic), with this form the first worker will only emit [A, B] and the second [B, C], and only the B needs to be deduplicated post-shuffle.
> >>
> >> Wouldn't hurt to have a comment to that effect there.
> >>
> >> https://beam.apache.org/documentation/programming-guide/#combine
> >>
> >> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran wrote:
> >> >
> >> > Hey all,
> >> >
> >> > I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was.
> >> >
> >> > Reproduced here:
> >> > @ptransform_fn
> >> > @typehints.with_input_types(T)
> >> > @typehints.with_output_types(T)
> >> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >> >   """Produces a PCollection containing distinct elements of a PCollection."""
> >> >   return (
> >> >       pcoll
> >> >       | 'ToPairs' >> Map(lambda v: (v, None))
> >> >       | 'Group' >> CombinePerKey(lambda vs: None)
> >> >       | 'Distinct' >> Keys())
> >> >
> >> > Could anyone clarify why we'd use a `CombinePerKey` instead of just using `GroupByKey`?
> >> >
> >> > Cheers,
> >> > Joey
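A plain-Python sketch of the local stage described above, for illustration only (the names here are made up, not runner internals): grouping within the worker and then applying `lambda vs: None` means each key's grouped value list is dropped before anything reaches the full shuffle.

from itertools import groupby

def local_group_and_combine(pairs):
    # Per-worker grouping followed by the combiner: each key's value
    # list is collapsed to None before anything is shuffled.
    pairs = sorted(pairs)
    return [(key, None) for key, _ in groupby(pairs, key=lambda kv: kv[0])]

worker_pairs = [('A', None), ('A', None), ('B', None)]
print(local_group_and_combine(worker_pairs))  # [('A', None), ('B', None)]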
Re: [python] Why CombinePerKey(lambda vs: None)?
On Fri, Jan 26, 2024 at 8:43 AM Joey Tran wrote:
>
> Hmm, I think I might still be missing something. CombinePerKey is made up of "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks like:
>
> def Distinct(pcoll):  # pylint: disable=invalid-name
>   """Produces a PCollection containing distinct elements of a PCollection."""
>   return (
>       pcoll
>       | 'ToPairs' >> Map(lambda v: (v, None))
>       | 'Group' >> GroupByKey()
>       | 'CombineValues' >> CombineValues(lambda vs: None)
>       | 'Distinct' >> Keys())
>
> Does the combiner lifting somehow make the GroupByKey operation more efficient despite coming after it? My intuition would suggest that we could just remove the `CombineValues` altogether.

The key property of CombineFns is that they are commutative and associative, which permits an optimization called combiner lifting. Specifically, the operation

  GroupByKey() | CombineValues(C)

is rewritten into

  PartialCombineUsingLocalBufferMap(C) | GroupByKey() | FinalCombine(C)

which pretty much every runner supports (going back to the days of the original MapReduce), and which is what can make this so much more efficient.

https://github.com/apache/beam/blob/release-2.21.0/sdks/python/apache_beam/runners/portability/fn_api_runner/translations.py#L669

I am, unfortunately, coming up short in finding good documentation on this (Apache Beam specific or otherwise).

> On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>
>> This is because it allows us to do some of the deduplication before shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one worker and [B, B, B, B, C, C] on another. Rather than passing all that data through the GroupByKey (which involves (relatively) expensive materialization and cross-machine traffic), with this form the first worker will only emit [A, B] and the second [B, C], and only the B needs to be deduplicated post-shuffle.
>>
>> Wouldn't hurt to have a comment to that effect there.
>>
>> https://beam.apache.org/documentation/programming-guide/#combine
>>
>> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran wrote:
>> >
>> > Hey all,
>> >
>> > I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was.
>> >
>> > Reproduced here:
>> > @ptransform_fn
>> > @typehints.with_input_types(T)
>> > @typehints.with_output_types(T)
>> > def Distinct(pcoll):  # pylint: disable=invalid-name
>> >   """Produces a PCollection containing distinct elements of a PCollection."""
>> >   return (
>> >       pcoll
>> >       | 'ToPairs' >> Map(lambda v: (v, None))
>> >       | 'Group' >> CombinePerKey(lambda vs: None)
>> >       | 'Distinct' >> Keys())
>> >
>> > Could anyone clarify why we'd use a `CombinePerKey` instead of just using `GroupByKey`?
>> >
>> > Cheers,
>> > Joey
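As a toy illustration of why commutativity and associativity make that rewrite safe, here is a sketch with "count" standing in for the CombineFn C (illustrative names only, not Beam APIs): combining per-worker partial results and then merging them gives the same answer as combining everything at once.

def combine(values):
    # Stand-in CombineFn C: count the elements.
    return sum(1 for _ in values)

def final_combine(partials):
    # Merging partial results: for counting, merge is just summation.
    return sum(partials)

values = ['a', 'b', 'c', 'd', 'e']
direct = combine(values)                      # GroupByKey() | CombineValues(C)
lifted = final_combine([combine(values[:2]),  # partial result from worker 1
                        combine(values[2:])]) # partial result from worker 2
assert direct == lifted == 5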
Re: [python] Why CombinePerKey(lambda vs: None)?
Hmm, I think I might still be missing something. CombinePerKey is made up of "GBK() | CombineValues". Pulling it out into the Distinct, Distinct looks like:

def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a PCollection."""
  return (
      pcoll
      | 'ToPairs' >> Map(lambda v: (v, None))
      | 'Group' >> GroupByKey()
      | 'CombineValues' >> CombineValues(lambda vs: None)
      | 'Distinct' >> Keys())

Does the combiner lifting somehow make the GroupByKey operation more efficient despite coming after it? My intuition would suggest that we could just remove the `CombineValues` altogether.

On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev <dev@beam.apache.org> wrote:

> This is because it allows us to do some of the deduplication before shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one worker and [B, B, B, B, C, C] on another. Rather than passing all that data through the GroupByKey (which involves (relatively) expensive materialization and cross-machine traffic), with this form the first worker will only emit [A, B] and the second [B, C], and only the B needs to be deduplicated post-shuffle.
>
> Wouldn't hurt to have a comment to that effect there.
>
> https://beam.apache.org/documentation/programming-guide/#combine
>
> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran wrote:
> >
> > Hey all,
> >
> > I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was.
> >
> > Reproduced here:
> > @ptransform_fn
> > @typehints.with_input_types(T)
> > @typehints.with_output_types(T)
> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >   """Produces a PCollection containing distinct elements of a PCollection."""
> >   return (
> >       pcoll
> >       | 'ToPairs' >> Map(lambda v: (v, None))
> >       | 'Group' >> CombinePerKey(lambda vs: None)
> >       | 'Distinct' >> Keys())
> >
> > Could anyone clarify why we'd use a `CombinePerKey` instead of just using `GroupByKey`?
> >
> > Cheers,
> > Joey
Re: [python] Why CombinePerKey(lambda vs: None)?
This is because it allows us to do some of the deduplication before shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one worker and [B, B, B, B, C, C] on another. Rather than passing all that data through the GroupByKey (which involves (relatively) expensive materialization and cross-machine traffic), with this form the first worker will only emit [A, B] and the second [B, C], and only the B needs to be deduplicated post-shuffle.

Wouldn't hurt to have a comment to that effect there.

https://beam.apache.org/documentation/programming-guide/#combine

On Fri, Jan 26, 2024 at 8:22 AM Joey Tran wrote:
>
> Hey all,
>
> I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was.
>
> Reproduced here:
> @ptransform_fn
> @typehints.with_input_types(T)
> @typehints.with_output_types(T)
> def Distinct(pcoll):  # pylint: disable=invalid-name
>   """Produces a PCollection containing distinct elements of a PCollection."""
>   return (
>       pcoll
>       | 'ToPairs' >> Map(lambda v: (v, None))
>       | 'Group' >> CombinePerKey(lambda vs: None)
>       | 'Distinct' >> Keys())
>
> Could anyone clarify why we'd use a `CombinePerKey` instead of just using `GroupByKey`?
>
> Cheers,
> Joey
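A plain-Python sketch of the example above, with `partial_combine` as an illustrative stand-in for the lifted, pre-shuffle stage (made-up name, not a Beam API):

from collections import defaultdict

def partial_combine(worker_elements):
    # Stand-in for the per-worker stage: with `lambda vs: None`,
    # buffering one entry per key collapses local duplicates.
    buffer = {}
    for element in worker_elements:
        buffer[element] = None
    return list(buffer.items())

worker1 = partial_combine(['A', 'A', 'A', 'B', 'B'])       # [('A', None), ('B', None)]
worker2 = partial_combine(['B', 'B', 'B', 'B', 'C', 'C'])  # [('B', None), ('C', None)]

# Shuffle + final combine: only the duplicate 'B' still needs merging.
grouped = defaultdict(list)
for key, value in worker1 + worker2:
    grouped[key].append(value)
print(sorted(grouped))  # ['A', 'B', 'C']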
[python] Why CombinePerKey(lambda vs: None)?
Hey all,

I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was.

Reproduced here:

@ptransform_fn
@typehints.with_input_types(T)
@typehints.with_output_types(T)
def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a PCollection."""
  return (
      pcoll
      | 'ToPairs' >> Map(lambda v: (v, None))
      | 'Group' >> CombinePerKey(lambda vs: None)
      | 'Distinct' >> Keys())

Could anyone clarify why we'd use a `CombinePerKey` instead of just using `GroupByKey`?

Cheers,
Joey
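For anyone who wants to poke at this themselves, here is a minimal self-contained version of the pattern above, which should run on the direct runner (assuming only that apache_beam is installed; output order is not guaranteed):

import apache_beam as beam

# Minimal sketch exercising the Distinct pattern from the thread above.
with beam.Pipeline() as p:
    _ = (
        p
        | 'Create' >> beam.Create(['A', 'A', 'A', 'B', 'B', 'C'])
        | 'ToPairs' >> beam.Map(lambda v: (v, None))
        | 'Group' >> beam.CombinePerKey(lambda vs: None)
        | 'Distinct' >> beam.Keys()
        | 'Print' >> beam.Map(print))  # expect A, B, C in some order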
Re: Beam Jenkins nodes offline
Hey Gavin, thanks for looking into this. I think there's a separate thread going for this, but we've now fully migrated all of our CI to GitHub Actions, so we can probably just safely turn down Beam's Jenkins instance. Valentyn opened a Jira ticket yesterday for this purpose: https://issues.apache.org/jira/browse/INFRA-25427

Thanks,
Danny

On Fri, Jan 26, 2024 at 2:42 AM Gavin McDonald wrote:

> Hi All,
>
> Due to Infra migrating ci-beam to a new VM and also upgrading the Controller, all Beam agents are currently offline. Who can work with me on this to get them back up and running?
>
> I am online in Slack, the-asf channel #asfinfra, or just reply to this email cc: gmcdonald@a.o
>
> --
>
> *Gavin McDonald*
> Systems Administrator
> ASF Infrastructure Team
Beam High Priority Issue Report (47)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29971 [Bug]: FixedWindows not working for large Kafka topic
https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29902 [Bug]: Messages are not ACK on Pubsub starting Beam 2.52.0 on Flink Runner in detached mode
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it flakey
https://github.com/apache/beam/issues/21104 Flaky: