Fwd: Community over Code EU 2024 Travel Assistance Applications now open!

2024-01-26 Thread Valentyn Tymofieiev via dev
FYI.

-- Forwarded message -

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code EU 2024 are now
open!

TAC will be supporting Community over Code EU, Bratislava, Slovakia,
June 3rd - 5th, 2024.

TAC exists to help those who would like to attend Community over Code
events but are unable to do so for financial reasons. For more info
on this year's applications and qualifying criteria, please visit the
TAC website at <https://tac.apache.org/>. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those who are able to attend the full event.

Important: Applications close on Friday, March 1st, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request); this
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with applications from a diverse range
of backgrounds; therefore, we encourage (as always) anyone thinking
about sending in an application to do so ASAP.

When replying, please reply to travel-assista...@apache.org


Fwd: [Important] GSoC 2024 Project Ideas

2024-01-26 Thread Ahmet Altay via dev
Hi all!

GSoC is a good opportunity to bring new community members into the
project and for you to mentor junior folks. If you have any ideas,
please record them (https://s.apache.org/gsoc2024ideas) *by 3rd Feb*.

As for ideas, these are good opportunities for: (i) low-hanging but
low-priority projects, such as usability, infra, or website improvements,
or new IOs and IO-related features; (ii) prototyping things we have not
spent much time on yet, such as Ray on Beam; (iii) long-awaited projects
that are well known but not prioritized, such as fixing known runner
issues or running some benchmarks.

Feel free to share this information with other relevant folks.

Ahmet

-- Forwarded message -
From: Priya Sharma 
Date: Thu, Jan 25, 2024 at 4:21 AM
Subject: [Important] GSoC 2024 Project Ideas
To: , Swapnil M Mane ,
Sanyam Goel , Maxim Solodovnik 


Hello PMCs,

Google Summer of Code is an ideal opportunity for you to attract new
contributors to your projects, and GSoC 2024 is here.

The ASF will be applying as a participating organization for GSoC 2024.
As part of the application, we need you all to *mandatorily* start
recording your ideas now [1], no later than 3rd Feb.

There is a slight change in the rules this year, just reiterating here:
- For the 2024 program, there will be three options for project scope:
medium at ~175 hours, large at ~350 hours, and a new size, small at ~90
hours.
  Please add the "*full-time*" label to the JIRA for a 350-hour project,
the "*part-time*" label for a 175-hour project, and "*small*" for a
90-hour project.

Note: They are looking to bring more open source projects in the AI/ML
field into GSoC 2024, so we encourage more projects from this domain
to participate.

If you are a new mentor or your project is participating for the first
time, please read [2][3].

On behalf of the GSoC 2024 admins, please feel free to reach out to
us in case of queries or concerns.

[1] https://s.apache.org/gsoc2024ideas
[2] https://community.apache.org/gsoc.html
[3] https://community.apache.org/guide-to-being-a-mentor.html


Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Ah! I understand now. GroupByKey _and_ CombineValues are fused together,
and a partial version of both runs locally on the worker first. I forgot
that GroupByKey is also 'lifted'. So we group by key locally and drop the
extraneous Nones locally, so we don't need to unnecessarily communicate
them back to the full GroupByKey.
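
For reference, a minimal end-to-end sketch of Distinct in use (assuming
apache_beam is installed; beam.Distinct wraps exactly the ToPairs |
CombinePerKey | Keys chain discussed below, and output order is not
guaranteed):

import apache_beam as beam

with beam.Pipeline() as p:  # runs on the default DirectRunner
    _ = (
        p
        | beam.Create(['a', 'a', 'b', 'b', 'c'])
        | beam.Distinct()   # dedupes via the lifted CombinePerKey
        | beam.Map(print))  # prints a, b, c (in some order)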

Thanks for the clarification!

On Fri, Jan 26, 2024 at 12:03 PM Robert Bradshaw 
wrote:

> On Fri, Jan 26, 2024 at 8:43 AM Joey Tran 
> wrote:
> >
> > Hmm, I think I might still be missing something. CombinePerKey is made
> > up of "GBK() | CombineValues". Expanded inline, Distinct looks like:
> >
> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >   """Produces a PCollection containing distinct elements of a
> > PCollection."""
> >   return (
> >       pcoll
> >       | 'ToPairs' >> Map(lambda v: (v, None))
> >       | 'Group' >> GroupByKey()
> >       | 'CombineValues' >> CombineValues(lambda vs: None)
> >       | 'Distinct' >> Keys())
> >
> > Does the combiner lifting somehow make the GroupByKey operation more
> > efficient despite coming after it? My intuition would suggest that we
> > could just remove the `CombineValues` altogether.
>
> The key property of CombineFns is that they are commutative and
> associative, which permits an optimization called combiner lifting.
> Specifically, the operation
>
> GroupByKey() | CombineValues(C)
>
> is rewritten into
>
> PartialCombineUsingLocalBufferMap(C) | GroupByKey() | FinalCombine(C)
>
> a form that pretty much every runner supports (going back to the days
> of the original MapReduce), which is what can make this so much more
> efficient.
>
>
> https://github.com/apache/beam/blob/release-2.21.0/sdks/python/apache_beam/runners/portability/fn_api_runner/translations.py#L669
>
> I am, unfortunately, coming up short in finding good documentation on
> this (Apache Beam specific or otherwise).
>
>
> > On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> This is because it allows us to do some of the deduplication before
> >> shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one
> >> worker and [B, B, B, B, C, C] on another. Rather than passing all that
> >> data through the GroupByKey (which involves (relatively) expensive
> >> materialization and cross-machine traffic), with this form the first
> >> worker will only emit [A, B] and the second [B, C], and only the B
> >> needs to be deduplicated post-shuffle.
> >>
> >> Wouldn't hurt to have a comment to that effect there.
> >>
> >> https://beam.apache.org/documentation/programming-guide/#combine
> >>
> >> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran 
> wrote:
> >> >
> >> > Hey all,
> >> >
> >> > I was poking around and looking at `Distinct` and was confused about
> why it was implemented the way it was.
> >> >
> >> > Reproduced here:
> >> > @ptransform_fn
> >> > @typehints.with_input_types(T)
> >> > @typehints.with_output_types(T)
> >> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >> >   """Produces a PCollection containing distinct elements of a
> PCollection."""
> >> >   return (
> >> >   pcoll
> >> >   | 'ToPairs' >> Map(lambda v: (v, None))
> >> >   | 'Group' >> CombinePerKey(lambda vs: None)
> >> >   | 'Distinct' >> Keys())
> >> >
> >> > Could anyone clarify why we'd use a `CombinePerKey` instead of just
> using `GroupByKey`?
> >> >
> >> > Cheers,
> >> > Joey
>


Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Robert Bradshaw via dev
On Fri, Jan 26, 2024 at 8:43 AM Joey Tran  wrote:
>
> Hmm, I think I might still be missing something. CombinePerKey is made up
> of "GBK() | CombineValues". Expanded inline, Distinct looks like:
>
> def Distinct(pcoll):  # pylint: disable=invalid-name
>   """Produces a PCollection containing distinct elements of a PCollection."""
>   return (
>       pcoll
>       | 'ToPairs' >> Map(lambda v: (v, None))
>       | 'Group' >> GroupByKey()
>       | 'CombineValues' >> CombineValues(lambda vs: None)
>       | 'Distinct' >> Keys())
>
> Does the combiner lifting somehow make the GroupByKey operation more
> efficient despite coming after it? My intuition would suggest that we could
> just remove the `CombineValues` altogether.

The key property of CombineFns is that they are commutative and
associative, which permits an optimization called combiner lifting.
Specifically, the operation

GroupByKey() | CombineValues(C)

is rewritten into

PartialCombineUsingLocalBufferMap(C) | GroupByKey() | FinalCombine(C)

a form that pretty much every runner supports (going back to the days
of the original MapReduce), which is what can make this so much more
efficient.

https://github.com/apache/beam/blob/release-2.21.0/sdks/python/apache_beam/runners/portability/fn_api_runner/translations.py#L669

I am, unfortunately, coming up short in finding good documentation on
this (Apache Beam specific or otherwise).
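
For illustration, a minimal lifting-eligible pipeline (a sketch, assuming
apache_beam is installed; sum stands in for any commutative, associative
combiner):

import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([('a', 1), ('a', 2), ('b', 3)])
        # A runner may lift this into per-worker partial sums, a
        # GroupByKey over the partials, and a final merging sum.
        | beam.CombinePerKey(sum)
        | beam.Map(print))  # ('a', 3), ('b', 3)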


> On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev 
>  wrote:
>>
>> This is because it allows us to do some of the deduplication before
>> shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one
>> worker and [B, B, B, B, C, C] on another. Rather than passing all that
>> data through the GroupByKey (which involves (relatively) expensive
>> materialization and cross-machine traffic), with this form the first
>> worker will only emit [A, B] and the second [B, C], and only the B
>> needs to be deduplicated post-shuffle.
>>
>> Wouldn't hurt to have a comment to that effect there.
>>
>> https://beam.apache.org/documentation/programming-guide/#combine
>>
>> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran  wrote:
>> >
>> > Hey all,
>> >
>> > I was poking around and looking at `Distinct` and was confused about why 
>> > it was implemented the way it was.
>> >
>> > Reproduced here:
>> > @ptransform_fn
>> > @typehints.with_input_types(T)
>> > @typehints.with_output_types(T)
>> > def Distinct(pcoll):  # pylint: disable=invalid-name
>> >   """Produces a PCollection containing distinct elements of a 
>> > PCollection."""
>> >   return (
>> >   pcoll
>> >   | 'ToPairs' >> Map(lambda v: (v, None))
>> >   | 'Group' >> CombinePerKey(lambda vs: None)
>> >   | 'Distinct' >> Keys())
>> >
>> > Could anyone clarify why we'd use a `CombinePerKey` instead of just using 
>> > `GroupByKey`?
>> >
>> > Cheers,
>> > Joey


Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hmm, I think I might still be missing something. CombinePerKey is made up
of "GBK() | CombineValues". Expanded inline, Distinct looks like:

def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a
PCollection."""
  return (
      pcoll
      | 'ToPairs' >> Map(lambda v: (v, None))
      | 'Group' >> GroupByKey()
      | 'CombineValues' >> CombineValues(lambda vs: None)
      | 'Distinct' >> Keys())

Does the combiner lifting somehow make the GroupByKey operation more
efficient despite coming after it? My intuition would suggest that we could
just remove the `CombineValues` altogether.



On Fri, Jan 26, 2024 at 11:33 AM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:

> This is because it allows us to do some of the deduplication before
> shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one
> worker and [B, B, B, B, C, C] on another. Rather than passing all that
> data through the GroupByKey (which involves (relatively) expensive
> materialization and cross-machine traffic), with this form the first
> worker will only emit [A, B] and the second [B, C], and only the B
> needs to be deduplicated post-shuffle.
>
> Wouldn't hurt to have a comment to that effect there.
>
> https://beam.apache.org/documentation/programming-guide/#combine
>
> On Fri, Jan 26, 2024 at 8:22 AM Joey Tran 
> wrote:
> >
> > Hey all,
> >
> > I was poking around and looking at `Distinct` and was confused about why
> it was implemented the way it was.
> >
> > Reproduced here:
> > @ptransform_fn
> > @typehints.with_input_types(T)
> > @typehints.with_output_types(T)
> > def Distinct(pcoll):  # pylint: disable=invalid-name
> >   """Produces a PCollection containing distinct elements of a
> PCollection."""
> >   return (
> >   pcoll
> >   | 'ToPairs' >> Map(lambda v: (v, None))
> >   | 'Group' >> CombinePerKey(lambda vs: None)
> >   | 'Distinct' >> Keys())
> >
> > Could anyone clarify why we'd use a `CombinePerKey` instead of just
> using `GroupByKey`?
> >
> > Cheers,
> > Joey
>


Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Robert Bradshaw via dev
This is because it allows us to do some of the deduplication before
shuffle via combiner lifting. E.g. say we have [A, A, A, B, B] on one
worker and [B, B, B, B, C, C] on another. Rather than passing all that
data through the GroupByKey (which involves (relatively) expensive
materialization and cross-machine traffic), with this form the first
worker will only emit [A, B] and the second [B, C], and only the B
needs to be deduplicated post-shuffle.

Wouldn't hurt to have a comment to that effect there.

https://beam.apache.org/documentation/programming-guide/#combine
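
To make the arithmetic concrete, a hand-rolled sketch of the lifted form
in plain Python (not the Beam API; partial_combine and final_combine are
illustrative names):

from collections import defaultdict

def partial_combine(shard, combine):
    # Pre-shuffle, on each worker: reduce values locally per key.
    acc = defaultdict(list)
    for key, value in shard:
        acc[key].append(value)
    return {k: combine(vs) for k, vs in acc.items()}

def final_combine(partials, combine):
    # Post-shuffle: merge the per-worker partial results per key.
    merged = defaultdict(list)
    for partial in partials:
        for k, v in partial.items():
            merged[k].append(v)
    return {k: combine(vs) for k, vs in merged.items()}

combine = lambda vs: None  # Distinct's combiner; the values never matter
shard_1 = [('A', None)] * 3 + [('B', None)] * 2  # first worker
shard_2 = [('B', None)] * 4 + [('C', None)] * 2  # second worker
partials = [partial_combine(s, combine) for s in (shard_1, shard_2)]
# Worker 1 ships only keys {'A', 'B'}; worker 2 only {'B', 'C'}; just
# the overlapping B needs post-shuffle deduplication.
print(final_combine(partials, combine))  # {'A': None, 'B': None, 'C': None}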

On Fri, Jan 26, 2024 at 8:22 AM Joey Tran  wrote:
>
> Hey all,
>
> I was poking around and looking at `Distinct` and was confused about why it 
> was implemented the way it was.
>
> Reproduced here:
> @ptransform_fn
> @typehints.with_input_types(T)
> @typehints.with_output_types(T)
> def Distinct(pcoll):  # pylint: disable=invalid-name
>   """Produces a PCollection containing distinct elements of a PCollection."""
>   return (
>   pcoll
>   | 'ToPairs' >> Map(lambda v: (v, None))
>   | 'Group' >> CombinePerKey(lambda vs: None)
>   | 'Distinct' >> Keys())
>
> Could anyone clarify why we'd use a `CombinePerKey` instead of just using 
> `GroupByKey`?
>
> Cheers,
> Joey


[python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hey all,

I was poking around and looking at `Distinct` and was confused about why it
was implemented the way it was.

Reproduced here:
@ptransform_fn
@typehints.with_input_types(T)
@typehints.with_output_types(T)
def Distinct(pcoll):  # pylint: disable=invalid-name
  """Produces a PCollection containing distinct elements of a
PCollection."""
  return (
  pcoll
  | 'ToPairs' >> Map(lambda v: (v, None))
  | 'Group' >> CombinePerKey(lambda vs: None)
  | 'Distinct' >> Keys())

Could anyone clarify why we'd use a `CombinePerKey` instead of just using
`GroupByKey`?

Cheers,
Joey


Re: Beam Jenkins nodes offline

2024-01-26 Thread Danny McCormick via dev
Hey Gavin, thanks for looking into this. I think there's a separate thread
going for this, but we've now fully migrated all of our CI to GitHub
Actions, so we can probably just safely turn down Beam's Jenkins instance.
Valentyn opened a Jira ticket yesterday for this purpose -
https://issues.apache.org/jira/browse/INFRA-25427

Thanks,
Danny

On Fri, Jan 26, 2024 at 2:42 AM Gavin McDonald  wrote:

> Hi All,
>
> Due to Infra migrating ci-beam to a new VM and also upgrading the
> Controller, all Beam agents are currently offline. Who can work with me on
> this to get them back up and running?
>
> I am online in Slack the-asf channel #asfinfra or just reply to this email
> cc: gmcdonald@a.o
>
> --
>
> *Gavin McDonald*
> Systems Administrator
> ASF Infrastructure Team
>


Beam High Priority Issue Report (47)

2024-01-26 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29971 [Bug]: FixedWindows not working for 
large Kafka topic
https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may 
cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29902 [Bug]: Messages are not ACK on 
Pubsub starting Beam 2.52.0 on Flink Runner in detached mode
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: