RE: Re: [YAML] ReadFromKafka with yaml

2024-01-10 Thread Yarden BenMoshe
Thanks for the detailed answer.
I forgot to mention that I am using the FlinkRunner in my setup. Will this
work with that runner as well?


On 2024/01/10 13:34:28 Ferran Fernández Garrido wrote:
> Hi Yarden,
>
> If you are using Dataflow as a runner, you can already use
> ReadFromKafka (introduced originally in version 2.52). Dataflow will
> handle the expansion service automatically, so you don't have to do
> anything.
>
> If you want to run it locally for development purposes, you'll have to
> build the Docker image. You can check out the project and run:
>
> ./gradlew :sdks:java:container:java8:docker
> -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
> (where $DOCKER_ROOT is your container repository location)
>
> Then, for instance, if you want to run your custom Docker image in
> Dataflow, you could do this:
>
> (Build the Python SDK -> python setup.py sdist to get
> apache-beam-2.53.0.dev0.tar.gz)
>
> You'll have to build the expansion service that Kafka uses (in case
> you've changed something in KafkaIO): ./gradlew
> :sdks:java:io:expansion-service:build
>
> python3 -m apache_beam.yaml.main --runner=DataflowRunner
> --project=project_id --region=region --temp_location=temp_location
> --pipeline_spec_file=yaml_pipeline.yml
> --staging_location=staging_location
> --sdk_location="path/apache-beam-2.53.0.dev0.tar.gz"
> --sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest"
> --streaming
>
> This is an example of how to read JSON events from Kafka in Beam YAML:
>
> - type: ReadFromKafka
>   config:
>     topic: 'TOPIC_NAME'
>     format: JSON
>     bootstrap_servers: 'BOOTSTRAP_SERVERS'
>     schema: 'JSON_SCHEMA'
>
> Best,
> Ferran
>
> El mié, 10 ene 2024 a las 14:11, Yarden BenMoshe
> () escribió:
> >
> > Hi,
> >
> > I am trying to consume a Kafka topic using the ReadFromKafka transform.
> >
> > If I got it right, since ReadFromKafka is originally written in Java, an
> > expansion service is needed and the default environment is set to DOCKER. In
> > the current implementation I can see that the expansion service field is not
> > adjustable (I'm not able to pass it as part of the transform's config).
> > Is there currently a way to use ReadFromKafka from a pipeline written with
> > the YAML API? If so, an explanation would be much appreciated.
> >
> > I saw there's a workaround suggested online of using Docker-in-Docker,
> > but I would prefer to avoid it.
> >
> > Thanks
> > Yarden
>


Re: Beam 2.54.0 Release

2024-01-10 Thread Robert Burke
Not sure why newlines were eaten. Hopefully reflowed inline below. 

On 2024/01/10 17:53:56 Robert Burke wrote:
> Hey everyone, Happy New Year!
>
> The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks
> from today, according to the release calendar [1]. I'd like to perform this
> release; I will cut the branch on that date, and cherrypick remaining 
> release-blocking fixes afterwards, if any. 
>
> Please help with the release by: 
>
> - Making sure that any unresolved release blocking issues have
> their "Milestone" marked as "2.54.0 Release" as soon as possible.
> - Reviewing the current release blockers [2] and removing the Milestone if
> they don't meet the criteria at [3]. There are currently 8 release blockers.
>
> Let me know if you have any comments/objections/questions.
> Thanks, Robert Burke
> 
> [1] 
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>
> [2] https://github.com/apache/beam/milestone/18
>
> [3]  https://beam.apache.org/contribute/release-blocking/
> 


Re: ByteBuddy DoFnInvokers Write Up

2024-01-10 Thread Robert Burke
That's neat! Thanks for writing that up!

On Wed, Jan 10, 2024, 11:12 AM John Casey via dev 
wrote:

> The team at Google recently held an internal hackathon, and my hack
> involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
> up going anywhere, but I learned a lot about how our code generation works.
> It turns out we have no documentation or design docs about our code
> generation, so I wrote up what I learned.
>
> Please take a look, and let me know if I got anything wrong, or if you are
> looking for more detail.
>
> s.apache.org/beam-bytebuddy-dofninvoker
>
> John
>


ByteBuddy DoFnInvokers Write Up

2024-01-10 Thread John Casey via dev
The team at Google recently held an internal hackathon, and my hack
involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
up going anywhere, but I learned a lot about how our code generation works.
It turns out we have no documentation or design docs about our code
generation, so I wrote up what I learned.

Please take a look, and let me know if I got anything wrong, or if you are
looking for more detail.

s.apache.org/beam-bytebuddy-dofninvoker

John


Beam 2.54.0 Release

2024-01-10 Thread Robert Burke
Hey everyone, Happy New Year!
The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks
from today, according to the release calendar [1]. I'd like to perform this
release; I will cut the branch on that date, and cherrypick remaining
release-blocking fixes afterwards, if any.

Please help with the release by:

- Making sure that any unresolved release blocking issues have their
  "Milestone" marked as "2.54.0 Release" as soon as possible.
- Reviewing the current release blockers [2] and removing the Milestone if
  they don't meet the criteria at [3]. There are currently 8 release blockers.

Let me know if you have any comments/objections/questions.
Thanks, Robert Burke

[1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2] https://github.com/apache/beam/milestone/18
[3] https://beam.apache.org/contribute/release-blocking/


Re: [YAML] ReadFromKafka with yaml

2024-01-10 Thread Ferran Fernández Garrido
Hi Yarden,

If you are using Dataflow as a runner, you can already use
ReadFromKafka (introduced originally in version 2.52). Dataflow will
handle the expansion service automatically, so you don't have to do
anything.

If you want to run it locally for development purposes, you'll have to
build the Docker image. You can check out the project and run:

./gradlew :sdks:java:container:java8:docker
-Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
(where $DOCKER_ROOT is your container repository location)
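
For concreteness, a minimal sketch of that step with illustrative values
(the gcr.io/my-project/beam repository is hypothetical, and the
beam_java8_sdk image name is an assumption about the task's default output):

# Illustrative values only; point DOCKER_ROOT at your own registry.
export DOCKER_ROOT=gcr.io/my-project/beam
./gradlew :sdks:java:container:java8:docker \
    -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
# Push the image so remote workers can pull it.
docker push $DOCKER_ROOT/beam_java8_sdk:latest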

Then, for instance, if you want to run your custom Docker image in
Dataflow, you could do this:

(Build the Python SDK -> python setup.py sdist to get
apache-beam-2.53.0.dev0.tar.gz)
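
Spelled out, that build step might look like this (the sdks/python path is
where the Python SDK's setup.py lives in a Beam checkout; the exact dev
version in the tarball name depends on your branch):

# Run from the root of the Beam checkout.
cd sdks/python
python setup.py sdist
# The tarball lands in sdks/python/dist/ and is what --sdk_location points at.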

You'll have to build the expansion service that Kafka uses (in case
you've changed something in KafkaIO): ./gradlew
:sdks:java:io:expansion-service:build
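
If you want to sanity-check that expansion service on its own, a rough
sketch of starting it locally (the jar path, version, and port below are
assumptions about a local dev build, not from the original message):

java -jar sdks/java/io/expansion-service/build/libs/beam-sdks-java-io-expansion-service-2.53.0-SNAPSHOT.jar 8097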

python3 -m apache_beam.yaml.main --runner=DataflowRunner
--project=project_id --region=region --temp_location=temp_location
--pipeline_spec_file=yaml_pipeline.yml
--staging_location=staging_location
--sdk_location="path/apache-beam-2.53.0.dev0.tar.gz"
--sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest"
--streaming

This is an example of how to read JSON events from Kafka in Beam YAML:

- type: ReadFromKafka
  config:
    topic: 'TOPIC_NAME'
    format: JSON
    bootstrap_servers: 'BOOTSTRAP_SERVERS'
    schema: 'JSON_SCHEMA'
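
As a hypothetical sketch, that snippet might sit inside a full pipeline file
passed via --pipeline_spec_file like this (the chain wrapper and the
LogForTesting sink are illustrative additions, not part of the original
example):

pipeline:
  type: chain
  transforms:
    - type: ReadFromKafka
      config:
        topic: 'TOPIC_NAME'
        format: JSON
        bootstrap_servers: 'BOOTSTRAP_SERVERS'
        schema: 'JSON_SCHEMA'
    - type: LogForTesting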

Best,
Ferran

El mié, 10 ene 2024 a las 14:11, Yarden BenMoshe
() escribió:
>
> Hi,
>
> I am trying to consume a Kafka topic using the ReadFromKafka transform.
>
> If I got it right, since ReadFromKafka is originally written in Java, an
> expansion service is needed and the default environment is set to DOCKER. In
> the current implementation I can see that the expansion service field is not
> adjustable (I'm not able to pass it as part of the transform's config).
> Is there currently a way to use ReadFromKafka from a pipeline written with
> the YAML API? If so, an explanation would be much appreciated.
>
> I saw there's a workaround suggested online of using Docker-in-Docker,
> but I would prefer to avoid it.
>
> Thanks
> Yarden


[YAML] ReadFromKafka with yaml

2024-01-10 Thread Yarden BenMoshe
Hi,

I am trying to consume a Kafka topic using the ReadFromKafka transform.

If I got it right, since ReadFromKafka is originally written in Java, an
expansion service is needed and the default environment is set to DOCKER. In
the current implementation I can see that the expansion service field is not
adjustable (I'm not able to pass it as part of the transform's config).
Is there currently a way to use ReadFromKafka from a pipeline written with
the YAML API? If so, an explanation would be much appreciated.

I saw there's a workaround suggested online of using Docker-in-Docker,
but I would prefer to avoid it.

Thanks
Yarden


Beam High Priority Issue Report (51)

2024-01-10 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may 
cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29912 [Bug]: floatValueExtractor judge 
float and double equality directly
https://github.com/apache/beam/issues/29825 [Bug]: Usage of logical types 
breaks Beam YAML Sql
https://github.com/apache/beam/issues/29413 [Bug]: Can not use Avro over 1.8.2 
with Beam 2.52.0
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness 
doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github 
actions tests are failing due to update of pip 
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader 
provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28715 [Bug]: Python WriteToBigtable get 
stuck for large jobs due to client dead lock
https://github.com/apache/beam/issues/28383 [Failing Test]: 
org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing 
"beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: 
apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be 
leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not 
working when using CreateDisposition.CREATE_IF_NEEDED 
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. 
PeriodicImpulse) running in Flink and polling using tracker.defer_remainder 
have checkpoint size growing indefinitely 
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use 
applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with 
inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: 
bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when 
using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issue