Re: [YAML] ReadFromKafka with yaml
Thanks for the detailed answer. I forgot to mention that I am using FlinkRunner as my setup. Will this work with this runner as well?

On 2024/01/10 13:34:28 Ferran Fernández Garrido wrote:
> Hi Yarden,
>
> If you are using Dataflow as a runner, you can already use
> ReadFromKafka (introduced originally in version 2.52). Dataflow will
> handle the expansion service automatically, so you don't have to do
> anything.
>
> If you want to run it locally for development purposes, you'll have to
> build the Docker image. You can check out the project and run:
>
> ./gradlew :sdks:java:container:java8:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest
> ($DOCKER_ROOT -> repo location)
>
> Then, for instance, if you want to run your custom Docker image in
> Dataflow, you could do this:
>
> (Build the Python SDK -> python setup.py sdist to get
> apache-beam-2.53.0.dev0.tar.gz)
>
> You'll have to build the expansion service that Kafka uses (in case
> you've changed something in the KafkaIO):
>
> ./gradlew :sdks:java:io:expansion-service:build
>
> python3 -m apache_beam.yaml.main --runner=DataflowRunner \
>   --project=project_id --region=region --temp_location=temp_location \
>   --pipeline_spec_file=yaml_pipeline.yml \
>   --staging_location=staging_location \
>   --sdk_location="path/apache-beam-2.53.0.dev0.tar.gz" \
>   --sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest" \
>   --streaming
>
> This is an example of how to read JSON events from Kafka in Beam YAML:
>
> - type: ReadFromKafka
>   config:
>     topic: 'TOPIC_NAME'
>     format: JSON
>     bootstrap_servers: 'BOOTSTRAP_SERVERS'
>     schema: 'JSON_SCHEMA'
>
> Best,
> Ferran
>
> On Wed, Jan 10, 2024 at 14:11, Yarden BenMoshe () wrote:
> >
> > Hi,
> >
> > I am trying to consume a kafka topic using ReadFromKafka transform.
> >
> > If i got it right, since ReadFromKafka is originally written in java, an
> > expansion service is needed and default env is set to DOCKER, and in current
> > implementation I can see that expansion service field is not adjustable (im
> > not able to pass it as part of the transform's config).
> > Is there currently a way to ReadFromKafka from a pipeline written with yaml
> > api? If so, an explanation would be much appreciated.
> >
> > I saw there's some workaround suggested online of using Docker-in-Docker but
> > would prefer to avoid it.
> >
> > Thanks
> > Yarden
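For the FlinkRunner question above, a sketch of an equivalent submission command, adapted from the Dataflow example in the quoted reply. The exact flag set accepted for Flink is an assumption (standard Beam pipeline options), not something confirmed in this thread:

```shell
# Hypothetical adaptation of the quoted Dataflow command to FlinkRunner.
# Assumes the Kafka expansion-service jar has been built first, as in the thread:
#   ./gradlew :sdks:java:io:expansion-service:build
python3 -m apache_beam.yaml.main \
  --runner=FlinkRunner \
  --pipeline_spec_file=yaml_pipeline.yml \
  --streaming
```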
Re: Beam 2.54.0 Release
Not sure why newlines were eaten. Hopefully reflowed inline below.

On 2024/01/10 17:53:56 Robert Burke wrote:
> Hey everyone, Happy New Year!
>
> The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks
> from today, according to the release calendar [1]. I'd like to perform this
> release; I will cut the branch on that date, and cherrypick remaining
> release-blocking fixes afterwards, if any.
>
> Please help with the release by:
>
> - Making sure that any unresolved release-blocking issues have
>   their "Milestone" marked as "2.54.0 Release" as soon as possible.
> - Reviewing the current release blockers [2] and removing the Milestone if they
>   don't meet the criteria at [3]. There are currently 8 release blockers.
>
> Let me know if you have any comments/objections/questions.
>
> Thanks,
> Robert Burke
>
> [1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/18
> [3] https://beam.apache.org/contribute/release-blocking/
Re: ByteBuddy DoFnInvokers Write Up
That's neat! Thanks for writing that up!

On Wed, Jan 10, 2024, 11:12 AM John Casey via dev wrote:
> The team at Google recently held an internal hackathon, and my hack
> involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
> up going anywhere, but I learned a lot about how our code generation works.
> It turns out we have no documentation or design docs about our code
> generation, so I wrote up what I learned.
>
> Please take a look, and let me know if I got anything wrong, or if you are
> looking for more detail:
>
> s.apache.org/beam-bytebuddy-dofninvoker
>
> John
ByteBuddy DoFnInvokers Write Up
The team at Google recently held an internal hackathon, and my hack involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end up going anywhere, but I learned a lot about how our code generation works. It turns out we have no documentation or design docs about our code generation, so I wrote up what I learned.

Please take a look, and let me know if I got anything wrong, or if you are looking for more detail:

s.apache.org/beam-bytebuddy-dofninvoker

John
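For readers unfamiliar with why Beam generates DoFnInvokers at all: the generated class gives the runner a direct, monomorphic call into the user's annotated DoFn methods instead of a reflective lookup on every element. Beam's real implementation generates Java bytecode with ByteBuddy; as a loose conceptual analogy only (all names here are hypothetical, nothing below is Beam code), the same idea in Python looks like resolving the method once and closing over it:

```python
# Conceptual analogy of a generated DoFnInvoker: pay the method-lookup
# cost once, up front, so the per-element hot path is a plain direct call.

class MyDoFn:
    """Stand-in for a user DoFn with a process-element method."""
    def process(self, element):
        return element * 2

def make_invoker(dofn):
    # One-time "generation" step: resolve the bound method here, outside
    # the per-element loop, and close over it.
    process = dofn.process

    def invoke(element):
        # Hot path: direct call, no per-element attribute lookup.
        return process(element)

    return invoke

invoker = make_invoker(MyDoFn())
print(invoker(21))  # prints 42
```

The design point carried over from the write-up's subject: the expensive introspection (reflection in Java, attribute lookup here) happens once when the invoker is built, not once per element.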
Beam 2.54.0 Release
Hey everyone, Happy New Year!

The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks from today, according to the release calendar [1]. I'd like to perform this release; I will cut the branch on that date, and cherrypick remaining release-blocking fixes afterwards, if any.

Please help with the release by:

- Making sure that any unresolved release-blocking issues have their "Milestone" marked as "2.54.0 Release" as soon as possible.
- Reviewing the current release blockers [2] and removing the Milestone if they don't meet the criteria at [3]. There are currently 8 release blockers.

Let me know if you have any comments/objections/questions.

Thanks,
Robert Burke

[1] https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2] https://github.com/apache/beam/milestone/18
[3] https://beam.apache.org/contribute/release-blocking/
Re: [YAML] ReadFromKafka with yaml
Hi Yarden,

If you are using Dataflow as a runner, you can already use ReadFromKafka (introduced originally in version 2.52). Dataflow will handle the expansion service automatically, so you don't have to do anything.

If you want to run it locally for development purposes, you'll have to build the Docker image. You can check out the project and run:

./gradlew :sdks:java:container:java8:docker -Pdocker-repository-root=$DOCKER_ROOT -Pdocker-tag=latest

($DOCKER_ROOT -> repo location)

Then, for instance, if you want to run your custom Docker image in Dataflow, you could do this:

(Build the Python SDK -> python setup.py sdist to get apache-beam-2.53.0.dev0.tar.gz)

You'll have to build the expansion service that Kafka uses (in case you've changed something in the KafkaIO):

./gradlew :sdks:java:io:expansion-service:build

python3 -m apache_beam.yaml.main --runner=DataflowRunner \
  --project=project_id --region=region --temp_location=temp_location \
  --pipeline_spec_file=yaml_pipeline.yml \
  --staging_location=staging_location \
  --sdk_location="path/apache-beam-2.53.0.dev0.tar.gz" \
  --sdk_harness_container_image_overrides=".*java.*,$DOCKER_ROOT:latest" \
  --streaming

This is an example of how to read JSON events from Kafka in Beam YAML:

- type: ReadFromKafka
  config:
    topic: 'TOPIC_NAME'
    format: JSON
    bootstrap_servers: 'BOOTSTRAP_SERVERS'
    schema: 'JSON_SCHEMA'

Best,
Ferran

On Wed, Jan 10, 2024 at 14:11, Yarden BenMoshe () wrote:
>
> Hi,
>
> I am trying to consume a kafka topic using ReadFromKafka transform.
>
> If i got it right, since ReadFromKafka is originally written in java, an
> expansion service is needed and default env is set to DOCKER, and in current
> implementation I can see that expansion service field is not adjustable (im
> not able to pass it as part of the transform's config).
> Is there currently a way to ReadFromKafka from a pipeline written with yaml
> api? If so, an explanation would be much appreciated.
>
> I saw there's some workaround suggested online of using Docker-in-Docker but
> would prefer to avoid it.
>
> Thanks
> Yarden
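Putting that transform snippet into context, a complete minimal Beam YAML pipeline file might look like the sketch below. The `pipeline:`/`chain` structure and the `LogForTesting` sink are assumptions drawn from Beam YAML's built-ins, not from this thread, and the config values are placeholders:

```yaml
# Minimal sketch of a full yaml_pipeline.yml using the ReadFromKafka
# config from above. Values in CAPS are placeholders.
pipeline:
  type: chain
  transforms:
    - type: ReadFromKafka
      config:
        topic: 'TOPIC_NAME'
        format: JSON
        bootstrap_servers: 'BOOTSTRAP_SERVERS'
        schema: 'JSON_SCHEMA'
    - type: LogForTesting   # assumed sink, for development only
```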
[YAML] ReadFromKafka with yaml
Hi,

I am trying to consume a Kafka topic using the ReadFromKafka transform.

If I got it right, since ReadFromKafka is originally written in Java, an expansion service is needed and the default environment is set to DOCKER, and in the current implementation I can see that the expansion service field is not adjustable (I'm not able to pass it as part of the transform's config).

Is there currently a way to use ReadFromKafka from a pipeline written with the YAML API? If so, an explanation would be much appreciated.

I saw there's a workaround suggested online using Docker-in-Docker, but I would prefer to avoid it.

Thanks,
Yarden
Beam High Priority Issue Report (51)
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/29926 [Bug]: FileIO: lack of timeouts may cause the pipeline to get stuck indefinitely
https://github.com/apache/beam/issues/29912 [Bug]: floatValueExtractor judge float and double equality directly
https://github.com/apache/beam/issues/29825 [Bug]: Usage of logical types breaks Beam YAML Sql
https://github.com/apache/beam/issues/29413 [Bug]: Can not use Avro over 1.8.2 with Beam 2.52.0
https://github.com/apache/beam/issues/29099 [Bug]: FnAPI Java SDK Harness doesn't update user counters in OnTimer callback functions
https://github.com/apache/beam/issues/29022 [Failing Test]: Python Github actions tests are failing due to update of pip
https://github.com/apache/beam/issues/28760 [Bug]: EFO Kinesis IO reader provided by apache beam does not pick the event time for watermarking
https://github.com/apache/beam/issues/28715 [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock
https://github.com/apache/beam/issues/28383 [Failing Test]: org.apache.beam.runners.dataflow.worker.StreamingDataflowWorkerTest.testMaxThreadMetric
https://github.com/apache/beam/issues/28339 Fix failing "beam_PostCommit_XVR_GoUsingJava_Dataflow" job
https://github.com/apache/beam/issues/28326 Bug: apache_beam.io.gcp.pubsublite.ReadFromPubSubLite not working
https://github.com/apache/beam/issues/28142 [Bug]: [Go SDK] Memory seems to be leaking on 2.49.0 with Dataflow
https://github.com/apache/beam/issues/27892 [Bug]: ignoreUnknownValues not working when using CreateDisposition.CREATE_IF_NEEDED
https://github.com/apache/beam/issues/27648 [Bug]: Python SDFs (e.g. PeriodicImpulse) running in Flink and polling using tracker.defer_remainder have checkpoint size growing indefinitely
https://github.com/apache/beam/issues/27616 [Bug]: Unable to use applyRowMutations() in bigquery IO apache beam java
https://github.com/apache/beam/issues/27486 [Bug]: Read from datastore with inequality filters
https://github.com/apache/beam/issues/27314 [Failing Test]: bigquery.StorageApiSinkCreateIfNeededIT.testCreateManyTables[1]
https://github.com/apache/beam/issues/27238 [Bug]: Window trigger has lag when using Kafka and GroupByKey on Dataflow Runner
https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested ROW (described below)
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issue