Re: Introducing beam.MLTransform

2023-05-15 Thread Ahmet Altay via dev
Thank you for writing this, Anand! I think this is a good way to leverage
existing libraries and also a good usability improvement.

On Wed, May 10, 2023 at 11:04 AM XQ Hu via dev wrote:

> Agree with Danny. Thanks for writing this!
>
> On Wed, May 10, 2023 at 10:35 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Thanks Anand! I left a few comments, but overall I think this is a
>> great, well-constructed proposal. It is a good way to quickly bring a
>> suite of preprocessing operations to Beam.
>>
>> Thanks,
>> Danny
>>
>> On Tue, May 9, 2023 at 12:52 PM Anand Inguva via dev wrote:
>>
>>> Hi all,
>>>
>>> In Apache Beam, we plan to introduce a *beam.MLTransform* for carrying
>>> out common ML centric processing tasks.
>>>
>>> Using the tensorflow_transform as the backend, we will introduce several
>>> data processing transforms in Beam. These can be easily utilized by simply
>>> wrapping them with the beam.MLTransform. This approach not only simplifies
>>> the process but also creates a pathway for implementing a comprehensive set
>>> of ML-centric data processing transforms in Apache Beam.
>>>
>>> Please go through the design doc [1] for details and share your
>>> feedback.
>>>
>>> Thanks,
>>> Anand
>>>
>>> [1] https://docs.google.com/document/d/1rQkSm_8tseLqDQaLohtlCGqt5pvMaP0XIpPi5UD0LCQ/edit#
>>>
>>


[Benchmarks/Proposal] Loading models with multi_process_shared.py for RunInference

2023-05-15 Thread Danny McCormick via dev
Hey everyone,

Right now, using RunInference with large models and on GPUs has several
performance gaps. I put together a document focusing on one: when running
inference with large models, we often OOM because we load several copies of
the model at once. My document explores using the multi_process_shared.py
utility to load models, provides a couple of benchmarks, and finds that we
can recommend using the utility for pipelines which load a large model for
inference, but not for pipelines that normally don’t have memory issues.

Please take a look and let me know if you have any questions or concerns!

Doc -
https://docs.google.com/document/d/10xAIxu3W3wonFaLWXqneZ3CmOLaS1Z9dvn3eSDynDqE/edit?usp=sharing

Thanks.
Danny


Re: [DISCUSS] Idempotent initialization of file systems

2023-05-15 Thread Robert Bradshaw via dev
On Mon, May 15, 2023 at 8:38 AM Moritz Mack wrote:
>
> Hi all,
>
> I was just looking into an old issue again, SerializablePipelineOptions 
> calling FileSystems.setDefaultPipelineOptions on deserialization [1]. This 
> applies to various runners including Flink and Spark, but not Dataflow as far 
> as I know.
>
> Problem:
>
> Current initialization of FileSystems through 
> FileSystems.setDefaultPipelineOptions is rather problematic and prone to race 
> conditions, especially when triggered on deserialization of 
> SerializablePipelineOptions (see [1], [2], [3]).
>
> Even further, there’s also an inherent risk of leaking resources that way: 
> Without a well-defined lifecycle for file systems, existing ones are just 
> silently replaced on every invocation of 
> FileSystems.setDefaultPipelineOptions without adequately closing attached 
> resources. Particularly with S3FileSystem this is troublesome as it might 
> leak threads [4].
>
> Possible solutions:
>
> In the best case, pipeline options would be read-only as soon as a pipeline is
> running, making it simple and safe to initialize file systems just once per
> running pipeline on each worker. However, that’s likely not the case: some
> user pipelines, and even runners, do mutate pipeline options.

Would it be possible to validate this? E.g. what if we made it illegal
to call this more than once (or, at least, without a corresponding
unset operation that could be used for tests)?
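
A minimal sketch of that idea, written in Python for brevity (the Beam code under discussion is Java). The class and method names here are hypothetical and are not Beam APIs; the point is only the "set once, unless explicitly unset" contract:

```python
import threading


class DefaultOptionsRegistry:
    """Hypothetical set-once holder for default options; not part of Beam."""

    def __init__(self):
        self._lock = threading.Lock()
        self._options = None
        self._is_set = False

    def set_default_options(self, options):
        # Reject a second call unless the previous value was explicitly unset.
        with self._lock:
            if self._is_set:
                raise RuntimeError("default pipeline options are already set")
            self._options = dict(options)  # defensive copy
            self._is_set = True

    def unset_for_tests(self):
        # Escape hatch so tests can reinitialize between cases.
        with self._lock:
            self._options = None
            self._is_set = False

    def get(self):
        with self._lock:
            if not self._is_set:
                raise RuntimeError("default pipeline options were never set")
            return dict(self._options)
```

Failing loudly on a second call would surface every code path that re-initializes, which is one way to validate whether options are in fact mutated after a pipeline starts.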

> Also, as far as I can see, removing FileSystems.setDefaultPipelineOptions 
> from deserialization in SerializablePipelineOptions is unlikely to happen any 
> time soon as it requires a coordinated push across various runners and it’s 
> not obvious where and when initialization is supposed to happen for each 
> runner.
>
> With the above in mind, it would at least be possible to safely limit
> repeated initialization of file systems to the cases where it is necessary,
> by tracking the revision of pipeline options (using monotonically increasing
> revision numbers for every update); see this draft PR [5].

If the above is too hard to untangle, this could be a reasonable workaround.

> Happy to hear your thoughts or alternative approaches on this.

Thanks for taking this on.

> [1] https://github.com/apache/beam/issues/18430
>
> [2] https://issues.apache.org/jira/browse/BEAM-14465
>
> [3] https://issues.apache.org/jira/browse/BEAM-14355
>
> [4] https://github.com/apache/beam/issues/26321
>
> [5] https://github.com/apache/beam/pull/26694
>
> As a recipient of an email from the Talend Group, your personal data will be 
> processed by our systems. Please see our Privacy Notice for more information 
> about our collection and use of your personal information, our security 
> practices, and your data protection rights, including any rights you may have 
> to object to automated-decision making or profiling we use to analyze support 
> or marketing related communications. To manage or discontinue promotional 
> communications, use the communication preferences portal. To exercise your 
> data protection rights, use the privacy request form. Contact us here or by 
> mail to either of our co-headquarters: Talend, Inc.: 400 South El Camino 
> Real, Ste 1400, San Mateo, CA 94402; Talend SAS: 5/7 rue Salomon De 
> Rothschild, 92150 Suresnes, France


[ANNOUNCE] Beam 2.47.0 Released

2023-05-15 Thread Jack McCluskey
The Apache Beam Team is pleased to announce the release of version 2.47.0.

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam Blog: https://beam.apache.org/blog/beam-2.47.0/ and the GitHub release
page: https://github.com/apache/beam/releases/tag/v2.47.0

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.47.0.

-- Jack, on behalf of the Apache Beam Team.


[DISCUSS] Idempotent initialization of file systems

2023-05-15 Thread Moritz Mack
Hi all,

I was just looking into an old issue again, SerializablePipelineOptions calling 
FileSystems.setDefaultPipelineOptions on deserialization [1]. This applies to 
various runners including Flink and Spark, but not Dataflow as far as I know.

Problem:

Current initialization of FileSystems through 
FileSystems.setDefaultPipelineOptions is rather problematic and prone to race 
conditions, especially when triggered on deserialization of 
SerializablePipelineOptions (see [1], [2], [3]).

Even further, there’s also an inherent risk of leaking resources that way: 
Without a well-defined lifecycle for file systems, existing ones are just 
silently replaced on every invocation of FileSystems.setDefaultPipelineOptions 
without adequately closing attached resources. Particularly with S3FileSystem 
this is troublesome as it might leak threads [4].

Possible solutions:

In the best case, pipeline options would be read-only as soon as a pipeline is
running, making it simple and safe to initialize file systems just once per
running pipeline on each worker. However, that’s likely not the case: some
user pipelines, and even runners, do mutate pipeline options.

Also, as far as I can see, removing FileSystems.setDefaultPipelineOptions from 
deserialization in SerializablePipelineOptions is unlikely to happen any time 
soon as it requires a coordinated push across various runners and it’s not 
obvious where and when initialization is supposed to happen for each runner.

With the above in mind, it would at least be possible to safely limit repeated
initialization of file systems to the cases where it is necessary, by tracking
the revision of pipeline options (using monotonically increasing revision
numbers for every update); see this draft PR [5].
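
One possible reading of the revision-tracking idea, sketched in Python for brevity. This is not the code from the draft PR, and every name below is hypothetical: each update to the options takes the next value from a shared counter, and the registry reinitializes only when it sees a revision newer than the one it last applied.

```python
import itertools
import threading


class PipelineOptions:
    """Hypothetical options holder that bumps a revision on every update."""

    _revisions = itertools.count(1)  # shared, monotonically increasing
    _revision_lock = threading.Lock()

    def __init__(self, **values):
        self._values = dict(values)
        self.revision = self._next_revision()

    @classmethod
    def _next_revision(cls):
        with cls._revision_lock:
            return next(cls._revisions)

    def set(self, key, value):
        self._values[key] = value
        self.revision = self._next_revision()


class FileSystemRegistry:
    """Hypothetical registry that reinitializes only for newer revisions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._revision = 0
        self.init_count = 0

    def set_default_options(self, options):
        with self._lock:
            if options.revision <= self._revision:
                return False  # same or older options: nothing to do
            # A real implementation would close the old file systems here
            # before replacing them, to avoid leaking threads or connections.
            self._revision = options.revision
            self.init_count += 1
            return True
```

Under these assumptions, repeated calls with unchanged options (for example on every deserialization) become no-ops, while a genuine update still triggers exactly one reinitialization.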

Happy to hear your thoughts or alternative approaches on this.

Best,
Moritz

[1] https://github.com/apache/beam/issues/18430
[2] https://issues.apache.org/jira/browse/BEAM-14465
[3] https://issues.apache.org/jira/browse/BEAM-14355
[4] https://github.com/apache/beam/issues/26321
[5] https://github.com/apache/beam/pull/26694



Beam High Priority Issue Report (32)

2023-05-15 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/26621 [Failing Test]: beam_PerformanceTests_SparkReceiver_IO failing
https://github.com/apache/beam/issues/26616 [Failing Test]: beam_PostCommit_Java_DataflowV2 SpannerReadIT multiple test failing
https://github.com/apache/beam/issues/26550 [Failing Test]: beam_PostCommit_Java_PVR_Spark_Batch
https://github.com/apache/beam/issues/26547 [Failing Test]: beam_PostCommit_Java_DataflowV2
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944 beam_PreCommit_Python_Cron regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakes in org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit test action StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial (order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it flakey
https://github.com/apache/beam/issues/21104 Flaky: apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics is flaky
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19465 Explore possibilities to lower in-use IP address quota footprint.


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder will drop message id and orderingKey