Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Chamikara Jayalath via dev
Nvm, I was running the Kafka cluster and the job in two different projects.
It's working as expected.

+1 (binding) for the release.

Thanks,
Cham


On Tue, May 30, 2023 at 6:09 PM Chamikara Jayalath 
wrote:

> I'm seeing a potential regression when running Python x-lang Kafka jobs on
> Dataflow.
>
>
> https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-30_16_31_32-1219154560944228293;step=;mainTab=JOB_GRAPH;bottomTab=JOB_LOGS;logsSeverity=INFO;graphView=0?project=google.com:clouddfe=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))
>
> "Topic kafka_taxirides_realtime not present in metadata after 6 ms"
>
> Currently not sure if this is due to my Kafka cluster setup or not.
>
> Thanks,
> Cham
>
>
>
>
>
> On Tue, May 30, 2023 at 5:52 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 (binding)
>>
>> On Tue, May 30, 2023 at 5:42 PM Robert Bradshaw 
>> wrote:
>>
>>> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Thanks Danny and Jack! Dataflow containers are up!

 Only PMC votes count but feel free to test your use cases and vote on
 this thread!

>>>
>>> While we need at least 3 affirmative PMC votes to formally do a release,
>>> it is definitely the case that all votes are valuable input and are taken
>>> into consideration when deciding to do so.
>>>
>>>
 On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
 aromanenko@gmail.com> wrote:

> +1 (binding)
>
> Tested with  https://github.com/Talend/beam-samples/
> (Java SDK v8/v11/v17, Spark 3.x runner).
>
> On 27 May 2023, at 19:38, Bruno Volpato via dev 
> wrote:
>
> I was able to check that containers are all there and complete
> my validation.
>
> +1 (non-binding).
>
> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
> SDK 11, Dataflow runner).
>
>
> Thanks Ritesh and Danny!
>
> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> It looks like some Dataflow containers didn't get published, so some
>> jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
>> the container release, so that should hopefully be available later today.
>>
>> Thanks,
>> Danny
>>
>> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #2 for the version
>>> 2.48.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> Reviewers are encouraged to test their own use cases with the
>>> release candidate, and vote +1 if no issues are found. Only PMC member
>>> votes will count towards the final vote, but votes from all community
>>> members are encouraged and helpful for finding regressions; you can 
>>> either
>>> test your own use cases or use cases from the validation sheet [10].
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * GitHub Release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.48.0-RC2" [5],
>>> * website pull request listing the release [6], the blog post [6],
>>> and publishing the API reference manual [7] (to be generated).
>>> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
>>> 8.0.322.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI[8].
>>> * Go artifacts and documentation are available at pkg.go.dev [9]
>>> * Validation sheet with a tab for 2.48.0 release to help with
>>> validation [10].
>>> * Docker images published to Docker Hub [11].
>>> * PR to run tests against release branch [12].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>>
>>> For guidelines on how to try the release in your projects, check out
>>> our blog post at /blog/validate-beam-release/.
>>>
>>> *NOTE: Dataflow containers for Python are not finalized yet (likely
>>> to happen on Tuesday). I will follow up on this thread once that is 
>>> done.
>>> Feel free to test it on other runners until then. *
>>>
>>> Thanks,
>>> Ritesh Ghorse
>>>
>>> [1] https://github.com/apache/beam/milestone/12
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> 

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Sure, I get that… though perhaps we should consider just going to something
Avro-based for portable coding rather than something custom.
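
For reference, Beam already ships a bridge between its Row schemas and Avro,
so the Avro direction is not far off. A minimal sketch of round-tripping a
Row through Avro's GenericRecord, assuming the AvroUtils helpers (the import
path differs between Beam versions that keep Avro in core and those that
moved it to the extension module):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.extensions.avro.schemas.utils.AvroUtils;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class AvroRowBridge {
  public static void main(String[] args) {
    // A simple Beam schema with two portable field types.
    Schema beamSchema =
        Schema.builder().addStringField("name").addInt64Field("count").build();
    Row row = Row.withSchema(beamSchema).addValues("taxi", 42L).build();

    // Convert the Beam schema and row to their Avro equivalents; the
    // resulting GenericRecord can be encoded with standard Avro machinery.
    org.apache.avro.Schema avroSchema = AvroUtils.toAvroSchema(beamSchema);
    GenericRecord record = AvroUtils.toGenericRecord(row, avroSchema);
    System.out.println(avroSchema);
    System.out.println(record);
  }
}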

On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath 
wrote:

> Input/output PCollection types at least have to be portable Beam types [1]
> for cross-language to work.
>
> I think we restricted schema-aware transforms to PCollection<Row> since
> Row was expected to be an efficient replacement for arbitrary portable Beam
> types (not sure how true that is in practice currently).
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>
> On Tue, May 30, 2023 at 1:54 PM Byron Ellis  wrote:
>
>> Is it actually necessary for a PTransform that is configured via the
>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>> separate concerns and unnecessarily limiting.
>>
>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
>> wrote:
>>
>>> +1 for the simplification.
>>>
>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
>>> wrote:
>>>
 Yeah. Essentially one needs to do (1) name the arguments and (2) implement
 the transform. Hopefully (1) could be done in a concise way that allows for
 easy consumption from both Java and cross-language.

>>>
>>> +1 but I think the hard part today is to convert existing PTransforms to
>>> be schema-aware transform compatible (for example, change input/output
>>> types and make sure parameters take Beam Schema compatible types). But this
>>> makes sense for new transforms.
>>>
>>>
>>>
 On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
 wrote:

> Or perhaps the other way around? If you have a Schema we can
> auto-generate the associated builder on the PTransform? Either way, more
> DRY.
>
> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to this simplification, it's a historical artifact that provides
>> no value.
>>
>> I would love it if we also looked into ways to auto-generate the
>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>> small number of arguments, or uses the standard builder pattern...),
>> ideally with something as simple as adding a decorator to the PTransform
>> itself.
>>
>>
>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I was looking at how we use SchemaTransforms in our expansion
>>> service. From what I see, there may be a redundant step in developing
>>> SchemaTransforms. Currently, we have 3 pieces:
>>> - SchemaTransformProvider [1]
>>> - A configuration object
>>> - SchemaTransform [2]
>>>
>>> The API is generally used like this:
>>> 1. The SchemaTransformProvider takes a configuration object and
>>> returns a SchemaTransform
>>> 2. The SchemaTransform is used to build a PTransform according to
>>> the configuration
>>>
>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>> combine the two steps if we have SchemaTransformProvider return the
>>> PTransform directly.
>>>
>>> We can then remove the SchemaTransform class as it will be obsolete.
>>> This should be safe to do; the only place it's used in our API is here 
>>> [3],
>>> and that can be simplified if we make this change (we'd just trim `
>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>> directly return the PTransform).
>>>
>>> I'd like to first mention that I was not involved in the design
>>> process of this API so I may be missing some information on why it was 
>>> set
>>> up this way.
>>>
>>> A few developers already raised questions about how there's seemingly
>>> unnecessary boilerplate involved in making a Java transform portable. My
>>> assumption is this was designed to follow the pattern of the previous
>>> iteration of this API (SchemaIO): SchemaIOProvider[4] -> SchemaIO[5] ->
>>> PTransform. However, with the
>>> newer SchemaTransformProvider API, we dropped a few methods and reduced 
>>> the
>>> SchemaTransform class to have a generic buildTransform() method. See the
>>> example of PubsubReadSchemaTransformProvider [6], where the
>>> SchemaTransform interface and buildTransform method are implemented
>>> just to satisfy the requirement that SchemaTransformProvider::from
>>> return a SchemaTransform.
>>>
>>> I'm bringing this up because if we are looking to encourage
>>> contribution to cross-language use cases, we should make it simpler and
>>> less convoluted to develop portable transforms.
>>>
>>> 
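
A minimal sketch of the simplification being proposed, based on the
expansion-service call site in [3]; `provider` and `configRow` are
placeholders, and the commented-out variant shows the hypothetical
post-change signature, not today's API:

import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollectionRowTuple;
import org.apache.beam.sdk.values.Row;

class ExpansionCallSite {
  // Today: two hops. The provider returns a SchemaTransform, which in turn
  // builds the actual PTransform.
  static PTransform<PCollectionRowTuple, PCollectionRowTuple> today(
      SchemaTransformProvider provider, Row configRow) {
    return provider.from(configRow).buildTransform();
  }

  // Proposed: from() would return the PTransform directly, making the
  // intermediate SchemaTransform class and buildTransform() unnecessary:
  //
  //   static PTransform<PCollectionRowTuple, PCollectionRowTuple> proposed(
  //       SchemaTransformProvider provider, Row configRow) {
  //     return provider.from(configRow);  // hypothetical post-change signature
  //   }
}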

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Chamikara Jayalath via dev
I'm seeing a potential regression when running Python x-lang Kafka jobs on
Dataflow.

https://pantheon.corp.google.com/dataflow/jobs/us-central1/2023-05-30_16_31_32-1219154560944228293;step=;mainTab=JOB_GRAPH;bottomTab=JOB_LOGS;logsSeverity=INFO;graphView=0?project=google.com:clouddfe=(%22dfTime%22:(%22l%22:%22dfJobMaxTime%22))

"Topic kafka_taxirides_realtime not present in metadata after 6 ms"

Currently not sure if this is due to my Kafka cluster setup or not.
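
One quick way to rule out a cluster-side cause for a "not present in
metadata" error is to list topics with the Kafka admin client from the same
project/network the job runs in (a mismatch there turned out to be the issue
here). A minimal sketch, assuming the kafka-clients library and a placeholder
broker address:

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TopicCheck {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Placeholder address; it must be reachable from the project/network
    // where the pipeline's workers actually run.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      Set<String> topics = admin.listTopics().names().get();
      System.out.println(topics.contains("kafka_taxirides_realtime")
          ? "topic visible" : "topic missing from metadata");
    }
  }
}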

Thanks,
Cham





On Tue, May 30, 2023 at 5:52 PM Robert Bradshaw via dev 
wrote:

> +1 (binding)
>
> On Tue, May 30, 2023 at 5:42 PM Robert Bradshaw 
> wrote:
>
>> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Thanks Danny and Jack! Dataflow containers are up!
>>>
>>> Only PMC votes count but feel free to test your use cases and vote on
>>> this thread!
>>>
>>
>> While we need at least 3 affirmative PMC votes to formally do a release,
>> it is definitely the case that all votes are valuable input and are taken
>> into consideration when deciding to do so.
>>
>>
>>> On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 +1 (binding)

 Tested with  https://github.com/Talend/beam-samples/
 (Java SDK v8/v11/v17, Spark 3.x runner).

 On 27 May 2023, at 19:38, Bruno Volpato via dev 
 wrote:

 I was able to check that containers are all there and complete
 my validation.

 +1 (non-binding).

 Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
 SDK 11, Dataflow runner).


 Thanks Ritesh and Danny!

 On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> It looks like some Dataflow containers didn't get published, so some
> jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
> the container release, so that should hopefully be available later today.
>
> Thanks,
> Danny
>
> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
> dev@beam.apache.org> wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #2 for the version
>> 2.48.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> Reviewers are encouraged to test their own use cases with the release
>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>> count towards the final vote, but votes from all community members are
>> encouraged and helpful for finding regressions; you can either test your
>> own use cases or use cases from the validation sheet [10].
>>
>> The complete staging area is available for your review, which
>> includes:
>> * GitHub Release notes [1],
>> * the official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with fingerprint
>> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.48.0-RC2" [5],
>> * website pull request listing the release [6], the blog post [6],
>> and publishing the API reference manual [7] (to be generated).
>> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
>> 8.0.322.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2] and PyPI[8].
>> * Go artifacts and documentation are available at pkg.go.dev [9]
>> * Validation sheet with a tab for 2.48.0 release to help with
>> validation [10].
>> * Docker images published to Docker Hub [11].
>> * PR to run tests against release branch [12].
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> For guidelines on how to try the release in your projects, check out
>> our blog post at /blog/validate-beam-release/.
>>
>> *NOTE: Dataflow containers for Python are not finalized yet (likely
>> to happen on tuesday). I will follow up on this thread once that is done.
>> Feel free to test it on other runners until then. *
>>
>> Thanks,
>> Ritesh Ghorse
>>
>> [1] https://github.com/apache/beam/milestone/12
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1346/
>> [5] https://github.com/apache/beam/tree/v2.48.0-RC2
>> [6] https://github.com/apache/beam/pull/26903
>> [7] https://github.com/apache/beam-site/pull/645
>> [8] https://pypi.org/project/apache-beam/2.48.0rc2/
>> [9]
>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
>> 

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Robert Bradshaw via dev
+1 (binding)

On Tue, May 30, 2023 at 5:42 PM Robert Bradshaw  wrote:

> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev 
> wrote:
>
>> Thanks Danny and Jack! Dataflow containers are up!
>>
>> Only PMC votes count but feel free to test your use cases and vote on
>> this thread!
>>
>
> While we need at least 3 affirmative PMC votes to formally do a release,
> it is definitely the case that all votes are valuable input and are taken
> into consideration when deciding to do so.
>
>
>> On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Tested with  https://github.com/Talend/beam-samples/
>>> (Java SDK v8/v11/v17, Spark 3.x runner).
>>>
>>> On 27 May 2023, at 19:38, Bruno Volpato via dev 
>>> wrote:
>>>
>>> I was able to check that containers are all there and complete
>>> my validation.
>>>
>>> +1 (non-binding).
>>>
>>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
>>> SDK 11, Dataflow runner).
>>>
>>>
>>> Thanks Ritesh and Danny!
>>>
>>> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 It looks like some Dataflow containers didn't get published, so some
 jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
 the container release, so that should hopefully be available later today.

 Thanks,
 Danny

 On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
 dev@beam.apache.org> wrote:

> Hi everyone,
> Please review and vote on the release candidate #2 for the version
> 2.48.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found. Only PMC member votes will
> count towards the final vote, but votes from all community members are
> encouraged and helpful for finding regressions; you can either test your
> own use cases or use cases from the validation sheet [10].
>
> The complete staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2],
> which is signed with the key with fingerprint
> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.48.0-RC2" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7] (to be generated).
> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
> 8.0.322.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> * Go artifacts and documentation are available at pkg.go.dev [9]
> * Validation sheet with a tab for 2.48.0 release to help with
> validation [10].
> * Docker images published to Docker Hub [11].
> * PR to run tests against release branch [12].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> For guidelines on how to try the release in your projects, check out
> our blog post at /blog/validate-beam-release/.
>
> *NOTE: Dataflow containers for Python are not finalized yet (likely to
> happen on Tuesday). I will follow up on this thread once that is done. 
> Feel
> free to test it on other runners until then. *
>
> Thanks,
> Ritesh Ghorse
>
> [1] https://github.com/apache/beam/milestone/12
> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1346/
> [5] https://github.com/apache/beam/tree/v2.48.0-RC2
> [6] https://github.com/apache/beam/pull/26903
> [7] https://github.com/apache/beam-site/pull/645
> [8] https://pypi.org/project/apache-beam/2.48.0rc2/
> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
> 
> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
> [12] https://github.com/apache/beam/pull/26811
>
>
>>>


Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Robert Bradshaw via dev
On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev 
wrote:

> Thanks Danny and Jack! Dataflow containers are up!
>
> Only PMC votes count but feel free to test your use cases and vote on this
> thread!
>

While we need at least 3 affirmative PMC votes to formally do a release, it
is definitely the case that all votes are valuable input and are taken into
consideration when deciding to do so.


> On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> +1 (binding)
>>
>> Tested with  https://github.com/Talend/beam-samples/
>> (Java SDK v8/v11/v17, Spark 3.x runner).
>>
>> On 27 May 2023, at 19:38, Bruno Volpato via dev 
>> wrote:
>>
>> I was able to check that containers are all there and complete
>> my validation.
>>
>> +1 (non-binding).
>>
>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
>> SDK 11, Dataflow runner).
>>
>>
>> Thanks Ritesh and Danny!
>>
>> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> It looks like some Dataflow containers didn't get published, so some
>>> jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
>>> the container release, so that should hopefully be available later today.
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Hi everyone,
 Please review and vote on the release candidate #2 for the version
 2.48.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 Reviewers are encouraged to test their own use cases with the release
 candidate, and vote +1 if no issues are found. Only PMC member votes will
 count towards the final vote, but votes from all community members are
 encouraged and helpful for finding regressions; you can either test your
 own use cases or use cases from the validation sheet [10].

 The complete staging area is available for your review, which includes:
 * GitHub Release notes [1],
 * the official Apache source release to be deployed to dist.apache.org [2],
 which is signed with the key with fingerprint
 E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.48.0-RC2" [5],
 * website pull request listing the release [6], the blog post [6], and
 publishing the API reference manual [7] (to be generated).
 * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
 8.0.322.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2] and PyPI[8].
 * Go artifacts and documentation are available at pkg.go.dev [9]
 * Validation sheet with a tab for 2.48.0 release to help with
 validation [10].
 * Docker images published to Docker Hub [11].
 * PR to run tests against release branch [12].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 For guidelines on how to try the release in your projects, check out
 our blog post at /blog/validate-beam-release/.

 *NOTE: Dataflow containers for Python are not finalized yet (likely to
 happen on Tuesday). I will follow up on this thread once that is done. Feel
 free to test it on other runners until then. *

 Thanks,
 Ritesh Ghorse

 [1] https://github.com/apache/beam/milestone/12
 [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1346/
 [5] https://github.com/apache/beam/tree/v2.48.0-RC2
 [6] https://github.com/apache/beam/pull/26903
 [7] https://github.com/apache/beam-site/pull/645
 [8] https://pypi.org/project/apache-beam/2.48.0rc2/
 [9]
 https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
 
 [10]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
 [11] https://hub.docker.com/search?q=apache%2Fbeam=image
 [12] https://github.com/apache/beam/pull/26811


>>


Re: Client-Side Throttling in Apache Beam

2023-05-30 Thread Ahmet Altay via dev
Thank you. +1 to adding this to wiki.

FYI - @Damon Douglas  shared a related doc earlier with alternative ideas
for how Beam could prevent overloading external services. (
https://docs.google.com/document/d/1VZ9YphDO7kewBSz5oMXVPHWaib3S03Z6aZ66BhciB3E/edit?usp=sharing=0-ItxMSG72EzfSwVedSz-Zeg
)

On Tue, May 30, 2023 at 3:52 PM Robert Burke  wrote:

> Great article!
>
> Though it's depressing to see we have a pair of magic counter names to
> help modulate scaling behavior.
>
> On Tue, May 30, 2023, 11:42 AM Jack McCluskey via dev 
> wrote:
>
>> Hey everyone,
>>
>> While working on some remote model handler code I hit a point where I
>> needed to understand how Beam IOs interpret and act on being throttled
>> by an external service. This turned into a few discussions and then a small
>> write-up doc (
>> https://docs.google.com/document/d/1ePorJGZnLbNCmLD9mR7iFYOdPsyDA1rDnTpYnbdrzSU/edit?usp=sharing)
>> to encapsulate the basics of what I learned. If you're familiar with this
>> topic feel free to make suggestions on the doc, I'm intending to add this
>> to the wiki so there's a resource for how this works in the future!
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> --
>>
>>
>> Jack McCluskey
>> SWE - DataPLS PLAT/ Dataflow ML
>> RDU
>> jrmcclus...@google.com
>>
>>
>>


Re: Client-Side Throttling in Apache Beam

2023-05-30 Thread Robert Burke
Great article!

Though it's depressing to see we have a pair of magic counter names to help
modulate scaling behavior.
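
For context, the counters in question are ordinary Beam metrics with
specially recognized names. A minimal sketch of how an IO might report time
spent throttled, assuming the "throttling-msecs" counter name (the exact
magic names are the part being lamented, so treat the name as illustrative):

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

class ThrottleAwareFn extends DoFn<String, String> {
  // Autoscaling can watch specially named counters like this one to avoid
  // scaling up while the pipeline is blocked on an external service.
  private final Counter throttlingMsecs =
      Metrics.counter(ThrottleAwareFn.class, "throttling-msecs");

  @ProcessElement
  public void processElement(ProcessContext c) throws InterruptedException {
    long backoffMillis = 100; // stand-in for a real backoff computation
    // On a throttling response from the external service, wait and record
    // how long we were blocked.
    Thread.sleep(backoffMillis);
    throttlingMsecs.inc(backoffMillis);
    c.output(c.element());
  }
}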

On Tue, May 30, 2023, 11:42 AM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> While working on some remote model handler code I hit a point where I
> needed to understand how Beam IOs interpret and act on being throttled
> by an external service. This turned into a few discussions and then a small
> write-up doc (
> https://docs.google.com/document/d/1ePorJGZnLbNCmLD9mR7iFYOdPsyDA1rDnTpYnbdrzSU/edit?usp=sharing)
> to encapsulate the basics of what I learned. If you're familiar with this
> topic feel free to make suggestions on the doc, I'm intending to add this
> to the wiki so there's a resource for how this works in the future!
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>


Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Reuven Lax via dev
+1 (binding)

On Tue, May 30, 2023 at 2:43 PM Ahmet Altay via dev 
wrote:

> +1 (binding)
>
> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev 
> wrote:
>
>> Thanks Danny and Jack! Dataflow containers are up!
>>
>> Only PMC votes count but feel free to test your use cases and vote on
>> this thread!
>>
>> On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Tested with  https://github.com/Talend/beam-samples/
>>> (Java SDK v8/v11/v17, Spark 3.x runner).
>>>
>>> On 27 May 2023, at 19:38, Bruno Volpato via dev 
>>> wrote:
>>>
>>> I was able to check that containers are all there and complete
>>> my validation.
>>>
>>> +1 (non-binding).
>>>
>>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
>>> SDK 11, Dataflow runner).
>>>
>>>
>>> Thanks Ritesh and Danny!
>>>
>>> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 It looks like some Dataflow containers didn't get published, so some
 jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
 the container release, so that should hopefully be available later today.

 Thanks,
 Danny

 On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
 dev@beam.apache.org> wrote:

> Hi everyone,
> Please review and vote on the release candidate #2 for the version
> 2.48.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if no issues are found. Only PMC member votes will
> count towards the final vote, but votes from all community members are
> encouraged and helpful for finding regressions; you can either test your
> own use cases or use cases from the validation sheet [10].
>
> The complete staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2],
> which is signed with the key with fingerprint
> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.48.0-RC2" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7] (to be generated).
> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
> 8.0.322.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> * Go artifacts and documentation are available at pkg.go.dev [9]
> * Validation sheet with a tab for 2.48.0 release to help with
> validation [10].
> * Docker images published to Docker Hub [11].
> * PR to run tests against release branch [12].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> For guidelines on how to try the release in your projects, check out
> our blog post at /blog/validate-beam-release/.
>
> *NOTE: Dataflow containers for Python are not finalized yet (likely to
> happen on Tuesday). I will follow up on this thread once that is done. 
> Feel
> free to test it on other runners until then. *
>
> Thanks,
> Ritesh Ghorse
>
> [1] https://github.com/apache/beam/milestone/12
> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1346/
> [5] https://github.com/apache/beam/tree/v2.48.0-RC2
> [6] https://github.com/apache/beam/pull/26903
> [7] https://github.com/apache/beam-site/pull/645
> [8] https://pypi.org/project/apache-beam/2.48.0rc2/
> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
> 
> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
> [12] https://github.com/apache/beam/pull/26811
>
>
>>>


Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Ahmet Altay via dev
+1 (binding)

On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev 
wrote:

> Thanks Danny and Jack! Dataflow containers are up!
>
> Only PMC votes count but feel free to test your use cases and vote on this
> thread!
>
> On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
>
>> +1 (binding)
>>
>> Tested with  https://github.com/Talend/beam-samples/
>> (Java SDK v8/v11/v17, Spark 3.x runner).
>>
>> On 27 May 2023, at 19:38, Bruno Volpato via dev 
>> wrote:
>>
>> I was able to check that containers are all there and complete
>> my validation.
>>
>> +1 (non-binding).
>>
>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
>> SDK 11, Dataflow runner).
>>
>>
>> Thanks Ritesh and Danny!
>>
>> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> It looks like some Dataflow containers didn't get published, so some
>>> jobs using the legacy runner (runner v2 disabled) will fail. I kicked off
>>> the container release, so that should hopefully be available later today.
>>>
>>> Thanks,
>>> Danny
>>>
>>> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Hi everyone,
 Please review and vote on the release candidate #2 for the version
 2.48.0, as follows:
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)


 Reviewers are encouraged to test their own use cases with the release
 candidate, and vote +1 if no issues are found. Only PMC member votes will
 count towards the final vote, but votes from all community members are
 encouraged and helpful for finding regressions; you can either test your
 own use cases or use cases from the validation sheet [10].

 The complete staging area is available for your review, which includes:
 * GitHub Release notes [1],
 * the official Apache source release to be deployed to dist.apache.org [2],
 which is signed with the key with fingerprint
 E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.48.0-RC2" [5],
 * website pull request listing the release [6], the blog post [6], and
 publishing the API reference manual [7] (to be generated).
 * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
 8.0.322.
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2] and PyPI[8].
 * Go artifacts and documentation are available at pkg.go.dev [9]
 * Validation sheet with a tab for 2.48.0 release to help with
 validation [10].
 * Docker images published to Docker Hub [11].
 * PR to run tests against release branch [12].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 For guidelines on how to try the release in your projects, check out
 our blog post at /blog/validate-beam-release/.

 *NOTE: Dataflow containers for Python are not finalized yet (likely to
 happen on Tuesday). I will follow up on this thread once that is done. Feel
 free to test it on other runners until then. *

 Thanks,
 Ritesh Ghorse

 [1] https://github.com/apache/beam/milestone/12
 [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1346/
 [5] https://github.com/apache/beam/tree/v2.48.0-RC2
 [6] https://github.com/apache/beam/pull/26903
 [7] https://github.com/apache/beam-site/pull/645
 [8] https://pypi.org/project/apache-beam/2.48.0rc2/
 [9]
 https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
 
 [10]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
 [11] https://hub.docker.com/search?q=apache%2Fbeam=image
 [12] https://github.com/apache/beam/pull/26811


>>


Don't miss Beam Summit in NYC this year!

2023-05-30 Thread Carolina Escobar
*Stay on top of the Data Science trends by attending Beam Summit!*


*Take a quick peek at our speaker line-up. Here are some of our sessions:*

   - *Cross-language JdbcIO enabled by Beam portable schemas*
   Yi Hu shares the development experience and the use cases enabled by the
   significant feature and performance improvements in Beam v2.42.0 to the
   Beam Python SDK's JdbcIO and the underlying Beam portable schemas.

   - *Use Apache Beam to build Machine Learning Feature System at Affirm*
   Hao Xu explores how Affirm uses Apache Beam to build a unified
   transformation engine within a machine-learning feature store,
   demonstrating how to leverage Apache Beam to enable machine learning in
   various business cases, such as fraud detection, underwriting, and growth.

   - *Meeting Security Requirements for Apache Beam Pipelines on Google Cloud*
   Lorenzo Caggioni identifies a reference architecture that accomplishes
   requirements such as role separation and least privilege, private
   resources, and encryption of customer data in a Google Cloud Platform
   environment.

   - *Oops I *actually* wrote a Portable Beam Runner in Go*
   Robert Burke, through a demo, code walkthrough, and testing advice, covers
   what the new runner can do and why you might find it useful.

   - *Simplifying Speech-to-Text Processing with Apache Beam and Redis*
   Pramod Rao and Prateek Sheel will present their design journey to solve a
   complex speech-to-text processing problem in Apache Beam by leveraging
   Redis as a simple and efficient external persistent state.

   - *Introduction to Clustering in Apache Beam*
   Jasper Van den Bossche dives into the steps involved in one of the big
   challenges when working with per-entity models: training. These steps may
   include data ingestion from different sources, processing data of varying
   formats and quality, inference of the different models, and
   post-processing the results so they can be presented to the end user in a
   way that is easy to understand.

   - *Developing (experimental) Rust SDKs and a Beam engine for IoT devices*
   Sho Nakatani proposes making SpringQL a Beam engine and discusses the work
   done to change SpringQL's API from an incomplete streaming SQL to the Beam
   Model.


Check out the Program 
Register Now 



Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Chamikara Jayalath via dev
Input/output PCollection types at least have to be portable Beam types [1]
for cross-language to work.

I think we restricted schema-aware transforms to PCollection<Row> since Row
was expected to be an efficient replacement for arbitrary portable Beam
types (not sure how true that is in practice currently).
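
Concretely, that means a PCollection crossing the language boundary needs a
portable coder, which a Row schema provides. A minimal sketch of attaching
one, with a placeholder field name:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;

class PortableRows {
  static PCollection<Row> toPortableRows(PCollection<String> input) {
    // Placeholder schema; every field type here maps to a portable Beam type.
    Schema schema = Schema.builder().addStringField("word").build();
    return input
        .apply(MapElements.into(TypeDescriptors.rows())
            .via(s -> Row.withSchema(schema).addValue(s).build()))
        // The schema (and thus RowCoder) is what makes this PCollection
        // consumable from another SDK.
        .setRowSchema(schema);
  }
}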

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829

On Tue, May 30, 2023 at 1:54 PM Byron Ellis  wrote:

> Is it actually necessary for a PTransform that is configured via the
> Schema mechanism to also be one that uses RowCoder? Those strike me as two
> separate concerns and unnecessarily limiting.
>
> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
> wrote:
>
>> +1 for the simplification.
>>
>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
>> wrote:
>>
>>> Yeah. Essentially one needs to do (1) name the arguments and (2) implement
>>> the transform. Hopefully (1) could be done in a concise way that allows for
>>> easy consumption from both Java and cross-language.
>>>
>>
>> +1 but I think the hard part today is to convert existing PTransforms to
>> be schema-aware transform compatible (for example, change input/output
>> types and make sure parameters take Beam Schema compatible types). But this
>> makes sense for new transforms.
>>
>>
>>
>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
>>> wrote:
>>>
 Or perhaps the other way around? If you have a Schema we can
 auto-generate the associated builder on the PTransform? Either way, more
 DRY.

 On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
 dev@beam.apache.org> wrote:

> +1 to this simplification, it's a historical artifact that provides no
> value.
>
> I would love it if we also looked into ways to auto-generate the
> SchemaTransformProvider (e.g. via introspection if a transform takes a
> small number of arguments, or uses the standard builder pattern...),
> ideally with something as simple as adding a decorator to the PTransform
> itself.
>
>
> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone,
>>
>> I was looking at how we use SchemaTransforms in our expansion
>> service. From what I see, there may be a redundant step in developing
>> SchemaTransforms. Currently, we have 3 pieces:
>> - SchemaTransformProvider [1]
>> - A configuration object
>> - SchemaTransform [2]
>>
>> The API is generally used like this:
>> 1. The SchemaTransformProvider takes a configuration object and
>> returns a SchemaTransform
>> 2. The SchemaTransform is used to build a PTransform according to the
>> configuration
>>
>> In these steps, the SchemaTransform class seems unnecessary. We can
>> combine the two steps if we have SchemaTransformProvider return the
>> PTransform directly.
>>
>> We can then remove the SchemaTransform class as it will be obsolete.
>> This should be safe to do; the only place it's used in our API is here 
>> [3],
>> and that can be simplified if we make this change (we'd just trim `
>> .buildTransform()` off the end as `provider.from(configRow)` will
>> directly return the PTransform).
>>
>> I'd like to first mention that I was not involved in the design
>> process of this API so I may be missing some information on why it was 
>> set
>> up this way.
>>
>> A few developers already raised questions about how there's seemingly
>> unnecessary boilerplate involved in making a Java transform portable. My
>> assumption is this was designed to follow the pattern of the previous
>> iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>> SchemaIO[5] -> PTransform. However, with the newer
>> SchemaTransformProvider API, we dropped a few methods and reduced the
>> SchemaTransform class to have a generic buildTransform() method. See the
>> example of PubsubReadSchemaTransformProvider [6], where the
>> SchemaTransform interface and buildTransform method are implemented
>> just to satisfy the requirement that SchemaTransformProvider::from
>> return a SchemaTransform.
>>
>> I'm bringing this up because if we are looking to encourage
>> contribution to cross-language use cases, we should make it simpler and
>> less convoluted to develop portable transforms.
>>
>> There are a number of SchemaTransforms already developed, but
>> applying these changes to them should be straightforward. If people think
>> this is a good idea, I can open a PR and implement them.
>>
>> Best,
>> Ahmed
>>
>> [1]
>> 

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Ritesh Ghorse via dev
Thanks Danny and Jack! Dataflow containers are up!

Only PMC votes count but feel free to test your use cases and vote on this
thread!

On Tue, May 30, 2023 at 11:26 AM Alexey Romanenko 
wrote:

> +1 (binding)
>
> Tested with  https://github.com/Talend/beam-samples/
> (Java SDK v8/v11/v17, Spark 3.x runner).
>
> On 27 May 2023, at 19:38, Bruno Volpato via dev 
> wrote:
>
> I was able to check that containers are all there and complete
> my validation.
>
> +1 (non-binding).
>
> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java
> SDK 11, Dataflow runner).
>
>
> Thanks Ritesh and Danny!
>
> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> It looks like some Dataflow containers didn't get published, so some jobs
>> using the legacy runner (runner v2 disabled) will fail. I kicked off the
>> container release, so that should hopefully be available later today.
>>
>> Thanks,
>> Danny
>>
>> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #2 for the version
>>> 2.48.0, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> Reviewers are encouraged to test their own use cases with the release
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will
>>> count towards the final vote, but votes from all community members are
>>> encouraged and helpful for finding regressions; you can either test your
>>> own use cases or use cases from the validation sheet [10].
>>>
>>> The complete staging area is available for your review, which includes:
>>> * GitHub Release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>> which is signed with the key with fingerprint
>>> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.48.0-RC2" [5],
>>> * website pull request listing the release [6], the blog post [6], and
>>> publishing the API reference manual [7] (to be generated).
>>> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK
>>> 8.0.322.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI[8].
>>> * Go artifacts and documentation are available at pkg.go.dev [9]
>>> * Validation sheet with a tab for 2.48.0 release to help with validation
>>> [10].
>>> * Docker images published to Docker Hub [11].
>>> * PR to run tests against release branch [12].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> For guidelines on how to try the release in your projects, check out our
>>> blog post at /blog/validate-beam-release/.
>>>
>>> *NOTE: Dataflow containers for Python are not finalized yet (likely to
>>> happen on Tuesday). I will follow up on this thread once that is done. Feel
>>> free to test it on other runners until then. *
>>>
>>> Thanks,
>>> Ritesh Ghorse
>>>
>>> [1] https://github.com/apache/beam/milestone/12
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1346/
>>> [5] https://github.com/apache/beam/tree/v2.48.0-RC2
>>> [6] https://github.com/apache/beam/pull/26903
>>> [7] https://github.com/apache/beam-site/pull/645
>>> [8] https://pypi.org/project/apache-beam/2.48.0rc2/
>>> [9]
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam
>>> 
>>> [10]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
>>> [11] https://hub.docker.com/search?q=apache%2Fbeam=image
>>> [12] https://github.com/apache/beam/pull/26811
>>>
>>>
>


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Is it actually necessary for a PTransform that is configured via the Schema
mechanism to also be one that uses RowCoder? Those strike me as two
separate concerns and unnecessarily limiting.

On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
wrote:

> +1 for the simplification.
>
> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
> wrote:
>
>> Yeah. Essentially one needs to do (1) name the arguments and (2) implement
>> the transform. Hopefully (1) could be done in a concise way that allows for
>> easy consumption from both Java and cross-language.
>>
>
> +1 but I think the hard part today is to convert existing PTransforms to
> be schema-aware transform compatible (for example, change input/output
> types and make sure parameters take Beam Schema compatible types). But this
> makes sense for new transforms.
>
>
>
>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
>> wrote:
>>
>>> Or perhaps the other way around? If you have a Schema we can
>>> auto-generate the associated builder on the PTransform? Either way, more
>>> DRY.
>>>
>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 to this simplification, it's a historical artifact that provides no
 value.

 I would love it if we also looked into ways to auto-generate the
 SchemaTransformProvider (e.g. via introspection if a transform takes a
 small number of arguments, or uses the standard builder pattern...),
 ideally with something as simple as adding a decorator to the PTransform
 itself.


 On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
 dev@beam.apache.org> wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and
> returns a SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete.
> This should be safe to do; the only place it's used in our API is here 
> [3],
> and that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design
> process of this API so I may be missing some information on why it was set
> up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. My
> assumption is this was designed to follow the pattern of the previous
> iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented
> just to satisfy the requirement that SchemaTransformProvider::from
> return a SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage
> contribution to cross-language use cases, we should make it simpler and
> less convoluted to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> 

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Chamikara Jayalath via dev
+1 for the simplification.

On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
wrote:

> Yeah. Essentially one needs to do (1) name the arguments and (2) implement
> the transform. Hopefully (1) could be done in a concise way that allows for
> easy consumption from both Java and cross-language.
>

+1 but I think the hard part today is to convert existing PTransforms to be
schema-aware transform compatible (for example, change input/output types
and make sure parameters take Beam Schema compatible types). But this makes
sense for new transforms.
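
A sketch of the conversion being described: gathering an existing transform's
parameters into a configuration object whose fields are all Beam Schema
compatible types, so a schema (and a config Row) can be inferred
automatically. Names here are hypothetical; it assumes Beam's AutoValueSchema
support:

import com.google.auto.value.AutoValue;
import org.apache.beam.sdk.schemas.AutoValueSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Every field is a Beam Schema compatible type (String, Integer, ...), so a
// schema, and therefore a config Row, can be derived from this class.
@DefaultSchema(AutoValueSchema.class)
@AutoValue
abstract class MyReadConfiguration {
  abstract String getTopic();
  abstract Integer getMaxRecords();

  static Builder builder() {
    return new AutoValue_MyReadConfiguration.Builder();
  }

  @AutoValue.Builder
  abstract static class Builder {
    abstract Builder setTopic(String topic);
    abstract Builder setMaxRecords(Integer maxRecords);
    abstract MyReadConfiguration build();
  }
}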



> On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
> wrote:
>
>> Or perhaps the other way around? If you have a Schema we can
>> auto-generate the associated builder on the PTransform? Either way, more
>> DRY.
>>
>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 to this simplification, it's a historical artifact that provides no
>>> value.
>>>
>>> I would love it if we also looked into ways to auto-generate the
>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>> small number of arguments, or uses the standard builder pattern...),
>>> ideally with something as simple as adding a decorator to the PTransform
>>> itself.
>>>
>>>
>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Hey everyone,

 I was looking at how we use SchemaTransforms in our expansion service.
 From what I see, there may be a redundant step in developing
 SchemaTransforms. Currently, we have 3 pieces:
 - SchemaTransformProvider [1]
 - A configuration object
 - SchemaTransform [2]

 The API is generally used like this:
 1. The SchemaTransformProvider takes a configuration object and returns
 a SchemaTransform
 2. The SchemaTransform is used to build a PTransform according to the
 configuration

 In these steps, the SchemaTransform class seems unnecessary. We can
 combine the two steps if we have SchemaTransformProvider return the
 PTransform directly.

 We can then remove the SchemaTransform class as it will be obsolete.
 This should be safe to do; the only place it's used in our API is here [3],
 and that can be simplified if we make this change (we'd just trim `
 .buildTransform()` off the end as `provider.from(configRow)` will
 directly return the PTransform).

 I'd like to first mention that I was not involved in the design process
 of this API so I may be missing some information on why it was set up this
 way.

 A few developers already raised questions about how there's seemingly
 unnecessary boilerplate involved in making a Java transform portable. My
 assumption is this was designed to follow the pattern of the previous
 iteration of this API (SchemaIO): SchemaIOProvider[4] ->
 SchemaIO[5] -> PTransform. However, with the newer
 SchemaTransformProvider API, we dropped a few methods and reduced the
 SchemaTransform class to have a generic buildTransform() method. See the
 example of PubsubReadSchemaTransformProvider [6], where the
 SchemaTransform interface and buildTransform method are implemented
 just to satisfy the requirement that SchemaTransformProvider::from
 return a SchemaTransform.

 I'm bringing this up because if we are looking to encourage
 contribution to cross-language use cases, we should make it simpler and
 less convoluted to develop portable transforms.

 There are a number of SchemaTransforms already developed, but applying
 these changes to them should be straightforward. If people think this is a
 good idea, I can open a PR and implement them.

 Best,
 Ahmed

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
 [2]
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
 [3]
 https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
 [4]
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
 [5]
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
 [6]
 https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137

>>>


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Robert Bradshaw via dev
Yeah. Essentially one needs to do (1) name the arguments and (2) implement the
transform. Hopefully (1) could be done in a concise way that allows for
easy consumption from both Java and cross-language.
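
To make (1) concrete, one could imagine naming the arguments exactly once,
say with an annotation that an annotation processor (or the expansion
service, via reflection) turns into a SchemaTransformProvider. Nothing like
this exists in Beam today; it is purely a sketch:

import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollectionRowTuple;

// Hypothetical annotation, not part of Beam today.
@interface AutoSchemaTransform {
  String identifier();
}

// (1) Name the arguments once, as plain fields...
@AutoSchemaTransform(
    identifier = "beam:schematransform:org.apache.beam:my_filter:v1")
class MyFilterTransform extends PTransform<PCollectionRowTuple, PCollectionRowTuple> {
  final String field;   // field names double as the config Row's schema
  final String pattern;

  MyFilterTransform(String field, String pattern) {
    this.field = field;
    this.pattern = pattern;
  }

  // (2) ...and implement the transform. The provider, the config schema, and
  // the cross-language registration would all be generated from the above.
  @Override
  public PCollectionRowTuple expand(PCollectionRowTuple input) {
    return input; // placeholder body
  }
}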

On Tue, May 30, 2023 at 12:25 PM Byron Ellis  wrote:

> Or perhaps the other way around? If you have a Schema we can auto-generate
> the associated builder on the PTransform? Either way, more DRY.
>
> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to this simplification, it's a historical artifact that provides no
>> value.
>>
>> I would love it if we also looked into ways to auto-generate the
>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>> small number of arguments, or uses the standard builder pattern...),
>> ideally with something as simple as adding a decorator to the PTransform
>> itself.
>>
>>
>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I was looking at how we use SchemaTransforms in our expansion service.
>>> From what I see, there may be a redundant step in developing
>>> SchemaTransforms. Currently, we have 3 pieces:
>>> - SchemaTransformProvider [1]
>>> - A configuration object
>>> - SchemaTransform [2]
>>>
>>> The API is generally used like this:
>>> 1. The SchemaTransformProvider takes a configuration object and returns
>>> a SchemaTransform
>>> 2. The SchemaTransform is used to build a PTransform according to the
>>> configuration
>>>
>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>> combine the two steps if we have SchemaTransformProvider return the
>>> PTransform directly.
>>>
>>> We can then remove the SchemaTransform class as it will be obsolete.
>>> This should be safe to do; the only place it's used in our API is here [3],
>>> and that can be simplified if we make this change (we'd just trim `
>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>> directly return the PTransform).
>>>
>>> I'd like to first mention that I was not involved in the design process
>>> of this API so I may be missing some information on why it was set up this
>>> way.
>>>
>>> A few developers already raised questions about how there's seemingly
>>> unnecessary boilerplate involved in making a Java transform portable. My
>>> assumption is this was designed to follow the pattern of the previous
>>> iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>>> SchemaIO[5] -> PTransform. However, with the newer
>>> SchemaTransformProvider API, we dropped a few methods and reduced the
>>> SchemaTransform class to have a generic buildTransform() method. See the
>>> example of PubsubReadSchemaTransformProvider [6], where the
>>> SchemaTransform interface and buildTransform method are implemented
>>> just to satisfy the requirement that SchemaTransformProvider::from
>>> return a SchemaTransform.
>>>
>>> I'm bringing this up because if we are looking to encourage contribution
>>> to cross-language use cases, we should make it simpler and less convoluted
>>> to develop portable transforms.
>>>
>>> There are a number of SchemaTransforms already developed, but applying
>>> these changes to them should be straightforward. If people think this is a
>>> good idea, I can open a PR and implement them.
>>>
>>> Best,
>>> Ahmed
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>> [2]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>> [3]
>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>> [4]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>> [5]
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>> [6]
>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>
>>


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Or perhaps the other way around? If you have a Schema we can auto-generate
the associated builder on the PTransform? Either way, more DRY.

On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:

> +1 to this simplification, it's a historical artifact that provides no
> value.
>
> I would love it if we also looked into ways to auto-generate the
> SchemaTransformProvider (e.g. via introspection if a transform takes a
> small number of arguments, or uses the standard builder pattern...),
> ideally with something as simple as adding a decorator to the PTransform
> itself.
>
>
> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone,
>>
>> I was looking at how we use SchemaTransforms in our expansion service.
>> From what I see, there may be a redundant step in developing
>> SchemaTransforms. Currently, we have 3 pieces:
>> - SchemaTransformProvider [1]
>> - A configuration object
>> - SchemaTransform [2]
>>
>> The API is generally used like this:
>> 1. The SchemaTransformProvider takes a configuration object and returns a
>> SchemaTransform
>> 2. The SchemaTransform is used to build a PTransform according to the
>> configuration
>>
>> In these steps, the SchemaTransform class seems unnecessary. We can
>> combine the two steps if we have SchemaTransformProvider return the
>> PTransform directly.
>>
>> We can then remove the SchemaTransform class as it will be obsolete. This
>> should be safe to do; the only place it's used in our API is here [3], and
>> that can be simplified if we make this change (we'd just trim `
>> .buildTransform()` off the end as `provider.from(configRow)` will
>> directly return the PTransform).
>>
>> I'd like to first mention that I was not involved in the design process
>> of this API so I may be missing some information on why it was set up this
>> way.
>>
>> A few developers already raised questions about how there's seemingly
>> unnecessary boilerplate involved in making a Java transform portable. My
>> assumption is this was designed to follow the pattern
>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>> SchemaIO[5] -> PTransform. However, with the newer
>> SchemaTransformProvider API, we dropped a few methods and reduced the
>> SchemaTransform class to have a generic buildTransform() method. See the
>> example of PubsubReadSchemaTransformProvider [6], where the
>> SchemaTransform interface and buildTransform method are implemented just
>> to satisfy the requirement that SchemaTransformProvider::from return a
>> SchemaTransform.
>>
>> I'm bringing this up because if we are looking to encourage contribution
>> to cross-language use cases, we should make it simpler and less convoluted
>> to develop portable transforms.
>>
>> There are a number of SchemaTransforms already developed, but applying
>> these changes to them should be straightforward. If people think this is a
>> good idea, I can open a PR and implement them.
>>
>> Best,
>> Ahmed
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>> [3]
>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>> [5]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>> [6]
>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>
>


Client-Side Throttling in Apache Beam

2023-05-30 Thread Jack McCluskey via dev
Hey everyone,

While working on some remote model handler code I hit a point where I
needed to understand how Beam IOs interpret and act on being throttled
by an external service. This turned into a few discussions and then a small
write-up doc (
https://docs.google.com/document/d/1ePorJGZnLbNCmLD9mR7iFYOdPsyDA1rDnTpYnbdrzSU/edit?usp=sharing)
to encapsulate the basics of what I learned. If you're familiar with this
topic, feel free to make suggestions on the doc; I'm intending to add this
to the wiki so there's a resource for how this works in the future!
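
As a generic illustration of the pattern the doc covers (this is not code from
any Beam IO; ThrottledException and ServiceCall are invented stand-ins for
whatever the external client surfaces), a client-side backoff loop can look
like this:

    import java.util.Random;

    public final class BackoffSketch {
      private static final Random RANDOM = new Random();

      interface ServiceCall<T> {
        T run() throws Exception;
      }

      static class ThrottledException extends Exception {}

      public static <T> T callWithBackoff(ServiceCall<T> call) throws Exception {
        long delayMillis = 100;
        for (int attempt = 0; attempt < 10; attempt++) {
          try {
            return call.run();
          } catch (ThrottledException e) {
            // Sleep with jitter, then retry with an exponentially larger
            // delay, capped so a long outage doesn't produce unbounded waits.
            Thread.sleep(delayMillis + RANDOM.nextInt((int) delayMillis));
            delayMillis = Math.min(delayMillis * 2, 60_000);
          }
        }
        throw new Exception("still throttled after retries");
      }
    }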

Thanks,

Jack McCluskey

-- 


Jack McCluskey
SWE - DataPLS PLAT/ Dataflow ML
RDU
jrmcclus...@google.com


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Robert Bradshaw via dev
+1 to this simplification, it's a historical artifact that provides no
value.

I would love it if we also looked into ways to auto-generate the
SchemaTransformProvider (e.g. via introspection if a transform takes a
small number of arguments, or uses the standard builder pattern...),
ideally with something as simple as adding a decorator to the PTransform
itself.


On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev 
wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and returns a
> SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete. This
> should be safe to do; the only place it's used in our API is here [3], and
> that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design process of
> this API so I may be missing some information on why it was set up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. My
> assumption is this was designed to follow the pattern
> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented just
> to satisfy the requirement that SchemaTransformProvider::from return a
> SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage contribution
> to cross-language use cases, we should make it simpler and less convoluted
> to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
> [6]
> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>


Re: Hierarchical fanout with Beam combiners?

2023-05-30 Thread Robert Bradshaw via dev
On Tue, May 30, 2023 at 10:37 AM Kenneth Knowles  wrote:

>
> On Sat, May 27, 2023 at 4:20 PM Stephan Hoyer via dev 
> wrote:
>
>> On Fri, May 26, 2023 at 2:59 PM Robert Bradshaw 
>> wrote:
>>
>>> Yes, with_hot_key_fanout only performs a single level of fanout. I don't
>>> think fanning out more than this has been explored, but I would imagine
>>> that for most cases the increased IO would negate most if not all of the
>>> benefits.
>>>
>>
>> My reasoning for multi-level fanout would be that the total amount of IO
>> converges as a geometric series: at each level, the amount of
>> data is reduced by a factor of 1/fanout. So even if fanout=2 at each level,
>> the total amount of IO is twice the IO of not using fanout at all. The
>> general IO overhead would be a factor of "fanout / (fanout - 1)".
>>
>>
>>> In particular, note that we already do "combiner lifting" to do as much
>>> combining as we can on the mapper side, e.g. suppose we have M elements and
>>> N workers. Each worker will (to a first order of approximation) combine M/N
>>> elements down to a single element, leaving N elements total to be combined
>>> by a worker in the subsequent stage. If N is large (or the combining
>>> computation expensive) one can use with_hot_key_fanout to add an
>>> intermediate step and let the N workers each combine M/N elements into
>>> sqrt(N) partial aggregates, and the subsequent worker only needs to combine
>>> the sqrt(N) partial aggregates. Generally N (the number of workers, not the
>>> number of elements) is small enough that multiple levels are not needed.
>>>
>>
>> Thanks for clarifying Robert. I did not realize that "combiner lifting"
>> was a thing! We had been operating under the assumption that we should use
>> fanout to sqrt(M), which could indeed be bigger than sqrt(N). In general
>> "with_hot_key_fanout" could use documentation to explain exactly what the
>> parameter means and to indicate suggested usage (e.g., set it to sqrt(N)).
>>
>> I will say that one other concern for us is memory usage. We typically
>> work with large, image-like data coming out of weather simulations, which
>> we try to divide into 10-100 MB chunks. With 1000 workers, this would
>> suggest fanout=30, which means combining up to 3 GB of data on a single
>> machine. This is probably fine but doesn't leave a large margin for error.
>>
>
> One thing in Robert's "to a first approximation" is that the pre-shuffle
> combine does flush things to shuffle to avoid running out of memory. This
> logic is per-SDK (because of how the combiner is invoked and also how to
> measure memory footprint, what caching data structures are performant, etc.).
> So if this is working as intended, the impact of doing a large amount of
> pre-combining on a machine is just a lesser benefit because of having to
> flush more than one result per key, not excessive memory pressure. I'm sure
> it is inexact in specific cases, though.
>
>>
+1.

Also, the general CombineFn only combines a single element into the
aggregate at a time--no need to store all the inputs in memory at the same
time. (The flushing is primarily a concern for the many-keys situation, in
which case many aggregates are stored, one per key.)
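
To sketch that concretely (illustrative only; Beam already ships a Mean
transform, and withFanout on Combine.Globally, like withHotKeyFanout on
Combine.PerKey, is the Java-side version of the fanout knob discussed here):

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    class FanoutSketch {
      // Constant-size accumulator: a (sum, count) pair, however many inputs.
      static class MeanFn extends Combine.CombineFn<Double, KV<Double, Long>, Double> {
        @Override
        public KV<Double, Long> createAccumulator() {
          return KV.of(0.0, 0L);
        }

        @Override
        public KV<Double, Long> addInput(KV<Double, Long> acc, Double x) {
          // Inputs are folded in one at a time; nothing is buffered.
          return KV.of(acc.getKey() + x, acc.getValue() + 1);
        }

        @Override
        public KV<Double, Long> mergeAccumulators(Iterable<KV<Double, Long>> accs) {
          double sum = 0;
          long count = 0;
          for (KV<Double, Long> a : accs) {
            sum += a.getKey();
            count += a.getValue();
          }
          return KV.of(sum, count);
        }

        @Override
        public Double extractOutput(KV<Double, Long> acc) {
          return acc.getValue() == 0 ? Double.NaN : acc.getKey() / acc.getValue();
        }
      }

      // One extra pre-combine level, e.g. fanout ~ sqrt(numWorkers) per the
      // discussion above.
      static PCollection<Double> meanWithFanout(PCollection<Double> xs, int fanout) {
        return xs.apply(Combine.globally(new MeanFn()).withFanout(fanout));
      }
    }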

 On Fri, May 26, 2023 at 1:57 PM Stephan Hoyer via dev wrote:
>>>
 We have some use-cases where we are combining over very large sets
 (e.g., computing the average of 1e5 to 1e6 elements, corresponding to
 hourly weather observations over the past 50 years).

 "with_hot_key_fanout" seems to be rather essential for performing these
 calculations, but as far as I can tell it only performs a single level of
 fanout, i.e., instead of summing up 1e6 elements on a single node, you sum
 1e3 elements 1e3 times, and then sum those 1e3 results together.

 My guess is that such combiners could be much more efficient if this
 was performed in a hierarchical/multi-stage fashion proportional to
 log(element_count), e.g., summing 100 elements with 3 stages, or maybe
 summing 10 elements with 6 stages. Dask uses such a "tree reduction"
 strategy as controlled by the "split_every" parameter:
 https://github.com/dask/dask/blob/453bd7031828f72e4602498c1a1f776280794bea/dask/array/reductions.py#L109

 I understand that the number of fanout stages could not be computed
 automatically in the Beam data model, but it would still be nice to be able
 to specify this manually. Has there been any thought to introducing this
 feature?

 Thanks,
 Stephan

>>>


Re: Hierarchical fanout with Beam combiners?

2023-05-30 Thread Kenneth Knowles
On Sat, May 27, 2023 at 4:20 PM Stephan Hoyer via dev 
wrote:

> On Fri, May 26, 2023 at 2:59 PM Robert Bradshaw 
> wrote:
>
>> Yes, with_hot_key_fanout only performs a single level of fanout. I don't
>> think fanning out more than this has been explored, but I would imagine
>> that for most cases the increased IO would negate most if not all of the
>> benefits.
>>
>
> My reasoning for multi-level fanout would be that the total amount of IO
> converges as a geometric series: at each level, the amount of
> data is reduced by a factor of 1/fanout. So even if fanout=2 at each level,
> the total amount of IO is twice the IO of not using fanout at all. The
> general IO overhead would be a factor of "fanout / (fanout - 1)".
>
>
>> In particular, note that we already do "combiner lifting" to do as much
>> combining as we can on the mapper side, e.g. suppose we have M elements and
>> N workers. Each worker will (to a first order of approximation) combine M/N
>> elements down to a single element, leaving N elements total to be combined
>> by a worker in the subsequent stage. If N is large (or the combining
>> computation expensive) one can use with_hot_key_fanout to add an
>> intermediate step and let the N workers each combine M/N elements into
>> sqrt(N) partial aggregates, and the subsequent worker only needs to combine
>> the sqrt(N) partial aggregates. Generally N (the number of workers, not the
>> number of elements) is small enough that multiple levels are not needed.
>>
>
> Thanks for clarifying Robert. I did not realize that "combiner lifting"
> was a thing! We had been operating under the assumption that we should use
> fanout to sqrt(M), which could indeed be bigger than sqrt(N). In general
> "with_hot_key_fanout" could use documentation to explain exactly what the
> parameter means and to indicate suggested usage (e.g., set it to sqrt(N)).
>
> I will say that one other concern for us is memory usage. We typically
> work with large, image-like data coming out of weather simulations, which
> we try to divide into 10-100 MB chunks. With 1000 workers, this would
> suggest fanout=30, which means combining up to 3 GB of data on a single
> machine. This is probably fine but doesn't leave a large margin for error.
>

One thing in Robert's "to a first approximation" is that the pre-shuffle
combine does flush things to shuffle to avoid running out of memory. This
logic is per-SDK (because of how the combiner is invoked and also how to
measure memory footprint, what caching data structures are performant, etc.).
So if this is working as intended, the impact of doing a large amount of
pre-combining on a machine is just a lesser benefit because of having to
flush more than one result per key, not excessive memory pressure. I'm sure
it is inexact in specific cases, though.

Kenn


>
>
>>  On Fri, May 26, 2023 at 1:57 PM Stephan Hoyer via dev <
>> dev@beam.apache.org> wrote:
>>
>>> We have some use-cases where we are combining over very large sets
>>> (e.g., computing the average of 1e5 to 1e6 elements, corresponding to
>>> hourly weather observations over the past 50 years).
>>>
>>> "with_hot_key_fanout" seems to be rather essential for performing these
>>> calculations, but as far as I can tell it only performs a single level of
>>> fanout, i.e., instead of summing up 1e6 elements on a single node, you sum
>>> 1e3 elements 1e3 times, and then sum those 1e3 results together.
>>>
>>> My guess is that such combiners could be much more efficient if this was
>>> performed in a hierarchical/multi-stage fashion proportional to
>>> log(element_count), e.g., summing 100 elements with 3 stages, or maybe
>>> summing 10 elements with 6 stages. Dask uses such a "tree reduction"
>>> strategy as controlled by the "split_every" parameter:
>>> https://github.com/dask/dask/blob/453bd7031828f72e4602498c1a1f776280794bea/dask/array/reductions.py#L109
>>>
>>> I understand that the number of fanout stages could not be computed
>>> automatically in the Beam data model, but it would still be nice to be able
>>> to specify this manually. Has there been any thought to introducing this
>>> feature?
>>>
>>> Thanks,
>>> Stephan
>>>
>>


Re: [Proposal] Remove of Fix the Beam Dependency Check Report Job

2023-05-30 Thread Kenneth Knowles
+1 to just stopping the automated email. I don't find it valuable. It was
never finely-tuned enough in terms of actionability vs spam volume.

On Tue, May 30, 2023 at 7:24 AM Jack McCluskey via dev 
wrote:

> Hi everyone,
>
> Just bumping this again now that the long weekend is behind us. If no one
> advocates for fixing the job in the next few days I'll assume a lazy
> consensus and remove it.
>
> I also want to point out a typo in the subject: it should be "Remove *or*
>  Fix."
>
> Thanks,
>
> Jack McCluskey
>
> On Thu, May 25, 2023 at 3:16 PM Jack McCluskey 
> wrote:
>
>> Hey everyone,
>>
>> The Beam Dependency Check Report email (like
>> https://lists.apache.org/thread/tc9v1d66rx77wzvrjnkcf0jo3rxtmrhn) has
> not had a successful run since July 21st, 2022. I've done a little
> bit of digging into the problem and have found that the issue lies in a
> query to a "Python Compatibility Checking Service" that is just an IP
> address, passing along the package name and version and specifying that it
> wants Python 2 packages specifically. I made a few brief attempts to figure out
>> what that IP address was supposed to lead to and didn't turn up anything;
>> however, that doesn't seem to matter since the root of the problem is that
>> the job cannot connect to anything at that address, so the build fails and
>> the email is sent out without a body.
>>
>> I started a bit of work this afternoon trying to update the job to direct
>> its Python-related queries to PyPi's JSON API (
>> https://github.com/apache/beam/pull/26897); however, I question the need
>> for this automated email at all given that we added Dependabot to the
>> repository around 6 weeks before the Jenkins job started failing. If
>> there's a good reason to fix it I'll keep digging, otherwise I'm in favor
>> of removing the job altogether.
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> --
>>
>>
>> Jack McCluskey
>> SWE - DataPLS PLAT/ Dataflow ML
>> RDU
>> jrmcclus...@google.com
>>
>>
>>


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Kenneth Knowles
+1 in general (I will dig into details too, but later)

I will add a couple more steps that I dislike, which exist only for
historical reasons that are now obsolete:

The whole existence of runners/core-construction-java and everything in it.
The main reason it exists is to make it so that the main SDK has no dependency
on protobuf. We thought maybe the portable model would be generic enough to
have multiple possible encodings. That idea is obsolete. We are using
protobuf.

So that whole module, everything about generating proto for transforms,
could be in the core SDK to remove the need for the service loader pattern,
which often breaks when users build uber jars and is just generally
overused, needless complexity here.

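To make the uber-jar failure mode concrete, here is a sketch of the pattern,
taking SchemaTransformProvider from the proposal thread as the example
interface: ServiceLoader discovers implementations from META-INF/services
resource files, and a shaded jar built without merging those files keeps only
one jar's copy, so providers silently disappear.

    import java.util.ServiceLoader;
    import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;

    public class ListProviders {
      public static void main(String[] args) {
        // Each provider jar contributes a
        // META-INF/services/org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider
        // resource; if shading drops or overwrites those files, providers
        // vanish here with no error.
        for (SchemaTransformProvider provider :
            ServiceLoader.load(SchemaTransformProvider.class)) {
          System.out.println(provider.identifier());
        }
      }
    }
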
Not sure how this relates in detail to your proposal, but I just wanted to
weigh in with a big +1 to making it easy to make a portable schema-based
transform by default. Imagine a world where the Java-centric API is
actually a wrapper on the schema transform!

Kenn

On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev 
wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and returns a
> SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete. This
> should be safe to do; the only place it's used in our API is here [3], and
> that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design process of
> this API so I may be missing some information on why it was set up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. My
> assumption is this was designed to follow the pattern
> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented just
> to satisfy the requirement that SchemaTransformProvider::from return a
> SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage contribution
> to cross-language use cases, we should make it simpler and less convoluted
> to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
> [6]
> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>


Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Alexey Romanenko
+1 (binding)

Tested with  https://github.com/Talend/beam-samples/ 
(Java SDK v8/v11/v17, Spark 3.x runner).

> On 27 May 2023, at 19:38, Bruno Volpato via dev  wrote:
> 
> I was able to check that containers are all there and complete my validation.
> 
> +1 (non-binding).
> 
> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java 
> SDK 11, Dataflow runner).
> 
> 
> Thanks Ritesh and Danny!
> 
> On Fri, May 26, 2023 at 10:09 AM Danny McCormick via dev  wrote:
>> It looks like some Dataflow containers didn't get published, so some jobs 
>> using the legacy runner (runner v2 disabled) will fail. I kicked off the 
>> container release, so that should hopefully be available later today.
>> 
>> Thanks,
>> Danny
>> 
>> On Thu, May 25, 2023 at 11:19 PM Ritesh Ghorse via dev  wrote:
>>> Hi everyone,
>>> Please review and vote on the release candidate #2 for the version 2.48.0, 
>>> as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>> 
>>> 
>>> Reviewers are encouraged to test their own use cases with the release 
>>> candidate, and vote +1 if no issues are found. Only PMC member votes will 
>>> count towards the final vote, but votes from all community members are 
>>> encouraged and helpful for finding regressions; you can either test your 
>>> own use cases or use cases from the validation sheet [10].
>>> 
>>> The complete staging area is available for your review, which includes:
>>> * GitHub Release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org [2],
>>> which is signed with the key with fingerprint
>>> E4C74BEC861570F5A3E44E46280A0AC32DBAE62B [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.48.0-RC2" [5],
>>> * website pull request listing the release [6], the blog post [6], and 
>>> publishing the API reference manual [7] (to be generated).
>>> * Java artifacts were built with Gradle 7.5.1 and OpenJDK/Oracle JDK 
>>> 8.0.322. 
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2] and PyPI [8].
>>> * Go artifacts and documentation are available at pkg.go.dev [9]
>>> * Validation sheet with a tab for 2.48.0 release to help with validation 
>>> [10].
>>> * Docker images published to Docker Hub [11].
>>> * PR to run tests against release branch [12].
>>> 
>>> The vote will be open for at least 72 hours. It is adopted by majority 
>>> approval, with at least 3 PMC affirmative votes.
>>> 
>>> For guidelines on how to try the release in your projects, check out our 
>>> blog post at /blog/validate-beam-release/.
>>> 
>>> NOTE: Dataflow containers for Python are not finalized yet (likely to 
>>> happen on tuesday). I will follow up on this thread once that is done. Feel 
>>> free to test it on other runners until then. 
>>> 
>>> Thanks,
>>> Ritesh Ghorse
>>> 
>>> [1] https://github.com/apache/beam/milestone/12
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.48.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4] https://repository.apache.org/content/repositories/orgapachebeam-1346/
>>> [5] https://github.com/apache/beam/tree/v2.48.0-RC2
>>> [6] https://github.com/apache/beam/pull/26903
>>> [7] https://github.com/apache/beam-site/pull/645
>>> [8] https://pypi.org/project/apache-beam/2.48.0rc2/
>>> [9] 
>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.48.0-RC2/go/pkg/beam 
>>> [10] 
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=458120434
>>> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> [12] https://github.com/apache/beam/pull/26811
>>> 



Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Ahmed Abualsaud via dev
Hey everyone,

I was looking at how we use SchemaTransforms in our expansion service. From
what I see, there may be a redundant step in developing SchemaTransforms.
Currently, we have 3 pieces:
- SchemaTransformProvider [1]
- A configuration object
- SchemaTransform [2]

The API is generally used like this:
1. The SchemaTransformProvider takes a configuration object and returns a
SchemaTransform
2. The SchemaTransform is used to build a PTransform according to the
configuration

In these steps, the SchemaTransform class seems unnecessary. We can combine
the two steps if we have SchemaTransformProvider return the PTransform
directly.

We can then remove the SchemaTransform class as it will be obsolete. This
should be safe to do; the only place it's used in our API is here [3], and
that can be simplified if we make this change (we'd just trim `
.buildTransform()` off the end as `provider.from(configRow)` will directly
return the PTransform).
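
Concretely, the call site would go from two hops to one (a sketch; the wrapper
class is invented, while the method names are those of [1] and [2]):

    import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollectionRowTuple;
    import org.apache.beam.sdk.values.Row;

    class CallSiteSketch {
      // Today: provider -> SchemaTransform -> PTransform.
      static PTransform<PCollectionRowTuple, PCollectionRowTuple> expandFor(
          SchemaTransformProvider provider, Row configRow) {
        return provider.from(configRow).buildTransform();
      }
      // Proposed: from(configRow) returns the PTransform directly, so the
      // body becomes just provider.from(configRow).
    }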

I'd like to first mention that I was not involved in the design process of
this API so I may be missing some information on why it was set up this way.

A few developers already raised questions about how there's seemingly
unnecessary boilerplate involved in making a Java transform portable. My
assumption is this was designed to follow the pattern
of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
SchemaIO[5] -> PTransform. However, with the newer SchemaTransformProvider
API, we dropped a few methods and reduced the SchemaTransform class to have
a generic buildTransform() method. See the example of
PubsubReadSchemaTransformProvider [6], where the SchemaTransform interface
and buildTransform method are implemented just to satisfy the requirement
that SchemaTransformProvider::from return a SchemaTransform.
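
For a made-up transform, the boilerplate in question looks roughly like this
(a sketch against the method set of [1] and [2] at the time of writing; the
URN and collection names are invented):

    import java.util.Collections;
    import java.util.List;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.transforms.SchemaTransform;
    import org.apache.beam.sdk.schemas.transforms.SchemaTransformProvider;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.PCollectionRowTuple;
    import org.apache.beam.sdk.values.Row;

    public class ExamplePassthroughProvider implements SchemaTransformProvider {
      @Override
      public String identifier() {
        return "beam:schematransform:example:passthrough:v1";
      }

      @Override
      public Schema configurationSchema() {
        return Schema.builder().addStringField("tag").build();
      }

      @Override
      public SchemaTransform from(Row config) {
        // This wrapper exists only to hand back the PTransform below; under
        // the proposal, from() would return the PTransform itself.
        return new SchemaTransform() {
          @Override
          public PTransform<PCollectionRowTuple, PCollectionRowTuple> buildTransform() {
            return new PTransform<PCollectionRowTuple, PCollectionRowTuple>() {
              @Override
              public PCollectionRowTuple expand(PCollectionRowTuple input) {
                // A trivial pass-through, for brevity.
                return PCollectionRowTuple.of("output", input.get("input"));
              }
            };
          }
        };
      }

      @Override
      public List<String> inputCollectionNames() {
        return Collections.singletonList("input");
      }

      @Override
      public List<String> outputCollectionNames() {
        return Collections.singletonList("output");
      }
    }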

I'm bringing this up because if we are looking to encourage contribution to
cross-language use cases, we should make it simpler and less convoluted to
develop portable transforms.

There are a number of SchemaTransforms already developed, but applying
these changes to them should be straightforward. If people think this is a
good idea, I can open a PR and implement them.

Best,
Ahmed

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
[3]
https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
[4]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
[5]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
[6]
https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137


Re: [Proposal] Remove of Fix the Beam Dependency Check Report Job

2023-05-30 Thread Jack McCluskey via dev
Hi everyone,

Just bumping this again now that the long weekend is behind us. If no one
advocates for fixing the job in the next few days I'll assume a lazy
consensus and remove it.

I also want to point out a typo in the subject: it should be "Remove *or*
 Fix."

Thanks,

Jack McCluskey

On Thu, May 25, 2023 at 3:16 PM Jack McCluskey 
wrote:

> Hey everyone,
>
> The Beam Dependency Check Report email (like
> https://lists.apache.org/thread/tc9v1d66rx77wzvrjnkcf0jo3rxtmrhn) has not
> had a successful run since July 21st, 2022. I've done a little bit
> of digging into the problem and have found that the issue lies in a query
> to a "Python Compatibility Checking Service" that is just an IP address,
> passing along the package name and version and specifying that it wants
> Python 2 packages specifically. I made a few brief attempts to figure out
> what that IP address was supposed to lead to and didn't turn up anything;
> however, that doesn't seem to matter since the root of the problem is that
> the job cannot connect to anything at that address, so the build fails and
> the email is sent out without a body.
>
> I started a bit of work this afternoon trying to update the job to direct
> its Python-related queries to PyPI's JSON API (
> https://github.com/apache/beam/pull/26897); however, I question the need
> for this automated email at all given that we added Dependabot to the
> repository around 6 weeks before the Jenkins job started failing. If
> there's a good reason to fix it I'll keep digging, otherwise I'm in favor
> of removing the job altogether.
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>
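
For reference, the PyPI JSON API mentioned above is a plain HTTPS endpoint,
https://pypi.org/pypi/<package>/json, whose response includes the latest
release under info.version. A minimal standalone query (illustrative only,
independent of how the Jenkins job itself is implemented):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PypiLatestVersion {
      public static void main(String[] args) throws Exception {
        String pkg = args.length > 0 ? args[0] : "apache-beam";
        HttpRequest request =
            HttpRequest.newBuilder(URI.create("https://pypi.org/pypi/" + pkg + "/json"))
                .GET()
                .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // The body is JSON; the latest release version is at info.version.
        System.out.println(response.body());
      }
    }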


Beam High Priority Issue Report (32)

2023-05-30 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/26911 [Bug]: UNNEST ARRAY with a nested 
ROW (described below)
https://github.com/apache/beam/issues/26550 [Failing Test]: 
beam_PostCommit_Java_PVR_Spark_Batch
https://github.com/apache/beam/issues/26547 [Failing Test]: 
beam_PostCommit_Java_DataflowV2
https://github.com/apache/beam/issues/26354 [Bug]: BigQueryIO direct read not 
reading all rows when set --setEnableBundling=true
https://github.com/apache/beam/issues/26343 [Bug]: 
apache_beam.io.gcp.bigquery_read_it_test.ReadAllBQTests.test_read_queries is 
flaky
https://github.com/apache/beam/issues/26329 [Bug]: BigQuerySourceBase does not 
propagate a Coder to AvroSource
https://github.com/apache/beam/issues/26041 [Bug]: Unable to create 
exactly-once Flink pipeline with stream source and file sink
https://github.com/apache/beam/issues/25975 [Bug]: Reducing parallelism in 
FlinkRunner leads to a data loss
https://github.com/apache/beam/issues/24776 [Bug]: Race condition in Python SDK 
Harness ProcessBundleProgress
https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/21714 
PulsarIOTest.testReadFromSimpleTopic is very flaky
https://github.com/apache/beam/issues/21708 beam_PostCommit_Java_DataflowV2, 
testBigQueryStorageWrite30MProto failing consistently
https://github.com/apache/beam/issues/21706 Flaky timeout in github Python unit 
test action 
StatefulDoFnOnDirectRunnerTest.test_dynamic_timer_clear_then_set_timer
https://github.com/apache/beam/issues/21643 FnRunnerTest with non-trivial 
(order 1000 elements) numpy input flakes in non-cython environment
https://github.com/apache/beam/issues/21476 WriteToBigQuery Dynamic table 
destinations returns wrong tableId
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21424 Java VR (Dataflow, V2, Streaming) 
failing: ParDoTest$TimestampTests/OnWindowExpirationTests
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19465 Explore possibilities to lower 
in-use IP address quota footprint.


P1 Issues with no update in the last week:

https://github.com/apache/beam/issues/23525 [Bug]: Default PubsubMessage coder 
will drop message id and orderingKey
https://github.com/apache/beam/issues/21645 
beam_PostCommit_XVR_GoUsingJava_Dataflow fails on some test transforms