Re: [Beam Playground] Local Development Environment: Kubernetes vs Docker Compose

2022-12-12 Thread Chamikara Jayalath via dev
For this kind of decision, I'd write a short doc with pros and cons and
suggest an option. We can discuss further in the doc or on the dev list if
needed. If there's significant disagreement we could even go for a vote on the
dev list, but usually we do not get to that (and go by lazy consensus [1]).

BTW we had a very similar discussion previously regarding using one of
these systems for hosting datastores for Beam I/O testing.
https://lists.apache.org/thread/r0gn5fzp6zy6c277r1sqvb4o9rc45rxf

Thanks,
Cham

[1] https://community.apache.org/committers/lazyConsensus.html

On Mon, Dec 12, 2022 at 11:16 AM Damon Douglas via dev 
wrote:

> Hello Everyone,
>
> *Even if this is your first day learning Beam, please feel welcome to
> vote.*
>
> *Please answer this single-question form with your preference* for
> Kubernetes [1] versus Docker Compose [2] in local development of the Beam
> Playground [3].  The form provides short and long versions of the
> explanation, if needed.
>
> *https://forms.gle/GBZZ9nCzj5EvXVgQ8*
>
> Thank you for your time and help.
>
> Best,
>
> Damon
>
> *References*:
>
> 1. Kubernetes - an open-source system for automating deployment, scaling,
> and management of containerized applications.
> See https://kubernetes.io/
> 2. Docker Compose - a tool for defining and running multi-container Docker
> applications.
> See https://docs.docker.com/compose/
> 3. Beam Playground - a full stack web application to execute Apache Beam
> snippets in a modern browser.
> See https://play.beam.apache.org/
>


Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Chamikara Jayalath via dev
Yeah, I don't think we either finalized or documented (on the website) the
previous iteration. This doc seems to contain details from the documents
shared in the previous iteration.

Thanks,
Cham



On Mon, Dec 12, 2022 at 6:49 PM Robert Burke  wrote:

> I think ultimately: until the docs are clearly available on the Beam site
> itself, it's not documentation. See also: design docs, previous emails, and
> similar.
>
> On Mon, Dec 12, 2022, 6:07 PM Andrew Pilloud via dev 
> wrote:
>
>> I believe the previous iteration was here:
>> https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk
>>
>> The associated docs are:
>> https://s.apache.org/beam-io-api-standard-documentation
>> https://s.apache.org/beam-io-api-standard
>>
>> This is missing all the relational stuff that was in those docs, this
>> appears to be another attempt starting from the beginning?
>>
>> Andrew
>>
>>
>> On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Thanks for writing this!
>>>
>>> IIRC, a similar design doc was sent for review here a while ago. Is
>>> this just an updated version or a new one?
>>>
>>> —
>>> Alexey
>>>
>>> On 11 Dec 2022, at 15:16, Herman Mak via dev 
>>> wrote:
>>>
>>> Hello Everyone,
>>>
>>> *TLDR*
>>>
>>> Should we adopt a set of standards that Connector I/Os should adhere to?
>>> Attached is a first version of a Beam I/O Standards guideline that
>>> includes opinionated best practices across important components of a
>>> Connector I/O, namely Documentation, Development and Testing.
>>>
>>> *The Long Version*
>>>
>>> Apache Beam is a unified open-source programming model for both batch
>>> and streaming. It runs on multiple platform runners and integrates with
>>> over 50 services using individually developed I/O Connectors.
>>>
>>> Given that Apache Beam connectors are written by many different
>>> developers and at varying points in time, they vary in syntax style,
>>> documentation completeness and testing done. For a new adopter of Apache
>>> Beam, that can definitely cause some uncertainty.
>>>
>>> So should we adopt a set of standards that Connector I/Os should adhere
>>> to?
>>> Attached is a first version, in Doc format, of a Beam I/O Standards
>>> guideline that includes opinionated best practices across important
>>> components of a Connector I/O, namely Documentation, Development and
>>> Testing. And the aim is to incorporate this into the documentation and to
>>> have it referenced as standards for new Connector I/Os (and ideally have
>>> existing Connectors upgraded over time). If it looks helpful, the immediate
>>> next step is that we can convert it into a .md as a PR into the Beam repo!
>>>
>>> Thanks and looking forward to feedback and discussion,
>>>
>>>  [PUBLIC] Beam I/O Standards
>>> 
>>>
>>> Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
>>> herman...@google.com |  +852-3923-5417
>>>
>>>
>>>
>>>


Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Robert Burke
I think ultimately: until the docs are clearly available on the Beam site
itself, it's not documentation. See also: design docs, previous emails, and
similar.

On Mon, Dec 12, 2022, 6:07 PM Andrew Pilloud via dev 
wrote:

> I believe the previous iteration was here:
> https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk
>
> The associated docs are:
> https://s.apache.org/beam-io-api-standard-documentation
> https://s.apache.org/beam-io-api-standard
>
> This is missing all the relational stuff that was in those docs, this
> appears to be another attempt starting from the beginning?
>
> Andrew
>
>
> On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko 
> wrote:
>
>> Thanks for writing this!
>>
>> IIRC, a similar design doc was sent for review here a while ago. Is
>> this just an updated version or a new one?
>>
>> —
>> Alexey
>>
>> On 11 Dec 2022, at 15:16, Herman Mak via dev  wrote:
>>
>> Hello Everyone,
>>
>> *TLDR*
>>
>> Should we adopt a set of standards that Connector I/Os should adhere to?
>> Attached is a first version of a Beam I/O Standards guideline that
>> includes opinionated best practices across important components of a
>> Connector I/O, namely Documentation, Development and Testing.
>>
>> *The Long Version*
>>
>> Apache Beam is a unified open-source programming model for both batch and
>> streaming. It runs on multiple platform runners and integrates with over 50
>> services using individually developed I/O Connectors.
>>
>> Given that Apache Beam connectors are written by many different
>> developers and at varying points in time, they vary in syntax style,
>> documentation completeness and testing done. For a new adopter of Apache
>> Beam, that can definitely cause some uncertainty.
>>
>> So should we adopt a set of standards that Connector I/Os should adhere
>> to?
>> Attached is a first version, in Doc format, of a Beam I/O Standards
>> guideline that includes opinionated best practices across important
>> components of a Connector I/O, namely Documentation, Development and
>> Testing. And the aim is to incorporate this into the documentation and to
>> have it referenced as standards for new Connector I/Os (and ideally have
>> existing Connectors upgraded over time). If it looks helpful, the immediate
>> next step is that we can convert it into a .md as a PR into the Beam repo!
>>
>> Thanks and looking forward to feedback and discussion,
>>
>>  [PUBLIC] Beam I/O Standards
>> 
>>
>> Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
>> herman...@google.com |  +852-3923-5417
>>
>>
>>
>>


Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Andrew Pilloud via dev
I believe the previous iteration was here:
https://lists.apache.org/thread/3o8glwkn70kqjrf6wm4dyf8bt27s52hk

The associated docs are:
https://s.apache.org/beam-io-api-standard-documentation
https://s.apache.org/beam-io-api-standard

This is missing all the relational stuff that was in those docs; this
appears to be another attempt starting from the beginning?

Andrew


On Mon, Dec 12, 2022 at 9:57 AM Alexey Romanenko 
wrote:

> Thanks for writing this!
>
> IIRC, a similar design doc was sent for review here a while ago. Is this
> just an updated version or a new one?
>
> —
> Alexey
>
> On 11 Dec 2022, at 15:16, Herman Mak via dev  wrote:
>
> Hello Everyone,
>
> *TLDR*
>
> Should we adopt a set of standards that Connector I/Os should adhere to?
> Attached is a first version of a Beam I/O Standards guideline that
> includes opinionated best practices across important components of a
> Connector I/O, namely Documentation, Development and Testing.
>
> *The Long Version*
>
> Apache Beam is a unified open-source programming model for both batch and
> streaming. It runs on multiple platform runners and integrates with over 50
> services using individually developed I/O Connectors.
>
> Given that Apache Beam connectors are written by many different developers
> and at varying points in time, they vary in syntax style, documentation
> completeness and testing done. For a new adopter of Apache Beam, that can
> definitely cause some uncertainty.
>
> So should we adopt a set of standards that Connector I/Os should adhere
> to?
> Attached is a first version, in Doc format, of a Beam I/O Standards
> guideline that includes opinionated best practices across important
> components of a Connector I/O, namely Documentation, Development and
> Testing. And the aim is to incorporate this into the documentation and to
> have it referenced as standards for new Connector I/Os (and ideally have
> existing Connectors upgraded over time). If it looks helpful, the immediate
> next step is that we can convert it into a .md as a PR into the Beam repo!
>
> Thanks and looking forward to feedback and discussion,
>
>  [PUBLIC] Beam I/O Standards
> 
>
> Herman Mak |  Customer Engineer, Hong Kong, Google Cloud |
> herman...@google.com |  +852-3923-5417
>
>
>
>


Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Cristian Constantinescu
Hi,

"As for the pipeline update feature, we've long discussed
having "pick-your-implementation" transforms that specify
alternative, equivalent implementations."

Could someone point me to where this was discussed please? I seem to have
missed that whole topic. Is it like a dependency injection type of thing?
If so, it's one thing I would love to see in Beam.

Thanks,
Cristian

On Mon, Dec 12, 2022 at 4:23 PM Robert Bradshaw via dev 
wrote:

> Saving up all the breaking changes until a major release definitely
> has its downsides (look at Python 3). The migration path is often as
> important (if not more so) than the final destination.
>
> As for this particular change, I would question how the benefit (it's
> unclear what the exact benefit is--better internal organization?)
> exceeds the pain of making every user refactor their code. I think a
> stronger case can be made for things like the Avro dependency that
> cause real pain.
>
> As for the pipeline update feature, we've long discussed having
> "pick-your-implementation" transforms that specify alternative,
> equivalent implementations. Upgrades can choose the old one whereas
> new pipelines can get the latest and greatest. It won't solve all
> issues, and requires keeping old codepaths around, but could be an
> important step forward.
>
> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
> >
> > I agree with Moritz. To answer a few specifics in my own words:
> >
> >  - It is a perfectly sensible refactor, but as a counterpoint without
> file-based IO the SDK isn't functional so it is also a reasonable design
> point to have this included. There are other things in the core SDK that
> are far less "core" and could be moved out with greater benefit. The main
> goal for any separation of modules would be lighter weight transitive
> dependencies, IMO.
> >
> >  - No, Beam has not made any deliberate breaking changes of this nature.
> Hence we are still on major version 2. We have made some bugfixes for data
> loss risks that could be called "breaking changes" but since the feature
> was unsafe to use in the first place we did not bump the major version.
> >
> >  - It is sometimes possible to do such a refactor and have the
> deprecated location proxy to the new location. In this case that seems hard
> to achieve.
> >
> >  - It is not actually necessary to maintain both locations, as we can
> declare the old location will be unmaintained (but left alone) and all new
> development goes to the new location. That isn't a great choice for users
> who may simply upgrade their SDK version and not notice that their old code
> is now pointing at a version that will not receive e.g. security updates.
> >
> >  - I like the style where if/when we transition from Beam 2 to Beam 3 we
> should have the exact functionality of Beam 3 available as an opt-in flag
> first. So if a user passes --beam-3 they get exactly what will be the
> default functionality when we bump the major version. It really is a
> problem to do a whole bunch of stuff feverishly before a major version
> bump. The other style that I think works well is the linux kernel style
> where major versions alternate between stable and unstable (in other words,
> returning to the 0.x style with every alternating version).
> >
> >  - I do think Beam suffers from fear and inability to do significant
> code gardening. I don't think backwards compatibility in the code sense is
> the biggest blocker. I think the "pipeline update" feature is perhaps the
> thing most holding Beam back from making radical rapid forward progress.
> >
> > Kenn
> >
> > On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
> >>
> >> Hi Damon,
> >>
> >>
> >>
> >> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well to such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and respectively AvroIO, of course).
> >>
> >> Such changes essentially always require duplicating code to continue
> supporting a deprecated legacy code path to not break users’ code. But this
> comes at a very high price. Until the deprecated code path can be finally
> removed again, it must be maintained in two places.
> >>
> >> Unfortunately, the removal of deprecated code is rather problematic
> without a major version release as it would break semantic versioning and
> people’s expectations. With that, deprecations bear the inherent risk of
> unintentionally degrading quality rather than improving it.
> >>
> >> I’d therefore recommend against such efforts unless there are very strong
> reasons to do so.
> >>
> >>
> >>
> >> Best, Moritz
> >>
> >>
> >>
> >> On 07.12.22, 18:05, "Damon Douglas via dev" 
> wrote:
> >>
> >>
> >>
> >> Hello Everyone, If you identify yourself on the Beam learning journey,
> even if this is your first day, please see yourself as a welcome
> participant in this conversation and consider reviewing the bottom portion
> of this 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Robert Bradshaw via dev
Saving up all the breaking changes until a major release definitely
has its downsides (look at Python 3). The migration path is often as
important (if not more so) than the final destination.

As for this particular change, I would question how the benefit (it's
unclear what the exact benefit is--better internal organization?)
exceeds the pain of making every user refactor their code. I think a
stronger case can be made for things like the Avro dependency that
cause real pain.

As for the pipeline update feature, we've long discussed having
"pick-your-implementation" transforms that specify alternative,
equivalent implementations. Upgrades can choose the old one whereas
new pipelines can get the latest and greatest. It won't solve all
issues, and requires keeping old codepaths around, but could be an
important step forward.
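
To make the "pick-your-implementation" idea concrete, here is a minimal
sketch in plain Java. It is not a Beam API: VersionedTransform, register,
pinned and latest are hypothetical names, and the two implementations are
deliberately equivalent in output. It only illustrates how an updated
pipeline could keep the implementation it started with while new pipelines
pick up the newest one.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Beam APIs. */
final class PickYourImplementation {

  /** Holds equivalent implementations of one logical transform, keyed by version tag. */
  static final class VersionedTransform<I, O> {
    private final Map<String, Function<I, O>> impls = new LinkedHashMap<>();
    private String latestVersion;

    VersionedTransform<I, O> register(String version, Function<I, O> impl) {
      impls.put(version, impl);
      latestVersion = version; // the most recently registered version becomes the default
      return this;
    }

    /** New pipelines call this and get the newest implementation. */
    Function<I, O> latest() {
      return impls.get(latestVersion);
    }

    /** Updated pipelines pin the version they were originally built with. */
    Function<I, O> pinned(String version) {
      Function<I, O> impl = impls.get(version);
      if (impl == null) {
        throw new IllegalArgumentException("Unknown version: " + version);
      }
      return impl;
    }
  }

  /** Two equivalent ways of summing a list; the old code path stays around as "v1". */
  static VersionedTransform<List<Integer>, Integer> sum() {
    return new VersionedTransform<List<Integer>, Integer>()
        .register("v1", xs -> {
          int total = 0;
          for (int x : xs) {
            total += x;
          }
          return total;
        })
        .register("v2", xs -> xs.stream().mapToInt(Integer::intValue).sum());
  }

  public static void main(String[] args) {
    VersionedTransform<List<Integer>, Integer> t = sum();
    System.out.println(t.pinned("v1").apply(List.of(1, 2, 3)));
    System.out.println(t.latest().apply(List.of(1, 2, 3)));
  }
}
```

As noted above, the cost is that the "v1" code path has to be kept around
for as long as any pipeline still pins it.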

On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>
> I agree with Moritz. To answer a few specifics in my own words:
>
>  - It is a perfectly sensible refactor, but as a counterpoint without 
> file-based IO the SDK isn't functional so it is also a reasonable design 
> point to have this included. There are other things in the core SDK that are 
> far less "core" and could be moved out with greater benefit. The main goal 
> for any separation of modules would be lighter weight transitive 
> dependencies, IMO.
>
>  - No, Beam has not made any deliberate breaking changes of this nature. 
> Hence we are still on major version 2. We have made some bugfixes for data 
> loss risks that could be called "breaking changes" but since the feature was 
> unsafe to use in the first place we did not bump the major version.
>
>  - It is sometimes possible to do such a refactor and have the deprecated 
> location proxy to the new location. In this case that seems hard to achieve.
>
>  - It is not actually necessary to maintain both locations, as we can declare 
> the old location will be unmaintained (but left alone) and all new 
> development goes to the new location. That isn't a great choice for users who 
> may simply upgrade their SDK version and not notice that their old code is 
> now pointing at a version that will not receive e.g. security updates.
>
>  - I like the style where if/when we transition from Beam 2 to Beam 3 we 
> should have the exact functionality of Beam 3 available as an opt-in flag 
> first. So if a user passes --beam-3 they get exactly what will be the default 
> functionality when we bump the major version. It really is a problem to do a 
> whole bunch of stuff feverishly before a major version bump. The other style 
> that I think works well is the linux kernel style where major versions 
> alternate between stable and unstable (in other words, returning to the 0.x 
> style with every alternating version).
>
>  - I do think Beam suffers from fear and inability to do significant code 
> gardening. I don't think backwards compatibility in the code sense is the 
> biggest blocker. I think the "pipeline update" feature is perhaps the thing 
> most holding Beam back from making radical rapid forward progress.
>
> Kenn
>
> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
>>
>> Hi Damon,
>>
>>
>>
>> I fear the current release / versioning strategy of Beam doesn’t lend itself
>> well to such breaking changes. Alexey and I have spent quite some time
>> discussing how to proceed with the problematic Avro dependency in core (and
>> respectively AvroIO, of course).
>>
>> Such changes essentially always require duplicating code to continue
>> supporting a deprecated legacy code path to not break users’ code. But this
>> comes at a very high price. Until the deprecated code path can be finally
>> removed again, it must be maintained in two places.
>>
>> Unfortunately, the removal of deprecated code is rather problematic without
>> a major version release as it would break semantic versioning and people’s
>> expectations. With that, deprecations bear the inherent risk of
>> unintentionally degrading quality rather than improving it.
>>
>> I’d therefore recommend against such efforts unless there are very strong
>> reasons to do so.
>>
>>
>>
>> Best, Moritz
>>
>>
>>
>> On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:
>>
>>
>>
>> Hello Everyone,
>>
>>
>>
>> If you identify yourself on the Beam learning journey, even if this is your 
>> first day, please see yourself as a welcome participant in this conversation 
>> and consider reviewing the bottom portion of this email for guidance.
>>
>>
>>
>> The Short Version (For those with Java Beam SDK knowledge):
>>
>>
>>
>> Should we migrate FileIO / TextIO and related classes from :sdks:java:core 
>> to :sdks:java:io:file?  If so, should we target such a migration to a 

[Beam Playground] Local Development Environment: Kubernetes vs Docker Compose

2022-12-12 Thread Damon Douglas via dev
Hello Everyone,

*Even if this is your first day learning Beam, please feel welcome to vote.*

*Please answer this single-question form with your preference* for Kubernetes
[1] versus Docker Compose [2] in local development of the Beam Playground
[3].  The form provides short and long versions of the explanation, if needed.

*https://forms.gle/GBZZ9nCzj5EvXVgQ8*

Thank you for your time and help.

Best,

Damon

*References*:

1. Kubernetes - an open-source system for automating deployment, scaling,
and management of containerized applications.
See https://kubernetes.io/
2. Docker Compose - a tool for defining and running multi-container Docker
applications.
See https://docs.docker.com/compose/
3. Beam Playground - a full stack web application to execute Apache Beam
snippets in a modern browser.
See https://play.beam.apache.org/


Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Kenneth Knowles
I agree with Moritz. To answer a few specifics in my own words:

 - It is a perfectly sensible refactor, but as a counterpoint without
file-based IO the SDK isn't functional so it is also a reasonable design
point to have this included. There are other things in the core SDK that
are far less "core" and could be moved out with greater benefit. The main
goal for any separation of modules would be lighter weight transitive
dependencies, IMO.

 - No, Beam has not made any deliberate breaking changes of this nature.
Hence we are still on major version 2. We have made some bugfixes for data
loss risks that could be called "breaking changes" but since the feature
was unsafe to use in the first place we did not bump the major version.

 - It is sometimes possible to do such a refactor and have the deprecated
location proxy to the new location. In this case that seems hard to achieve.

 - It is not actually necessary to maintain both locations, as we can
declare the old location will be unmaintained (but left alone) and all new
development goes to the new location. That isn't a great choice for users
who may simply upgrade their SDK version and not notice that their old code
is now pointing at a version that will not receive e.g. security updates.

 - I like the style where if/when we transition from Beam 2 to Beam 3 we
should have the exact functionality of Beam 3 available as an opt-in flag
first. So if a user passes --beam-3 they get exactly what will be the
default functionality when we bump the major version. It really is a
problem to do a whole bunch of stuff feverishly before a major version
bump. The other style that I think works well is the Linux kernel style
where major versions alternate between stable and unstable (in other words,
returning to the 0.x style with every alternating version).

 - I do think Beam suffers from fear and inability to do significant code
gardening. I don't think backwards compatibility in the code sense is the
biggest blocker. I think the "pipeline update" feature is perhaps the thing
most holding Beam back from making radical rapid forward progress.
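
For readers newer to this kind of refactor, the "deprecated location proxies
to the new location" pattern from the third bullet can be sketched as below.
This is only the general shape, in plain Java with hypothetical class names
(OldLineReader, NewLineReader); it is not FileIO / TextIO, and as said above
it seems hard to achieve in this particular case.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a deprecated class forwarding to its new home. */
final class DeprecatedProxySketch {

  /** Stand-in for the class at the new location; all real logic lives here. */
  static final class NewLineReader {
    static List<String> readLines(String text) {
      return Arrays.asList(text.split("\n"));
    }
  }

  /**
   * Stand-in for the class at the old location. It keeps the old name and
   * signature so existing user code compiles, but only delegates.
   */
  @Deprecated
  static final class OldLineReader {
    static List<String> readLines(String text) {
      return NewLineReader.readLines(text);
    }
  }

  public static void main(String[] args) {
    // Old call sites keep working (with a deprecation warning at compile time).
    System.out.println(OldLineReader.readLines("first\nsecond"));
  }
}
```

With this shape there is only one implementation to maintain, so fixes made
in the new location reach users who are still calling the old one.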

Kenn

On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:

> Hi Damon,
>
>
>
> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well to such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and respectively AvroIO, of course).
>
> Such changes essentially always require duplicating code to continue
> supporting a deprecated legacy code path to not break users’ code. But this
> comes at a very high price. Until the deprecated code path can be finally
> removed again, it must be maintained in two places.
>
> Unfortunately, the removal of deprecated code is rather problematic
> without a major version release as it would break semantic versioning and
> people’s expectations. With that, deprecations bear the inherent risk of
> unintentionally degrading quality rather than improving it.
>
> I’d therefore recommend against such efforts unless there are very strong
> reasons to do so.
>
>
>
> Best, Moritz
>
>
>
> On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:
>
>
>
> Hello Everyone,
>
>
>
> *If you identify yourself on the Beam learning journey, even if this is
> your first day, please see yourself as a welcome participant in this
> conversation and consider reviewing the bottom portion of this email for
> guidance.*
>
>
>
> *The Short Version (For those with Java Beam SDK knowledge)*:
>
>
>
> Should we migrate FileIO / TextIO and related classes from :sdks:java:core
> to :sdks:java:io:file?  If so, should we target such a migration to a
> future Beam version with repeated announcements?  Does the Beam repository
> have any example of a similar change in the past?  What learnings from said
> past change could be potentially applied to this one?
>
>
>
> *The Long Version (For those on the learning path)*:
>
>
>
> This email is more about our repository organization than about Beam
> itself.  The proposal is to move two highly used classes (and anything
> related) in our Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub
> repository uses software called Gradle [3] to automate routine code tasks
> such as build and test.  Gradle projects, such as Beam, organize code in
> what are called modules [4].  The three main ingredients that make a module
> are 1) a unique directory path, 2) a file called build.gradle (or
> build.gradle.kts) in this directory, and 3) a reference to the Gradle module
> in a settings.gradle (or settings.gradle.kts) file at the root of the
> repository.
>
>
>
> The gradle documentation discusses why such organization might matter and
> how to achieve this 

[Proposal] Change to Default PubsubMessage Coder

2022-12-12 Thread Evan Galpin
Hi folks,

I'd like to solicit feedback on the notion of using
PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
default coder for Pubsub messages instead of the current default of
PubsubMessageWithAttributesCoder.

Not long ago, support for reading and writing Pubsub messages with an
OrderingKey was added to Beam[2].  Part of this change involved adding
a new Coder for PubsubMessage in order to capture and propagate the
orderingKey[1].  This change revealed that in cases where the coder type
for PubsubMessage is inferred, it is possible to accidentally and silently
nullify fields like MessageId and OrderingKey in a way that is not at all
obvious to users[3].
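
To illustrate the failure mode, here is a minimal sketch in plain Java. It
does not use Beam's PubsubMessage or its coders; Message, MessageCoder,
PAYLOAD_ONLY and FULL are hypothetical, simplified stand-ins (the narrow
coder here encodes only the payload). The point is only that a round trip
through a coder that does not know about a field silently returns null for
that field, while the wider coder preserves it at the cost of more bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch; these are not Beam classes. */
final class CoderFieldLossSketch {

  static final class Message {
    final String payload;
    final String messageId;   // null after a round trip through PAYLOAD_ONLY
    final String orderingKey; // null after a round trip through PAYLOAD_ONLY

    Message(String payload, String messageId, String orderingKey) {
      this.payload = payload;
      this.messageId = messageId;
      this.orderingKey = orderingKey;
    }
  }

  interface MessageCoder {
    void encode(Message m, DataOutputStream out) throws IOException;

    Message decode(DataInputStream in) throws IOException;
  }

  /** Narrow coder: writes the payload only, so the other fields cannot survive. */
  static final MessageCoder PAYLOAD_ONLY = new MessageCoder() {
    @Override
    public void encode(Message m, DataOutputStream out) throws IOException {
      out.writeUTF(m.payload);
    }

    @Override
    public Message decode(DataInputStream in) throws IOException {
      return new Message(in.readUTF(), null, null);
    }
  };

  /** Wide coder: writes every field (this sketch assumes none of them is null). */
  static final MessageCoder FULL = new MessageCoder() {
    @Override
    public void encode(Message m, DataOutputStream out) throws IOException {
      out.writeUTF(m.payload);
      out.writeUTF(m.messageId);
      out.writeUTF(m.orderingKey);
    }

    @Override
    public Message decode(DataInputStream in) throws IOException {
      return new Message(in.readUTF(), in.readUTF(), in.readUTF());
    }
  };

  static byte[] toBytes(MessageCoder coder, Message m) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      coder.encode(m, out);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static Message roundTrip(MessageCoder coder, Message m) {
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(toBytes(coder, m)))) {
      return coder.decode(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Message original = new Message("hello", "id-1", "key-A");
    System.out.println(roundTrip(PAYLOAD_ONLY, original).orderingKey);
    System.out.println(roundTrip(FULL, original).orderingKey);
  }
}
```

In the sketch nothing fails or warns when the narrow coder is used, which is
the "silent" part of the problem described above.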

So far two potential drawbacks of this proposal have been identified:
1. Update compatibility for pipelines using PubsubIO might require users to
explicitly specify the current default coder (
PubsubMessageWithAttributesCoder)
2. Messages would require a larger number of bytes to store as compared to
the current default (which could again be overcome by users specifying the
current default coder)

What other potential drawbacks might there be? I look forward to hearing
others' input!

Thanks,
Evan

[1]
https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
[2] https://github.com/apache/beam/pull/22216
[3] https://github.com/apache/beam/issues/23525


Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Alexey Romanenko
Thanks for writing this!

IIRC, a similar design doc was sent for review here a while ago. Is this just
an updated version or a new one?

—
Alexey

> On 11 Dec 2022, at 15:16, Herman Mak via dev  wrote:
> 
> Hello Everyone,
> 
> *TLDR*
> 
> Should we adopt a set of standards that Connector I/Os should adhere to? 
> Attached is a first version of a Beam I/O Standards guideline that includes 
> opinionated best practices across important components of a Connector I/O, 
> namely Documentation, Development and Testing. 
> 
> *The Long Version*
> 
> Apache Beam is a unified open-source programming model for both batch and 
> streaming. It runs on multiple platform runners and integrates with over 50 
> services using individually developed I/O Connectors.
> 
> Given that Apache Beam connectors are written by many different developers 
> and at varying points in time, they vary in syntax style, documentation 
> completeness and testing done. For a new adopter of Apache Beam, that can 
> definitely cause some uncertainty.
> 
> So should we adopt a set of standards that Connector I/Os should adhere to? 
> Attached is a first version, in Doc format, of a Beam I/O Standards guideline 
> that includes opinionated best practices across important components of a 
> Connector I/O, namely Documentation, Development and Testing. And the aim is 
> to incorporate this into the documentation and to have it referenced as 
> standards for new Connector I/Os (and ideally have existing Connectors 
> upgraded over time). If it looks helpful, the immediate next step is that we 
> can convert it into a .md as a PR into the Beam repo!
> 
> Thanks and looking forward to feedback and discussion,
> 
>  [PUBLIC] Beam I/O Standards 
> 
> 
> 
> Herman Mak |   Customer Engineer, Hong Kong, Google Cloud |
> herman...@google.com  |+852-3923-5417 
> 
> 
> 



[Question] DebeziumIO ReadFromDebezium getting stuck

2022-12-12 Thread Miguel Hernández Sandoval
Hi everyone. I'm having trouble making a performance test work for the
Debezium connector. This test reads the events from a PostgreSQL database
produced by a number of operations (inserts, deletes, updates) done for ~20
min.
When running in DirectRunner the pipeline reads the messages, stops, and
outputs the messages that were read. The problem is that when it runs in
DataflowRunner, the pipeline doesn't stop and seems to be doing nothing,
since it's not making any progress or printing any helpful logs.

I know that DebeziumIO is still experimental so I'm not sure if it lacks
some feature that is causing it not to run properly in Dataflow or if it
needs some specific configuration.

Thank you all for your help.

Here's the PR and a Dataflow run:
- https://github.com/apache/beam/pull/22344
- https://console.cloud.google.com/dataflow/jobs/us-west1/2022-12-07_12_51_09-13497028272697059516;bottomTab=JOB_LOGS;graphView=0;logsSeverity=ERROR?pageState=(%22dfTime%22:(%22s%22:%222022-12-07T20:51:09.921Z%22,%22e%22:%222022-12-07T21:55:08.936Z%22))=apache-beam-testing


- Mike Hernandez



Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-12 Thread Moritz Mack
Hi Damon,

I fear the current release / versioning strategy of Beam doesn’t lend itself
well to such breaking changes. Alexey and I have spent quite some time
discussing how to proceed with the problematic Avro dependency in core (and
respectively AvroIO, of course).
Such changes essentially always require duplicating code to continue supporting
a deprecated legacy code path to not break users’ code. But this comes at a
very high price. Until the deprecated code path can be finally removed again,
it must be maintained in two places.
Unfortunately, the removal of deprecated code is rather problematic without a
major version release as it would break semantic versioning and people’s
expectations. With that, deprecations bear the inherent risk of unintentionally
degrading quality rather than improving it.
I’d therefore recommend against such efforts unless there are very strong
reasons to do so.

Best, Moritz

On 07.12.22, 18:05, "Damon Douglas via dev"  wrote:

Hello Everyone,

If you identify yourself on the Beam learning journey, even if this is your 
first day, please see yourself as a welcome participant in this conversation 
and consider reviewing the bottom portion of this email for guidance.

The Short Version (For those with Java Beam SDK knowledge):

Should we migrate FileIO / TextIO and related classes from :sdks:java:core to 
:sdks:java:io:file?  If so, should we target such a migration to a future Beam 
version with repeated announcements?  Does the Beam repository have any example 
of a similar change in the past?  What learnings from said past change could be 
potentially applied to this one?

The Long Version (For those on the learning path):

This email is more about our repository organization than about Beam itself.  
The proposal is to move two highly used classes (and anything related) in our 
Java SDK called FileIO [1] and TextIO [2].  The Beam GitHub repository uses a 
build tool called Gradle [3] to automate routine code tasks such as build and 
test.  Gradle projects, such as Beam, organize code in what are called modules 
[4].  The three main ingredients that make a module are 1) a unique directory 
path, 2) a file called build.gradle (or build.gradle.kts) in this directory, 
and 3) a reference to the module in a settings.gradle (or settings.gradle.kts) 
file at the root of the repository.
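
As a rough illustration of the three ingredients above, the Python sketch 
below maps a Gradle project path to the directory Gradle uses by default and 
checks that a build file and a settings entry exist.  The helper names 
(module_dir, is_gradle_module) are hypothetical and not part of Gradle or 
Beam.  The sketch assumes the default layout, in which :a:b:c lives in a/b/c, 
and assumes the module is registered with a quoted literal path in the 
settings file; real builds can override both.

```python
from pathlib import Path

BUILD_FILES = ("build.gradle", "build.gradle.kts")
SETTINGS_FILES = ("settings.gradle", "settings.gradle.kts")


def module_dir(project_path: str) -> Path:
    """Map a project path such as ':sdks:java:core' to its default directory."""
    parts = [part for part in project_path.split(":") if part]
    return Path(*parts)


def is_gradle_module(repo_root: Path, project_path: str) -> bool:
    """Check the three ingredients: directory, build file, settings entry."""
    directory = repo_root / module_dir(project_path)
    has_build_file = any((directory / name).is_file() for name in BUILD_FILES)

    # Simplification (assumption): the settings file names the module as a
    # quoted literal, e.g. include(":sdks:java:io:file").
    quoted = (f'"{project_path}"', f"'{project_path}'")
    referenced = False
    for name in SETTINGS_FILES:
        settings = repo_root / name
        if settings.is_file():
            text = settings.read_text()
            referenced = referenced or any(q in text for q in quoted)

    return directory.is_dir() and has_build_file and referenced
```

For example, module_dir(":sdks:java:io:file") yields the relative path 
sdks/java/io/file, which is the directory the proposal would create.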

The Gradle documentation discusses why such organization matters and how 
to achieve it in large projects [5].  Essentially, modules give us 
mini-projects inside our large project, so that build and test automation can 
target just one portion of the larger repository.  In Beam, we have the module 
:sdks:java:core [6] with all things related to the core of Beam, whereas we 
have separate modules related to reading from and writing to various resources 
within :sdks:java:io [7].

The proposal suggests moving the aforementioned file reading and writing 
classes, FileIO and TextIO, and anything related, to their own 
:sdks:java:io:file module.  This would correspond to a new sdks/java/io/file 
directory and moving these classes into 
sdks/java/io/file/src/main/java/org/apache/beam/sdk/io/file.
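
To make the effect of such a move concrete, here is a small Python sketch 
that derives the source path and the Java import line before and after the 
proposed change.  It is only an illustration: the helpers (source_path, 
import_line) are hypothetical, the src/main/java prefix is an assumption 
based on the standard Gradle Java layout, and the new package name 
org.apache.beam.sdk.io.file is inferred from the proposed directory rather 
than confirmed anywhere.

```python
from pathlib import PurePosixPath

# Assumption: the standard Gradle Java source layout.
SRC_ROOT = "src/main/java"


def source_path(module: str, package: str, class_name: str) -> PurePosixPath:
    """Build the repository-relative path of a Java class in a Gradle module."""
    module_path = module.strip(":").replace(":", "/")
    package_path = package.replace(".", "/")
    return PurePosixPath(module_path, SRC_ROOT, package_path, f"{class_name}.java")


def import_line(package: str, class_name: str) -> str:
    """Build the import statement user code would need for the class."""
    return f"import {package}.{class_name};"


# Current location of TextIO in the core module.
before = source_path(":sdks:java:core", "org.apache.beam.sdk.io", "TextIO")
# Hypothetical location after the proposed move.
after = source_path(":sdks:java:io:file", "org.apache.beam.sdk.io.file", "TextIO")

print(before)
print(after)
print(import_line("org.apache.beam.sdk.io", "TextIO"))
print(import_line("org.apache.beam.sdk.io.file", "TextIO"))
```

If the package name changes along with the directory, every pipeline that 
imports these classes would need its import lines updated, which is why a 
move like this counts as a breaking change for users.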

Definitions / References:

1. FileIO - general-purpose transforms for working with files: listing files 
(matching), reading and writing.  See 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileIO.html

2. TextIO - Similar to FileIO but focused on text files.  See 
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.html

3. Gradle - a build automation tool used by the Apache Beam repository to 
automate code-related tasks.  See 
https://docs.gradle.org/current/userguide/what_is_gradle.html

4. Gradle Module - a subsection of your larger repository.  See 
https://docs.gradle.org/current/userguide/dependency_management_terminology.html#sub:terminology_module

5. Structuring Large Projects with Gradle - 

Beam High Priority Issue Report (40)

2022-12-12 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need 
attention.

See https://beam.apache.org/contribute/issue-priorities for the meaning and 
expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/24389 [Failing Test]: 
HadoopFormatIOElasticTest.classMethod ExceptionInInitializerError 
ContainerFetchException
https://github.com/apache/beam/issues/24384 [Bug]: 
RampupThrottlingFnTest.testRampupThrottler TooManyActualInvocations
https://github.com/apache/beam/issues/24383 [Bug]: Daemon will be stopped at 
the end of the build after the daemon was no longer found in the daemon registry
https://github.com/apache/beam/issues/24367 [Bug]: workflow.tar.gz cannot be 
passed to flink runner
https://github.com/apache/beam/issues/24313 [Flaky]: 
apache_beam/runners/portability/portable_runner_test.py::PortableRunnerTestWithSubprocesses::test_pardo_state_with_custom_key_coder
https://github.com/apache/beam/issues/24267 [Failing Test]: Timeout waiting to 
lock gradle
https://github.com/apache/beam/issues/24263 [Bug]: Remote call on 
apache-beam-jenkins-3 failed. The channel is closing down or has closed down
https://github.com/apache/beam/issues/23944  beam_PreCommit_Python_Cron 
regularily failing - test_pardo_large_input flaky
https://github.com/apache/beam/issues/23709 [Flake]: Spark batch flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElement and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundle
https://github.com/apache/beam/issues/23286 [Bug]: 
beam_PerformanceTests_InfluxDbIO_IT Flaky > 50 % Fail 
https://github.com/apache/beam/issues/22969 Discrepancy in behavior of 
`DoFn.process()` when `yield` is combined with `return` statement, or vice versa
https://github.com/apache/beam/issues/22961 [Bug]: WriteToBigQuery silently 
skips most of records without job fail
https://github.com/apache/beam/issues/22913 [Bug]: 
beam_PostCommit_Java_ValidatesRunner_Flink is flakes in 
org.apache.beam.sdk.transforms.GroupByKeyTest$BasicTests.testAfterProcessingTimeContinuationTriggerUsingState
https://github.com/apache/beam/issues/22605 [Bug]: Beam Python failure for 
dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest.test_metrics_it
https://github.com/apache/beam/issues/22321 
PortableRunnerTestWithExternalEnv.test_pardo_large_input is regularly failing 
on jenkins
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output 
to Failed Inserts PCollection
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not 
raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21561 
ExternalPythonTransformTest.trivialPythonTransform flaky
https://github.com/apache/beam/issues/21480 flake: 
FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21474 Flaky tests: Gradle build daemon 
disappeared unexpectedly
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: 
Connection refused
https://github.com/apache/beam/issues/21462 Flake in 
org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21333 Flink testParDoRequiresStableInput 
flaky
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not 
follow spec
https://github.com/apache/beam/issues/21261 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit 
data at GC time
https://github.com/apache/beam/issues/21121 
apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
 flakey
https://github.com/apache/beam/issues/21113 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/21104 Flaky: 
apache_beam.runners.portability.fn_api_runner.fn_runner_test.FnApiRunnerTestWithGrpcAndMultiWorkers
https://github.com/apache/beam/issues/20976 
apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
 is flaky
https://github.com/apache/beam/issues/20975 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with 
grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM 
on Flink
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit 
empty pane when it should
https://github.com/apache/beam/issues/19814 Flink streaming flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful and 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
https://github.com/apache/beam/issues/19734 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in 

Re: [Proposal] Adopt a Beam I/O Standard

2022-12-12 Thread Moritz Mack
Thanks so much! Great to see this being picked up again with some good progress.
/ Moritz

On 11.12.22, 15:17, "Herman Mak via dev"  wrote:

Hello Everyone,

*TLDR*

Should we adopt a set of standards that Connector I/Os should adhere to?
Attached is a first version of a Beam I/O Standards guideline that includes 
opinionated best practices across important components of a Connector I/O, 
namely Documentation, Development and Testing.

*The Long Version*

Apache Beam is a unified open-source programming model for both batch and 
streaming. It runs on multiple platform runners and integrates with over 50 
services using individually developed I/O Connectors.

Given that Apache Beam connectors are written by many different developers and 
at varying points in time, they vary in syntax style, documentation 
completeness and testing done. For a new adopter of Apache Beam, that can 
definitely cause some uncertainty.

So should we adopt a set of standards that Connector I/Os should adhere to?
Attached is a first version, in Doc format, of a Beam I/O Standards guideline 
that includes opinionated best practices across important components of a 
Connector I/O, namely Documentation, Development and Testing. And the aim is to 
incorporate this into the documentation and to have it referenced as standards 
for new Connector I/Os (and ideally have existing Connectors upgraded over 
time). If it looks helpful, the immediate next step is that we can convert it 
into a .md as a PR into the Beam repo!

Thanks and looking forward to feedback and discussion,

[PUBLIC] Beam I/O Standards


Herman Mak |
 Customer Engineer, Hong Kong, Google Cloud |
 herman...@google.com |
 +852-3923-5417




