Re: [ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-03 Thread Ahmed Abualsaud via dev
Congratulations!

On Tue, Oct 3, 2023 at 3:48 PM Byron Ellis via dev 
wrote:

> Congrats!
>
> On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev 
> wrote:
>
>> Congratulations Alex!! Definitely well deserved!
>>
>> On Tue, Oct 3, 2023 at 2:57 PM Ahmet Altay via dev 
>> wrote:
>>
>>> Congratulations Alex! Well deserved!
>>>
>>> On Tue, Oct 3, 2023 at 11:54 AM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations Alex!

 On Tue, Oct 3, 2023 at 2:54 PM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> Congrats Alex, this is well deserved!
>
> On Tue, Oct 3, 2023 at 2:50 PM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats, Alex!
>>
>> On Tue, Oct 3, 2023 at 2:49 PM XQ Hu via dev 
>> wrote:
>>
>>> Congratulations, Alex!
>>>
>>> On Tue, Oct 3, 2023 at 2:40 PM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming Alex Van
 Boxel  as our newest PMC member.

 Alex has been with Beam since 2016, very early in the life of the
 project. Alex has contributed code, design ideas, and perhaps most
 importantly been a huge part of organizing Beam Summits, and of course
 presenting at them as well. Alex really brings the ASF community 
 spirit to
 Beam.

 Congratulations Alex and thanks for being a part of Apache Beam!

 Kenn, on behalf of the Beam PMC (which now includes Alex)

>>>


Re: [ANNOUNCE] New committer: Ahmed Abualsaud

2023-08-28 Thread Ahmed Abualsaud via dev
Thanks to the PMC for these responsibilities, and thank you all for guiding
me along this journey. I'm looking forward to helping this community
however I can :)

Best,
Ahmed

On Sun, Aug 27, 2023 at 8:48 PM Reza Rokni via dev 
wrote:

> Congrats Ahmed!
>
> On Fri, Aug 25, 2023 at 2:34 PM John Casey via dev 
> wrote:
>
>> Congrats Ahmed!
>>
>> On Fri, Aug 25, 2023 at 10:43 AM Bjorn Pedersen via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Ahmed! Well deserved!
>>>
>>> On Fri, Aug 25, 2023 at 10:36 AM Yi Hu via dev 
>>> wrote:
>>>
 Congrats Ahmed!

 On Fri, Aug 25, 2023 at 10:11 AM Ritesh Ghorse via dev <
 dev@beam.apache.org> wrote:

> Congrats Ahmed!
>
> On Fri, Aug 25, 2023 at 9:53 AM Kerry Donny-Clark via dev <
> dev@beam.apache.org> wrote:
>
>> Well done Ahmed!
>>
>> On Fri, Aug 25, 2023 at 9:17 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Ahmed!
>>>
>>> On Fri, Aug 25, 2023 at 3:16 AM Jan Lukavský 
>>> wrote:
>>>
 Congrats Ahmed!
 On 8/25/23 07:56, Anand Inguva via dev wrote:

 Congratulations Ahmed :)

 On Fri, Aug 25, 2023 at 1:17 AM Damon Douglas <
 damondoug...@apache.org> wrote:

> Well deserved! Congratulations, Ahmed! I'm so happy for you.
>
> On Thu, Aug 24, 2023, 5:46 PM Byron Ellis via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations!
>>
>> On Thu, Aug 24, 2023 at 5:34 PM Robert Burke 
>> wrote:
>>
>>> Congratulations Ahmed!!
>>>
>>> On Thu, Aug 24, 2023, 4:08 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Ahmed!!

 On Thu, Aug 24, 2023 at 4:06 PM Bruno Volpato via dev <
 dev@beam.apache.org> wrote:

> Congratulations, Ahmed!
>
> Very well deserved!
>
>
> On Thu, Aug 24, 2023 at 6:09 PM XQ Hu via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations, Ahmed!
>>
>> On Thu, Aug 24, 2023, 5:49 PM Ahmet Altay via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a
>>> new committer: Ahmed Abualsaud (ahmedabuals...@apache.org).
>>>
>>> Ahmed has been part of the Beam community since January
>>> 2022, working mostly on IO connectors, and has made many
>>> contributions to make Beam IOs more usable, performant, and
>>> reliable. At the same time, Ahmed has been active on the user
>>> list and at Beam Summit, helping users by sharing his knowledge.
>>>
>>> Considering their contributions to the project over this
>>> timeframe, the Beam PMC trusts Ahmed with the responsibilities 
>>> of a Beam
>>> committer. [1]
>>>
>>> Thank you Ahmed! And we are looking to see more of your
>>> contributions!
>>>
>>> Ahmet, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>>
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>


Re: [ANNOUNCE] New committer: John Casey

2022-07-29 Thread Ahmed Abualsaud via dev
Congrats John, what a great addition!

On Fri, Jul 29, 2022 at 4:56 PM Kerry Donny-Clark via dev <
dev@beam.apache.org> wrote:

> John, you have made a huge impact on the many, many users of Kafka and
> other IOs. This is great recognition of your commitment to Beam.
> Kerry
>
> On Fri, Jul 29, 2022 at 4:46 PM Byron Ellis via dev 
> wrote:
>
>> Congratulations John!
>>
>> On Fri, Jul 29, 2022 at 1:09 PM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats John and welcome! This is well deserved!
>>>
>>> On Fri, Jul 29, 2022 at 4:07 PM Kenneth Knowles  wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming
 a new committer: John Casey (johnca...@apache.org)

 John started contributing to Beam in late 2021. John has quickly become
 our resident expert on KafkaIO - identifying bugs, making enhancements,
 helping users - in addition to a variety of other contributions.

 Considering his contributions to the project over this timeframe, the
 Beam PMC trusts John with the responsibilities of a Beam committer. [1]

 Thank you John! And we are looking to see more of your contributions!

 Kenn, on behalf of the Apache Beam PMC

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


A lesson about DoFn retries

2022-09-01 Thread Ahmed Abualsaud via dev
Hi all,

TLDR: When writing IO connectors, be wary of how bundle retries can affect
the workflow.

A faulty implementation of a step in BigQuery batch loads was discovered
recently. I raised an issue [1] but also wanted to mention it here as a
potentially helpful lesson for those developing new/existing IO connectors.

For those unfamiliar with BigQueryIO file loads, a write that is too large
for a single load job [2] looks roughly like this:


   1. Take input rows and write them to temporary files.
   2. Load temporary files to temporary BQ tables.
   3. Delete temporary files.
   4. Copy the contents of temporary tables over to the final table.
   5. Delete temporary tables.


The faulty part here is that steps 4 and 5 are done in the same DoFn (4 in
processElement and 5 in finishBundle). If a bundle fails in the middle of
table deletion, say an error occurs when deleting the nth table, the whole
bundle will retry and we will perform the copy again. But tables 1 through n
have already been deleted, so we get stuck trying to copy from non-existent
sources.

The solution lies in keeping the retry of each step separate. A good
example of this is in how steps 2 and 3 are implemented [3]. They are
separated into different DoFns and step 3 can start only after step 2
completes successfully. This way, any failure in step 3 does not go back to
affect step 2.
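To make the hazard concrete, here is a plain-Java simulation of the faulty pattern above (illustrative only: no Beam APIs are used, and the table names and the injected transient error are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BundleRetrySimulation {
  // Step 2 produced these temp tables; step 4 copies from them, step 5 deletes them.
  static final List<String> TEMP_TABLES = Arrays.asList("tmp_1", "tmp_2", "tmp_3");

  static String simulate() {
    Set<String> existing = new HashSet<>(TEMP_TABLES); // simulated BigQuery state
    boolean injectFailure = true;
    for (int attempt = 1; attempt <= 2; attempt++) {
      try {
        // processElement (step 4): copy every temp table to the final table.
        for (String t : TEMP_TABLES) {
          if (!existing.contains(t)) {
            return "stuck: retried copy from already-deleted table " + t;
          }
        }
        // finishBundle (step 5): delete temp tables; a transient error hits midway.
        int deleted = 0;
        for (String t : TEMP_TABLES) {
          existing.remove(t);
          if (injectFailure && ++deleted == 2) {
            injectFailure = false;
            throw new RuntimeException("transient error deleting " + t);
          }
        }
        return "ok";
      } catch (RuntimeException e) {
        // The runner retries the whole bundle, but tmp_1 and tmp_2 are gone,
        // so the second attempt's copy phase can never succeed.
      }
    }
    return "gave up";
  }

  public static void main(String[] args) {
    System.out.println(simulate());
    // Prints: stuck: retried copy from already-deleted table tmp_1
  }
}
```

Had the copy and the delete lived in separately retried steps, the retry of the delete would never have re-run the copy against half-deleted state.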

That's all, thanks for your attention :)

Ahmed

[1] https://github.com/apache/beam/issues/22920

[2]
https://github.com/apache/beam/blob/f921a2f1996cf906d994a9d62aeb6978bab09dd5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L100-L105


[3]
https://github.com/apache/beam/blob/149ed074428ff9b5344169da7d54e8ee271aaba1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L437-L454


Re: A lesson about DoFn retries

2022-09-02 Thread Ahmed Abualsaud via dev
Yes, you’re right, I forgot to mention that important piece of information.
Thanks for catching it. The GBK keeps the DoFns separate at pipeline
execution.

From what I’ve learned, fusion is a Dataflow thing; do other runners do it
too?
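For readers following along, the safe pattern Brian describes — a GroupByKey between the side-effecting steps so they land in separately retried fused stages — looks roughly like this (illustrative Beam-style pseudocode, not a tested implementation; the DoFn names are made up):

```
// Stage A (step 4): copy each temp table to the final table.
PCollection<KV<Void, Result>> copied =
    tempTables.apply(ParDo.of(new CopyTempTablesFn()));

// The GroupByKey acts as a fusion break: stage B only starts once stage A's
// output is committed, so a failure in B never re-runs A.
copied
    .apply(GroupByKey.create())
    .apply(ParDo.of(new DeleteTempTablesFn()));  // Stage B (step 5)
```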

On Thu, Sep 1, 2022 at 6:08 PM Brian Hulette  wrote:

> Thanks for sharing the learnings Ahmed!
>
> > The solution lies in keeping the retry of each step separate. A good
> > example of this is in how steps 2 and 3 are implemented [3]. They are
> > separated into different DoFns and step 3 can start only after step 2
> > completes successfully. This way, any failure in step 3 does not go back to
> > affect step 2.
>
> Is it enough just that they're in different DoFns? I thought
> the key was that the DoFns are separated by a GroupByKey, so they will be
> in different fused stages, which are retried independently.
>
> Brian
>
> On Thu, Sep 1, 2022 at 1:43 PM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hi all,
>>
>> TLDR: When writing IO connectors, be wary of how bundle retries can
>> affect the workflow.
>>
>> A faulty implementation of a step in BigQuery batch loads was discovered
>> recently. I raised an issue [1] but also wanted to mention it here as a
>> potentially helpful lesson for those developing new/existing IO connectors.
>>
>> For those unfamiliar with BigQueryIO file loads, a write that is too
>> large for a single load job [2] looks roughly like this:
>>
>>
>>    1. Take input rows and write them to temporary files.
>>    2. Load temporary files to temporary BQ tables.
>>    3. Delete temporary files.
>>    4. Copy the contents of temporary tables over to the final table.
>>    5. Delete temporary tables.
>>
>>
>> The faulty part here is that steps 4 and 5 are done in the same DoFn (4
>> in processElement and 5 in finishBundle). If a bundle fails in the middle
>> of table deletion, say an error occurs when deleting the nth table, the
>> whole bundle will retry and we will perform the copy again. But tables 1
>> through n have already been deleted, so we get stuck trying to copy from
>> non-existent sources.
>>
>> The solution lies in keeping the retry of each step separate. A good
>> example of this is in how steps 2 and 3 are implemented [3]. They are
>> separated into different DoFns and step 3 can start only after step 2
>> completes successfully. This way, any failure in step 3 does not go back to
>> affect step 2.
>>
>> That's all, thanks for your attention :)
>>
>> Ahmed
>>
>> [1] https://github.com/apache/beam/issues/22920
>>
>> [2]
>> https://github.com/apache/beam/blob/f921a2f1996cf906d994a9d62aeb6978bab09dd5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L100-L105
>>
>>
>> [3]
>> https://github.com/apache/beam/blob/149ed074428ff9b5344169da7d54e8ee271aaba1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L437-L454
>>
>>
>>


Re: [Proposal] Change to Default PubsubMessage Coder

2022-12-19 Thread Ahmed Abualsaud via dev
+1 to looking into using RowCoder; this may help avoid creating more
specialized coders in the future (which is mentioned as a pain point in the
issue you linked [1]).

[1] https://github.com/apache/beam/issues/23525#issuecomment-1281294275
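As a side note for readers, the silent-nullification hazard discussed in this thread can be avoided today by pinning the coder explicitly instead of relying on inference (illustrative Beam-style Java sketch, untested; the subscription name is made up):

```
PCollection<PubsubMessage> messages =
    pipeline
        .apply(PubsubIO.readMessagesWithAttributesAndMessageId()
            .fromSubscription("projects/my-project/subscriptions/my-sub"))
        // Pin the richest coder so messageId and orderingKey survive any
        // shuffle rather than being dropped by an inferred default coder.
        .setCoder(PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder.of());
```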

On Tue, Dec 20, 2022 at 3:00 AM Andrew Pilloud via dev 
wrote:

> I think the Dataflow update concern is the biggest concern and as you
> point out that can be easily overcome. I generally believe that users who
> aren't setting the coder don't actually care as long as it works, so
> changing the default across Beam versions seems relatively low
> risk. Another mitigating factor is that both concerns require users to
> actually be using the coder (likely via Reshuffle.viaRandomKey) rather than
> consuming the output in a fused ParDo (which I think is what we would
> recommend).
>
> As a counter proposal: is there something that stops us from using
> RowCoder by default here? I assume all forms of "PubsubMessage" can be
> encoded with RowCoder, it provides flexibility for future changes, and
> PubSub will be able to take advantage of future work to make RowCoder more
> efficient. (If we can't switch to RowCoder, that seems like a useful
> feature request for RowCoder.)
>
> Andrew
>
> On Mon, Dec 19, 2022 at 3:37 PM Evan Galpin  wrote:
>
>> Bump 
>>
>> Any other risks or drawbacks associated with altering the default coder
>> for PubsubMessage to be the most inclusive coder with respect to possible
>> fields?
>>
>> Thanks,
>> Evan
>>
>>
>> On Mon, Dec 12, 2022 at 10:06 AM Evan Galpin  wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to solicit feedback on the notion of using
>>> PubsubMessageWithAttributesAndMessageIdAndOrderingKeyCoder[1] as the
>>> default coder for Pubsub messages instead of the current default of
>>> PubsubMessageWithAttributesCoder.
>>>
>>> Not long ago, support for reading and writing Pubsub messages in Beam
>>> including an OrderingKey was added[2].  Part of this change involved adding
>>> a new Coder for PubsubMessage in order to capture and propagate the
>>> orderingKey[1].  This change illuminated that in cases where the coder type
>>> for PubsubMessage is inferred, it is possible to accidentally and silently
>>> nullify fields like MessageId and OrderingKey in a way that is not at all
>>> obvious to users[3].
>>>
>>> So far two potential drawbacks of this proposal have been identified:
>>> 1. Update compatibility for pipelines using PubsubIO might require users
>>> to explicitly specify the current default coder (
>>> PubsubMessageWithAttributesCoder)
>>> 2. Messages would require a larger number of bytes to store as compared
>>> to the current default (which could again be overcome by users specifying
>>> the current default coder)
>>>
>>> What other potential drawbacks might there be? I look forward to hearing
>>> others' input!
>>>
>>> Thanks,
>>> Evan
>>>
>>> [1]
>>> https://github.com/apache/beam/pull/22216/files#diff-28243ab1f9eef144e45a9f6cb2e07fa1cf53c021ceaf733d92351254f38712fd
>>> [2] https://github.com/apache/beam/pull/22216
>>> [3] https://github.com/apache/beam/issues/23525
>>>
>>


Re: SchemaTransformProvider | Java class naming convention

2022-11-15 Thread Ahmed Abualsaud via dev
Thank you for the informative email Damon!
I am in favor of setting an intuitive naming convention early on to reduce
confusion when Schema Transforms become more widespread. I like the
proposed name in your email and I think this convention should also apply
to the rest of the classes involved here, ie:

(action)SchemaTransformConfiguration
and
(action)SchemaTransform
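A sketch of the full convention in code might look like the following (illustrative only; the BigQuery names follow the example in Damon's email, while the TypedSchemaTransformProvider base class and AutoService registration are assumptions based on how other providers are written):

```
// Provider: builds the transform from its configuration Row.
@AutoService(SchemaTransformProvider.class)
public class BigQueryWriteSchemaTransformProvider
    extends TypedSchemaTransformProvider<BigQueryWriteSchemaTransformConfiguration> { ... }

// Configuration: the language-agnostic, schema-encodable settings.
public abstract class BigQueryWriteSchemaTransformConfiguration { ... }

// Transform: a PCollectionRowTuple-to-PCollectionRowTuple PTransform.
public class BigQueryWriteSchemaTransform implements SchemaTransform { ... }
```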

On Tue, Nov 15, 2022 at 2:50 PM Damon Douglas via dev 
wrote:

> Hello Everyone,
>
> Do we like the following Java class naming convention for
> SchemaTransformProviders [1]?  The proposal is:
>
> (Read|Write)SchemaTransformProvider
>
>
> *For those new to Beam, even if this is your first day, consider
> yourselves a welcome contributor to this conversation.  Below are
> definitions/references and a suggested learning guide to understand this
> email.*
>
> Explanation
>
> The I/O name identifies the Beam I/O [2], and Read or Write identifies a
> read or write PTransform, respectively.
>
> For example, a SchemaTransformProvider [1] implemented for
> BigQueryIO.Write [7] would look like:
>
> BigQueryWriteSchemaTransformProvider
>
>
> And a SchemaTransformProvider implemented for PubSubIO.Read [8] would look
> like:
>
> PubsubReadSchemaTransformProvider
>
>
> Definitions/References
>
> [1] *SchemaTransformProvider*: A way for us to instantiate Beam IO
> transforms using a language agnostic configuration.
> SchemaTransformProvider builds a SchemaTransform[3] from a Beam Row[4] that
> functions as the configuration of that SchemaProvider.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.html
>
> [2] *Beam I/O*: PTransform for reading from or writing to sources and
> sinks.
> https://beam.apache.org/documentation/programming-guide/#pipeline-io
>
> [3] *SchemaTransform*: An interface containing a buildTransform method
> that returns a PCollectionRowTuple[5] to PCollectionRowTuple PTransform.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransform.html
>
> [4] *Row*: A Beam Row is a generic element of data whose properties are
> defined by a Schema[5].
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html
>
> [5] *Schema*: A description of expected field names and their data types.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/Schema.html
>
> [6] *PCollectionRowTuple*: A grouping of Beam Rows[4] into a single
> PInput or POutput tagged by a String name.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/PCollectionRowTuple.html
>
> [7] *BigQueryIO.Write*: A PTransform for writing Beam elements to a
> BigQuery table.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html
>
> [8] *PubSubIO.Read*: A PTransform for reading from Pub/Sub and emitting
> message payloads into a PCollection.
>
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.Read.html
>
> Suggested Learning/Reading to understand this email
>
> 1. https://beam.apache.org/documentation/programming-guide/#overview
> 2. https://beam.apache.org/documentation/programming-guide/#transforms
> (Up to 4.1)
> 3. https://beam.apache.org/documentation/programming-guide/#pipeline-io
> 4. https://beam.apache.org/documentation/programming-guide/#schemas
>


Re: SchemaTransformProvider | Java class naming convention

2022-11-15 Thread Ahmed Abualsaud via dev
>
> Schema-aware transforms are not restricted to I/Os. An arbitrary transform
> can be a Schema-Transform.  Also, designation Read/Write does not map to an
> arbitrary transform. Probably we should try to make this more generic ?
>

Agreed, I suggest keeping everything on the left side of the name unique to
the transform, so that the right side is consistently SchemaTransform |
SchemaTransformProvider | SchemaTransformConfiguration. What do others
think?

> Also, probably what's more important is the identifier of the
> SchemaTransformProvider being unique.

> FWIW, we came up with a similar generic URN naming scheme for
> cross-language transforms:
> https://beam.apache.org/documentation/programming-guide/#1314-defining-a-urn


The URN convention in that link looks good; it may be a good idea to
replace "transform" with "schematransform" in the URN in this case to make a
distinction, e.g.
beam:schematransform:org.apache.beam:kafka_read_with_metadata:v1. I will
mention this in the other thread when I go over the comments in the
Supporting SchemaTransforms doc [1].
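Concretely, that distinction would yield URNs like the following (the first is the example from the message above; the second is an illustrative extrapolation):

```
beam:schematransform:org.apache.beam:kafka_read_with_metadata:v1
beam:schematransform:org.apache.beam:bigquery_write:v1
```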

[1]

 Supporting existing connectors with SchemaTrans...



On Tue, Nov 15, 2022 at 3:41 PM John Casey via dev 
wrote:

> One distinction here is the difference between the URN for a provider /
> transform, and the class name in Java.
>
> We should have a standard for both, but they are distinct
>
> On Tue, Nov 15, 2022 at 3:39 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>>
>>
>> On Tue, Nov 15, 2022 at 11:50 AM Damon Douglas via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hello Everyone,
>>>
>>> Do we like the following Java class naming convention for
>>> SchemaTransformProviders [1]?  The proposal is:
>>>
>>> (Read|Write)SchemaTransformProvider
>>>
>>>
>>> *For those new to Beam, even if this is your first day, consider
>>> yourselves a welcome contributor to this conversation.  Below are
>>> definitions/references and a suggested learning guide to understand this
>>> email.*
>>>
>>> Explanation
>>>
>>> The I/O name identifies the Beam I/O [2], and Read or Write identifies a
>>> read or write PTransform, respectively.
>>>
>>
>> Schema-aware transforms are not restricted to I/Os. An arbitrary
>> transform can be a Schema-Transform.  Also, designation Read/Write does not
>> map to an arbitrary transform. Probably we should try to make this more
>> generic ?
>>
>> Also, probably what's more important is the identifier of the
>> SchemaTransformProvider being unique, not the class name (the latter is
>> guaranteed to be unique if we follow the Java package naming guidelines).
>>
>> FWIW, we came up with a similar generic URN naming scheme for
>> cross-language transforms:
>> https://beam.apache.org/documentation/programming-guide/#1314-defining-a-urn
>>
>> Thanks,
>> Cham
>>
>>
>>> For example, a SchemaTransformProvider [1] implemented for
>>> BigQueryIO.Write [7] would look like:
>>>
>>> BigQueryWriteSchemaTransformProvider
>>>
>>>
>>> And a SchemaTransformProvider implemented for PubSubIO.Read [8] would
>>> look like:
>>>
>>> PubsubReadSchemaTransformProvider
>>>
>>>
>>> Definitions/References
>>>
>>> [1] *SchemaTransformProvider*: A way for us to instantiate Beam IO
>>> transforms using a language agnostic configuration.
>>> SchemaTransformProvider builds a SchemaTransform[3] from a Beam Row[4] that
>>> functions as the configuration of that SchemaProvider.
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.html
>>>
>>> [2] *Beam I/O*: PTransform for reading from or writing to sources and
>>> sinks.
>>> https://beam.apache.org/documentation/programming-guide/#pipeline-io
>>>
>>> [3] *SchemaTransform*: An interface containing a buildTransform method
>>> that returns a PCollectionRowTuple[5] to PCollectionRowTuple PTransform.
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/SchemaTransform.html
>>>
>>> [4] *Row*: A Beam Row is a generic element of data whose properties are
>>> defined by a Schema[5].
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html
>>>
>>> [5] *Schema*: A description of expected field names and their data
>>> types.
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/Schema.html
>>>
>>> [6] *PCollectionRowTuple*: A grouping of Beam Rows[4] into a single
>>> PInput or POutput tagged by a String name.
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/PCollectionRowTuple.html
>>>
>>> [7] *BigQueryIO.Write*: A PTransform for writing Beam elements to a
>>> BigQuery table.
>>>
>>> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html
>>>
>>> [8] *PubSubIO.Read*: A PTransform for reading from Pub/Sub and emitting
>>> message payloads into a PCollection.
>>>

Re: One Configuration, Many File Write Formats

2022-11-17 Thread Ahmed Abualsaud via dev
Thanks for drafting this Damon! I left some comments on the doc. It's
really cool that users can go to one source with a specified file format
(json, avro, xml, csv, parquet) and retrieve the relevant file writing
PTransform. I also like how the same configuration can be re-used for
different file formats. Looking forward to seeing its implementation :)

Ahmed

On Mon, Nov 14, 2022 at 11:53 AM Damon Douglas 
wrote:

> Hello Everyone,
>
> I hope you are doing well.  The following design document proposes, via a
> single configuration, a producer of a Beam File writing transform
> supporting multiple formats.
> bit.ly/fileioschematransformwriteprovider
>
> For those new to Beam and Schema, I've added a final section of
> suggested pre-requisite reading.  It's important that everyone can
> participate in the conversation at any level of experience, even if this is
> the first day learning Beam.  *Please feel invited to let me know
> anything that isn't clear so this document can strive to include everyone.*
>
> *My personal thoughts on the proposal's value*
>
> I've witnessed many smart people and teams argue and divide over the
> subject of a programming language.  Beam multi-language support allows us
> to join transforms written in various languages, currently Java, Python,
> Go, and experimentally TypeScript into a single unified pipeline.  It's
> Beam's schema and processing of these objects, called Rows, that make this
> unification possible.  The aforementioned proposal continues this vision
> for producing file and object system sinks via a single language agnostic
> configuration and supporting provider.
>
> Ada Lovelace dreamed of a machine that processed objects instead of just
> numbers, so that they might produce music and the human things of life.
> Through Beam schema awareness, let us live Ada's dream and join our
> multiple languages so that we may end our strife and produce the valuable
> stuff of life.
>
> Sincerely,
>
> Damon
>


Re: [ANNOUNCE] New committer: Ritesh Ghorse

2022-11-04 Thread Ahmed Abualsaud via dev
Congrats Ritesh!

On Fri, Nov 4, 2022 at 10:29 AM Andy Ye via dev  wrote:

> Congrats Ritesh!
>
> On Fri, Nov 4, 2022 at 9:26 AM Kerry Donny-Clark via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Ritesh, I'm happy to see your hard work and community
>> spirit recognized!
>>
>> On Fri, Nov 4, 2022 at 10:16 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Ritesh!
>>>
>>> On Thu, Nov 3, 2022 at 10:12 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Ritesh! This is definitely well deserved!

 On Thu, Nov 3, 2022 at 8:08 PM Robert Burke  wrote:

> Woohoo! Well done Ritesh! :D
>
> On Thu, Nov 3, 2022, 5:04 PM Anand Inguva via dev 
> wrote:
>
>> Congratulations Ritesh.
>>
>> On Thu, Nov 3, 2022 at 7:51 PM Yi Hu via dev 
>> wrote:
>>
>>> Congratulations Ritesh!
>>>
>>> On Thu, Nov 3, 2022 at 7:23 PM Byron Ellis via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations!

 On Thu, Nov 3, 2022 at 4:21 PM Austin Bennett <
 whatwouldausti...@gmail.com> wrote:

> Congratulations, and Thanks @riteshgho...@apache.org!
>
> On Thu, Nov 3, 2022 at 4:17 PM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats Ritesh!
>>
>> On Thu, Nov 3, 2022 at 4:16 PM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Ritesh Ghorse (riteshgho...@apache.org)
>>>
>>> Ritesh started contributing to Beam in mid-2021 and has
>>> contributed immensely to bringing the Go SDK to fruition, in
>>> addition to
>>> contributions to Java and Python and release validation.
>>>
>>> Considering their contributions to the project over this
>>> timeframe, the Beam PMC trusts Ritesh with the responsibilities of 
>>> a Beam
>>> committer. [1]
>>>
>>> Thank you Ritesh! And we are looking to see more of your
>>> contributions!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>>
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Support existing IOs with Schema Transforms

2022-11-03 Thread Ahmed Abualsaud via dev
Hi all,

There has been an effort to add SchemaTransform capabilities to our
connectors to facilitate the use of multi-lang pipelines. I've drafted a
document below that provides guidelines and examples of how to support IOs
with SchemaTransforms. Please take a look and share your thoughts and
suggestions!

 Supporting existing connectors with SchemaTrans...



Best,
Ahmed


Re: [ANNOUNCE] New committer: Yi Hu

2022-11-09 Thread Ahmed Abualsaud via dev
Congrats Yi!

On Wed, Nov 9, 2022 at 1:33 PM Sachin Agarwal via dev 
wrote:

> Congratulations Yi!
>
> On Wed, Nov 9, 2022 at 10:32 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Yi Hu (y...@apache.org)
>>
>> Yi started contributing to Beam in early 2022. Yi's contributions are
>> very diverse! I/Os, performance tests, Jenkins, support for Schema logical
>> types. Not only code but a very large amount of code review. Yi is also
>> noted for picking up smaller issues that normally would be left on the
>> backburner and filing issues that he finds rather than ignoring them.
>>
>> Considering their contributions to the project over this timeframe, the
>> Beam PMC trusts Yi with the responsibilities of a Beam committer. [1]
>>
>> Thank you Yi! And we are looking to see more of your contributions!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1]
>>
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [PROPOSAL] Preparing for 2.47.0 Release

2023-03-23 Thread Ahmed Abualsaud via dev
Thanks for stepping up Jack!

On Thu, Mar 23, 2023 at 12:43 PM Ahmet Altay via dev 
wrote:

> Thank you Jack!
>
> On Wed, Mar 22, 2023 at 8:39 AM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> Hey all,
>>
>> The next (2.47.0) release branch cut is scheduled for April 5th, 2023,
>> according to
>> the release calendar [1].
>>
>> I will be performing this release. My plan is to cut the branch on that
>> date, and cherrypick release-blocking fixes afterwards, if any.
>>
>> Please help me make sure the release goes smoothly by:
>> - Making sure that any unresolved release blocking issues
>> for 2.47.0 should have their "Milestone" marked as "2.47.0 Release" as
>> soon as possible.
>> - Reviewing the current release blockers [2] and remove the Milestone if
>> they don't meet the criteria at [3].
>>
>> Let me know if you have any comments/objections/questions.
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> [1]
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> [2] https://github.com/apache/beam/milestone/10
>> [3] https://beam.apache.org/contribute/release-blocking/
>>
>> --
>>
>>
>> Jack McCluskey
>> SWE - DataPLS PLAT/ Dataflow ML
>> RDU
>> jrmcclus...@google.com
>>
>>
>>


Re: Refactor BigQuery SchemaTransforms naming

2023-03-03 Thread Ahmed Abualsaud via dev
Thank you Damon, I left a few comments.

On Fri, Mar 3, 2023 at 11:14 AM Damon Douglas via dev 
wrote:

> Hello Everyone,
>
> This PR brings BigQuery Schema transforms in line with the others in terms
> of naming conventions and use of AutoService.
>
> https://github.com/apache/beam/pull/25706
>
> Best,
>
> Damon
>


Re: [ANNOUNCE] New committer: Anand Inguva

2023-04-21 Thread Ahmed Abualsaud via dev
Congrats Anand!

On Fri, Apr 21, 2023 at 3:18 PM Anand Inguva via dev 
wrote:

> Thanks everyone. Really excited to be a part of Beam Committers.
>
> On Fri, Apr 21, 2023 at 3:07 PM XQ Hu via dev  wrote:
>
>> Congratulations, Anand!!!
>>
>> On Fri, Apr 21, 2023 at 2:31 PM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations, Anand!
>>>
>>> On Fri, Apr 21, 2023 at 2:28 PM Valentyn Tymofieiev via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations!

 On Fri, Apr 21, 2023 at 8:19 PM Jan Lukavský  wrote:

> Congrats Anand!
> On 4/21/23 20:05, Robert Burke wrote:
>
> Congratulations Anand!
>
> On Fri, Apr 21, 2023, 10:55 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Woohoo, congrats Anand! This is very well deserved!
>>
>> On Fri, Apr 21, 2023 at 1:54 PM Chamikara Jayalath <
>> chamik...@apache.org> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Anand Inguva (ananding...@apache.org)
>>>
>>> Anand has been contributing to Apache Beam for more than a year and
>>> authored and reviewed more than 100 PRs. Anand has been a core 
>>> contributor
>>> to Beam Python SDK and drove the efforts to support Python 3.10 and 
>>> Python
>>> 3.11.
>>>
>>> Considering their contributions to the project over this timeframe,
>>> the Beam PMC trusts Anand with the responsibilities of a Beam
>>> committer. [1]
>>>
>>> Thank you Anand! And we are looking to see more of your
>>> contributions!
>>>
>>> Cham, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer
>>> /#an-apache-beam-committer
>>>
>>


Re: [ANNOUNCE] New committer: Damon Douglas

2023-04-24 Thread Ahmed Abualsaud via dev
Congrats Damon!

On Mon, Apr 24, 2023 at 5:05 PM Kerry Donny-Clark via dev <
dev@beam.apache.org> wrote:

> Damon, you have done outstanding work to grow and improve Beam and the
> Beam community. Well done, well deserved!
>
> On Mon, Apr 24, 2023 at 4:39 PM XQ Hu via dev  wrote:
>
>> Congrats Damon!!!
>>
>> On Mon, Apr 24, 2023 at 4:34 PM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Damon!
>>>
>>> On Mon, Apr 24, 2023 at 4:03 PM Ahmet Altay via dev 
>>> wrote:
>>>
 Congratulations Damon!

 On Mon, Apr 24, 2023 at 1:00 PM Robert Burke 
 wrote:

> Congratulations Damon!!!
>
> On Mon, Apr 24, 2023, 12:52 PM Kenneth Knowles 
> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Damon Douglas (damondoug...@apache.org)
>>
>> Damon has contributed widely: Beam Katas, playground, infrastructure,
>> and many IO connectors. Damon does lots of code review in addition to 
>> code.
>> (yes, you can review code as a non-committer!)
>>
>> Considering their contributions to the project over this timeframe,
>> the Beam PMC trusts Damon with the responsibilities of a Beam committer. 
>> [1]
>>
>> Thank you Damon! And we are looking to see more of your contributions!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1]
>>
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [VOTE] Release 2.46.0, release candidate #1

2023-04-28 Thread Ahmed Abualsaud via dev
@Danny McCormick  @Reuven Lax
 sorry
it's been a while since you looked into this, but do you remember if the
fix in #25642 <https://github.com/apache/beam/pull/25642> is related
to the recent "ALREADY_EXISTS: The offset is within stream, expected
offset..." errors?

On Fri, Mar 10, 2023 at 7:47 PM Ahmet Altay via dev 
wrote:

> Thank you!
>
> Is there a tracking issue for this known issue? And would the known issues
> section of the release notes link to that?
>
>
> On Fri, Mar 10, 2023 at 11:38 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> We determined that the same issue exists in the 2.45 release, so we are
>> going to continue finalizing the release candidate. Thank you for your
>> patience.
>>
>> Thanks,
>> Danny
>>
>> On Wed, Mar 8, 2023 at 3:15 PM Reuven Lax  wrote:
>>
>>> We are trying to reproduce and debug the issue we saw to validate
>>> whether it was a real regression or not. Will update when we know more.
>>>
>>> On Wed, Mar 8, 2023 at 11:31 AM Danny McCormick <
>>> dannymccorm...@google.com> wrote:
>>>

 @Reuven Lax  found a new potential regression in
 BigQuery I/O, so I have paused the release rollout. I had already pushed
 the Python artifacts and Go tags, but not the Java ones. We have since
 temporarily yanked  the Python release
 and deleted the Go tags, they were live for around an hour. The possible
 regression is in Java, so neither of those releases should be affected, but
 x-lang may not work properly because it depends on versioning. I will
 update this thread with next steps when we know more.

 Thanks,
 Danny
 On Wed, Mar 8, 2023 at 5:59 AM Jan Lukavský  wrote:

> +1 (binding)
>
> Tested Java SDK with Flink and Spark 3 runner.
>
> Thanks,
>  Jan
>
> On 3/8/23 01:53, Valentyn Tymofieiev via dev wrote:
>
> +1. Verified the composition of Python containers and ran Python
> pipelines on Dataflow runner v1 and runner v2.
>
> On Tue, Mar 7, 2023 at 4:11 PM Ritesh Ghorse via dev <
> dev@beam.apache.org> wrote:
>
>> +1 (non-binding)
>> Validated Go SDK quickstart on direct and dataflow runner
>>
>> On Tue, Mar 7, 2023 at 10:54 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> +1 (binding)
>>>
>>> Tested with  https://github.com/Talend/beam-samples/
>>> (Java SDK v8/v11/v17, Spark 3.x runner).
>>>
>>> ---
>>> Alexey
>>>
>>> On 7 Mar 2023, at 07:38, Ahmet Altay via dev 
>>> wrote:
>>>
>>> +1 (binding) - I validated python quickstarts on direct & dataflow
>>> runners.
>>>
>>> Thank you for doing the release!
>>>
>>> On Sat, Mar 4, 2023 at 8:01 AM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 (binding)

 Validated multi-language Java and Python pipelines.

 On Fri, Mar 3, 2023 at 1:59 PM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> > I have encountered a failure in a Python pipeline running with
> Runner v1:
>
> > RuntimeError: Beam SDK base version 2.46.0 does not match
> Dataflow Python worker version 2.45.0. Please check Dataflow worker 
> startup
> logs and make sure that correct version of Beam SDK is installed.
>
> > We should understand why Python ValidatesRunner tests (which
> have passed)  didn't catch this error.
>
> > This can be remediated in Dataflow containers without  changes
> to the release candidate.
>
> Good catch! I've kicked off a release to fix this, it should be
> done later this evening - I won't be available when it completes, but 
> I
> would expect it to be around 5:00 PST.
>
> On Fri, Mar 3, 2023 at 3:49 PM Danny McCormick <
> dannymccorm...@google.com> wrote:
>
>> Hey Reuven, could you provide some more context on the bug/why it
>> is important? Does it meet the standard in
>> https://beam.apache.org/contribute/release-guide/#7-triage-release-blocking-issues-in-github?
>>
>>
>> The release branch was cut last Wednesday, so that is why it is
>> not included.
>>
>
 Seems like this was a revert of a previous commit that was also not
 included in the 2.46.0 release branch (
 https://github.com/apache/beam/pull/25627) ?

 If so we might not need a new RC but good to confirm.

 Thanks,
 Cham


>> On Fri, Mar 3, 2023 at 3:24 PM Reuven Lax 
>> wrote:
>>
>>> If possible, I would like to see if we could include
>>> https://github.com/apache/beam/pull/25642 as we believe this
>>> bug has been impacting multiple users. This was 

Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Ahmed Abualsaud via dev
Hey everyone,

I was looking at how we use SchemaTransforms in our expansion service. From
what I see, there may be a redundant step in developing SchemaTransforms.
Currently, we have 3 pieces:
- SchemaTransformProvider [1]
- A configuration object
- SchemaTransform [2]

The API is generally used like this:
1. The SchemaTransformProvider takes a configuration object and returns a
SchemaTransform
2. The SchemaTransform is used to build a PTransform according to the
configuration

In these steps, the SchemaTransform class seems unnecessary. We can combine
the two steps if we have SchemaTransformProvider return the PTransform
directly.

We can then remove the SchemaTransform class as it will be obsolete. This
should be safe to do; the only place it's used in our API is here [3], and
that can be simplified if we make this change (we'd just trim `
.buildTransform()` off the end as `provider.from(configRow)` will directly
return the PTransform).
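To make the before/after concrete, here is a minimal sketch using simplified stand-in interfaces (a String plays the role of the PTransform being built; these are hypothetical stand-ins, not the real org.apache.beam.sdk classes):

```java
public class ProviderSketch {
  // Stand-in for org.apache.beam.sdk.schemas.transforms.SchemaTransform.
  interface SchemaTransform {
    String buildTransform();
  }

  // Current shape: provider -> SchemaTransform -> PTransform.
  interface SchemaTransformProvider {
    SchemaTransform from(String config);
  }

  // Proposed shape: provider -> PTransform directly.
  interface SimplifiedProvider {
    String from(String config);
  }

  // Today the expansion service must tack .buildTransform() onto the result:
  static String currentApi(String config) {
    SchemaTransformProvider provider = c -> () -> "ReadFromPubsub(" + c + ")";
    return provider.from(config).buildTransform();
  }

  // With the proposal, from(config) already yields the transform:
  static String proposedApi(String config) {
    SimplifiedProvider provider = c -> "ReadFromPubsub(" + c + ")";
    return provider.from(config);
  }
}
```

The only user-visible difference is that callers drop the trailing `.buildTransform()` call; the configuration object and provider lookup stay exactly as they are.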

I'd like to first mention that I was not involved in the design process of
this API so I may be missing some information on why it was set up this way.

A few developers have already raised questions about this seemingly
unnecessary boilerplate involved in making a Java transform portable. My
assumption is that it was designed to follow the pattern
of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
SchemaIO[5] -> PTransform. However, with the newer SchemaTransformProvider
API, we dropped a few methods and reduced the SchemaTransform class to have
a generic buildTransform() method. See the example of
PubsubReadSchemaTransformProvider [6], where the SchemaTransform interface
and buildTransform method are implemented just to satisfy the requirement
that SchemaTransformProvider::from return a SchemaTransform.

I'm bringing this up because if we are looking to encourage contribution to
cross-language use cases, we should make it simpler and less convoluted to
develop portable transforms.

There are a number of SchemaTransforms already developed, but applying
these changes to them should be straightforward. If people think this is a
good idea, I can open a PR and implement them.

Best,
Ahmed

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
[3]
https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
[4]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
[5]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
[6]
https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137


Re: Proposal to reduce the steps to make a Java transform portable

2023-06-22 Thread Ahmed Abualsaud via dev
Thank you all for your input. I have a PR for the changes I mentioned in my
initial email: https://github.com/apache/beam/pull/27202. Please review
when you get a chance!

> perhaps we should consider just going to something Avro for portable
coding rather than something custom

Did you mean using some Avro object (GenericRecord?) instead of Beam Row
elements? We would still run into the problem Cham mentioned earlier (of
making sure existing PTransform inputs/outputs are compatible with
cross-language-valid types).

Ahmed

On Tue, May 30, 2023 at 10:53 PM Byron Ellis  wrote:

> Sure, I get that… though perhaps we should consider just going to
> something Avro for portable coding rather than something custom.
>
> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath 
> wrote:
>
>> Input/output PCollection types at least have to be portable Beam types
>> [1] for cross-language to work.
>>
>> I think we restricted schema-aware transforms to PCollection since
>> Row was expected to be an efficient replacement for arbitrary portable Beam
>> types (not sure how true that is in practice currently).
>>
>> Thanks,
>> Cham
>>
>> [1]
>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>
>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis 
>> wrote:
>>
>>> Is it actually necessary for a PTransform that is configured via the
>>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>>> separate concerns and unnecessarily limiting.
>>>
>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
>>> wrote:
>>>
>>>> +1 for the simplification.
>>>>
>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> Yeah. Essentially one needs to do (1) name the arguments and (2)
>>>>> implement the transform. Hopefully (1) could be done in a concise way that
>>>>> allows for easy consumption from both Java and cross-language.
>>>>>
>>>>
>>>> +1 but I think the hard part today is to convert existing PTransforms
>>>> to be schema-aware transform compatible (for example, change input/output
>>>> types and make sure parameters take Beam Schema compatible types). But this
>>>> makes sense for new transforms.
>>>>
>>>>
>>>>
>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
>>>>> wrote:
>>>>>
>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>> auto-generate the associated builder on the PTransform? Either way, more
>>>>>> DRY.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> +1 to this simplification, it's a historical artifact that provides
>>>>>>> no value.
>>>>>>>
>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>> ideally with something as simple as adding a decorator to the PTransform
>>>>>>> itself.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>> service. From what I see, there may be a redundant step in developing
>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>> - A configuration object
>>>>>>>> - SchemaTransform [2]
>>>>>>>>
>>>>>>>> The API is generally used like this:
>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>> returns a SchemaTransform
>>>>>>>> 2. The SchemaTransform is used to build a PTransform according to
>>>>>>>> the configuration

Re: Proposal to reduce the steps to make a Java transform portable

2023-06-29 Thread Ahmed Abualsaud via dev
Can someone take a quick look at https://github.com/apache/beam/pull/27202?
If things look good, let's try getting it in before the release cut as I'm
also updating our cross-language documentation and would like to include
these changes.

Thank you,
Ahmed

On Thu, Jun 22, 2023 at 8:06 PM Reuven Lax  wrote:

> The goal was to make schema transforms as efficient as hand-written
> coders. We found the avro encoding/decoding to often be quite inefficient,
> which is one reason we didn't use it for schemas.
>
> Our schema encoding is internal to Beam though, and not suggested for use
> external to a pipeline. For sources or sinks we still recommend using Avro
> (or proto).
>
> On Thu, Jun 22, 2023 at 4:14 PM Robert Bradshaw 
> wrote:
>
>> On Thu, Jun 22, 2023 at 2:19 PM Ahmed Abualsaud <
>> ahmedabuals...@google.com> wrote:
>>
>>> Thank you all for your input. I have a PR for the changes I mentioned in
>>> my initial email: https://github.com/apache/beam/pull/27202. Please
>>> review when you get a chance!
>>>
>>> > perhaps we should consider just going to something Avro for portable
>>> coding rather than something custom
>>>
>>> Did you mean using some Avro object (GenericRecord?) besides Beam Row
>>> elements? We would still run into the problem Cham mentioned earlier (of
>>> making sure existing PTransform inputs/outputs are compatible with
>>> cross-language-valid types).
>>>
>>
>> I don't remember why Avro was rejected in favor of our own encoding
>> format, but it probably doesn't make sense to revisit that without
>> understanding the full history. I do know that avro versioning and diamond
>> dependencies cause a lot of pain for users and there's a concerted effort
>> to remove Avro from Beam core altogether.
>>
>> In any case, this is quite orthogonal to the proposal here which we
>> should move forward on.
>>
>>
>>> On Tue, May 30, 2023 at 10:53 PM Byron Ellis 
>>> wrote:
>>>
>>>> Sure, I get that… though perhaps we should consider just going to
>>>> something Avro for portable coding rather than something custom.
>>>>
>>>> On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> Input/output PCollection types at least have to be portable Beam types
>>>>> [1] for cross-language to work.
>>>>>
>>>>> I think we restricted schema-aware transforms to PCollection
>>>>> since Row was expected to be an efficient replacement for arbitrary
>>>>> portable Beam types (not sure how true that is in practice currently).
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>>>>>
>>>>> On Tue, May 30, 2023 at 1:54 PM Byron Ellis 
>>>>> wrote:
>>>>>
>>>>>> Is it actually necessary for a PTransform that is configured via the
>>>>>> Schema mechanism to also be one that uses RowCoder? Those strike me as 
>>>>>> two
>>>>>> separate concerns and unnecessarily limiting.
>>>>>>
>>>>>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> +1 for the simplification.
>>>>>>>
>>>>>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw <
>>>>>>> rober...@google.com> wrote:
>>>>>>>
>>>>>>>> Yeah. Essentially one needs to do (1) name the arguments and (2)
>>>>>>>> implement the transform. Hopefully (1) could be done in a concise way 
>>>>>>>> that
>>>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>
>>>>>>>
>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>> PTransforms to be schema-aware transform compatible (for example, change
>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>> types). But this makes sense for new transforms.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On

[Design Doc] Auto-generating SchemaTransform wrappers

2023-12-04 Thread Ahmed Abualsaud via dev
Hey everyone,

I've written a design doc for automatically generating SchemaTransform
wrappers. The document is focused on generating Python SDK wrappers, but
the framework would be generic enough to be easily applicable to other
SDKs.

I'd like to share this with you all and gather any feedback. If successful,
the same approach can be extended to any SDK that supports cross-language
schema-aware transforms.

https://s.apache.org/autogen-wrappers

Thank you!
Ahmed


Re: Bigquery Connector Rate limits

2024-02-22 Thread Ahmed Abualsaud via dev
Hey Taher,

Regarding the first question about what API Beam uses, that depends on the
BigQuery method you set in the connector's configuration. We have 4
different write methods, and a high-level description of each can be found
in the documentation:
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html.
At this point in time, we discourage using the streaming inserts API and
recommend file loads or Storage Write API instead.

For the second question, yes there is a chance you can hit the maximum
quota. When this happens, Beam will just wait a little then retry the write
operation. FYI the Storage Write API quota [1] limits to 3gb/s per project,
compared to streaming insert's 1gb/s [2].

[1] https://cloud.google.com/bigquery/quotas#write-api-limits
[2] https://cloud.google.com/bigquery/quotas#streaming_inserts
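As a rough illustration of that guidance, the choice between methods can be sketched like this (the enum names mirror BigQueryIO.Write.Method, but the recommend() helper is a hypothetical illustration, not part of Beam):

```java
public class WriteMethodSketch {
  // Names mirror BigQueryIO.Write.Method in the Beam Java SDK.
  enum Method { FILE_LOADS, STREAMING_INSERTS, STORAGE_WRITE_API, STORAGE_API_AT_LEAST_ONCE }

  // Hypothetical helper reflecting the guidance above: batch jobs use file
  // loads; streaming jobs use the Storage Write API (which also offers a
  // cheaper at-least-once mode); streaming inserts are discouraged.
  static Method recommend(boolean streamingPipeline, boolean needExactlyOnce) {
    if (!streamingPipeline) {
      return Method.FILE_LOADS;
    }
    return needExactlyOnce ? Method.STORAGE_WRITE_API : Method.STORAGE_API_AT_LEAST_ONCE;
  }
}
```

In the real connector the method is set explicitly on the transform's configuration rather than inferred, but the trade-offs are the ones sketched here.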

On Thu, Feb 22, 2024 at 8:57 AM Taher Koitawala  wrote:

> Hi All,
>   I want to ask some questions regarding sinking a very high-volume
> stream to BigQuery.
>
> I will read messages from a Pub/Sub topic and write to BigQuery. In this
> streaming job I am worried about hitting the BigQuery streaming inserts
> limit of 1 GB per second on streaming API writes.
>
> I am firstly unsure whether Beam uses that API or writes files to a temp
> directory and commits on intervals, which brings me to another question: do
> I have to do windowing to save myself from hitting the 1 GB per second
> limit?
>
> Please advise. Thanks
>


Re: Proposal to reduce the steps to make a Java transform portable

2024-03-07 Thread Ahmed Abualsaud via dev
gt;>>>>>> allows for easy consumption from both Java and cross-language.
>>>>>>>>>
>>>>>>>>
>>>>>>>> +1 but I think the hard part today is to convert existing
>>>>>>>> PTransforms to be schema-aware transform compatible (for example, 
>>>>>>>> change
>>>>>>>> input/output types and make sure parameters take Beam Schema compatible
>>>>>>>> types). But this makes sense for new transforms.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis <
>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or perhaps the other way around? If you have a Schema we can
>>>>>>>>>> auto-generate the associated builder on the PTransform? Either way, 
>>>>>>>>>> more
>>>>>>>>>> DRY.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 to this simplification, it's a historical artifact that
>>>>>>>>>>> provides no value.
>>>>>>>>>>>
>>>>>>>>>>> I would love it if we also looked into ways to auto-generate the
>>>>>>>>>>> SchemaTransformProvider (e.g. via introspection if a transform 
>>>>>>>>>>> takes a
>>>>>>>>>>> small number of arguments, or uses the standard builder pattern...),
>>>>>>>>>>> ideally with something as simple as adding a decorator to the 
>>>>>>>>>>> PTransform
>>>>>>>>>>> itself.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I was looking at how we use SchemaTransforms in our expansion
>>>>>>>>>>>> service. From what I see, there may be a redundant step in 
>>>>>>>>>>>> developing
>>>>>>>>>>>> SchemaTransforms. Currently, we have 3 pieces:
>>>>>>>>>>>> - SchemaTransformProvider [1]
>>>>>>>>>>>> - A configuration object
>>>>>>>>>>>> - SchemaTransform [2]
>>>>>>>>>>>>
>>>>>>>>>>>> The API is generally used like this:
>>>>>>>>>>>> 1. The SchemaTransformProvider takes a configuration object and
>>>>>>>>>>>> returns a SchemaTransform
>>>>>>>>>>>> 2. The SchemaTransform is used to build a PTransform according
>>>>>>>>>>>> to the configuration
>>>>>>>>>>>>
>>>>>>>>>>>> In these steps, the SchemaTransform class seems unnecessary. We
>>>>>>>>>>>> can combine the two steps if we have SchemaTransformProvider 
>>>>>>>>>>>> return the
>>>>>>>>>>>> PTransform directly.
>>>>>>>>>>>>
>>>>>>>>>>>> We can then remove the SchemaTransform class as it will be
>>>>>>>>>>>> obsolete. This should be safe to do; the only place it's used in 
>>>>>>>>>>>> our API is
>>>>>>>>>>>> here [3], and that can be simplified if we make this change (we'd 
>>>>>>>>>>>> just trim
>>>>>>>>>>>> `.buildTransform()` off the end as `provider.from(configRow)`
>>>>>>>>>>>> will directly return the PTransform).
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to first mention that I was not involved in the design
>>>>>>>>>>>> process of this API so I may be missing some information on why it 
>>>>>>>>>>>> was set
>>>>>>>>>>>> up this way.
>>>>>>>>>>>>
>>>>>>>>>>>> A few developers already raised questions about how there's
>>>>>>>>>>>> seemingly unnecessary boilerplate involved in making a Java 
>>>>>>>>>>>> transform
>>>>>>>>>>>> portable. I wasn't involved in the design process of this API so I 
>>>>>>>>>>>> may be
>>>>>>>>>>>> missing some information, but my assumption is this was designed 
>>>>>>>>>>>> to follow
>>>>>>>>>>>> the pattern of the previous iteration of this API (SchemaIO):
>>>>>>>>>>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However,
>>>>>>>>>>>> with the newer SchemaTransformProvider API, we dropped a few 
>>>>>>>>>>>> methods and
>>>>>>>>>>>> reduced the SchemaTransform class to have a generic 
>>>>>>>>>>>> buildTransform()
>>>>>>>>>>>> method. See the example of PubsubReadSchemaTransformProvider [6], 
>>>>>>>>>>>> where the
>>>>>>>>>>>> SchemaTransform interface and buildTransform method are
>>>>>>>>>>>> implemented just to satisfy the requirement that
>>>>>>>>>>>> SchemaTransformProvider::from return a SchemaTransform.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm bringing this up because if we are looking to encourage
>>>>>>>>>>>> contribution to cross-language use cases, we should make it 
>>>>>>>>>>>> simpler and
>>>>>>>>>>>> less convoluted to develop portable transforms.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of SchemaTransforms already developed, but
>>>>>>>>>>>> applying these changes to them should be straightforward. If 
>>>>>>>>>>>> people think
>>>>>>>>>>>> this is a good idea, I can open a PR and implement them.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Ahmed
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>>>>>>>>>>>> [3]
>>>>>>>>>>>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>>>>>>>>>>>> [4]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>>>>>>>>>>>> [5]
>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>>>>>>>>>>>> [6]
>>>>>>>>>>>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>>>>>>>>>>>
>>>>>>>>>>>


Supporting Dynamic Destinations in a portable context

2024-03-27 Thread Ahmed Abualsaud via dev
Hey all,

There have been some conversations lately about how best to enable dynamic
destinations in a portable context. Usually, this comes up for
cross-language transforms and more recently for Beam YAML.

I've started a short doc outlining some routes we could take. The purpose
is to establish a good standard for supporting dynamic destinations with
portability, one that can be applied to most use cases and IOs. Please take
a look and add any thoughts!

https://s.apache.org/portable-dynamic-destinations

Best,
Ahmed


Re: Supporting Dynamic Destinations in a portable context

2024-03-27 Thread Ahmed Abualsaud via dev
> This does seem like the best compromise, though I think there will still
end up being performance issues. A common pattern I've seen is that there
is a long common prefix to the dynamic destination followed the dynamic
component. e.g. the destination might be
long/common/path/to/destination/files/. In this case, the
prefix is often much larger than messages themselves and is what gets
effectively encoded in the lambda.

The last option is meant to address this issue. The prefix is specified in
the configuration instead of being present with each message. The "K" in KV
will contain just the part(s) to be appended to the prefix (via string
substitution). This way only the minimal/necessary destination information
gets encoded with the message.
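A minimal sketch of that substitution idea (the expand() helper and the {field} pattern syntax are hypothetical illustrations, not a Beam API):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The long common prefix lives once in the transform configuration as a
// pattern; each record carries only the small key used to fill it in.
public class DestinationSketch {
  private static final Pattern FIELD = Pattern.compile("\\{(\\w+)\\}");

  static String expand(String pattern, Map<String, String> key) {
    Matcher m = FIELD.matcher(pattern);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      // Replace each {field} with the matching value from the record's key.
      m.appendReplacement(out, Matcher.quoteReplacement(key.get(m.group(1))));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

With this shape, a record destined for "long/common/path/to/destination/files/users" only needs to ship the key {table: users}, while the prefix is encoded once in the configuration.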

On Wed, Mar 27, 2024 at 12:12 PM Reuven Lax via dev 
wrote:

> This does seem like the best compromise, though I think there will still
> end up being performance issues. A common pattern I've seen is that there
> is a long common prefix to the dynamic destination followed the dynamic
> component. e.g. the destination might be
> long/common/path/to/destination/files/. In this case, the
> prefix is often much larger than messages themselves and is what gets
> effectively encoded in the lambda.
>
> I'm not entirely sure how to address this in a portable context. We might
> simply have to accept the extra overhead when going cross language.
>
> Reuven
>
> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Thanks for putting this together, it will be a really useful feature to
>> have.
>>
>> I am in favor of the string-pattern approaches. I think we need to
>> support both the {record=..., dest_info=...} and the elide-fields
>> approaches, as the former is nicer when one has a fixed representation for
>> the output record (e.g. a proto or avro schema) and the flattened form for
>> ease of use in more free-form contexts (e.g. when producing records from
>> YAML and SQL).
>>
>> Also left some comments on the doc.
>>
>>
>> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey all,
>>>
>>> There have been some conversations lately about how best to enable
>>> dynamic destinations in a portable context. Usually, this comes up for
>>> cross-language transforms and more recently for Beam YAML.
>>>
>>> I've started a short doc outlining some routes we could take. The
>>> purpose is to establish a good standard for supporting dynamic destinations
>>> with portability, one that can be applied to most use cases and IOs. Please
>>> take a look and add any thoughts!
>>>
>>> https://s.apache.org/portable-dynamic-destinations
>>>
>>> Best,
>>> Ahmed
>>>
>>