Re: [ANNOUNCE] New Committer: Svetak Sundhar

2024-02-12 Thread Byron Ellis via dev
Congrats Svetak!

On Mon, Feb 12, 2024 at 6:57 PM Shunping Huang via dev 
wrote:

> Congratulations, Svetak!
>
> On Mon, Feb 12, 2024 at 9:50 PM XQ Hu via dev  wrote:
>
>> Great job, Svetak! Thanks for all your contributions to Beam!!!
>>
>> On Mon, Feb 12, 2024 at 4:44 PM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats, Svetak!
>>>
>>> On Mon, Feb 12, 2024 at 11:20 AM Kenneth Knowles 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Svetak Sundhar (sve...@apache.org).

 Svetak has been with Beam since 2021. Svetak has contributed code to
 many areas of Beam, including notebooks, Beam Quest, dataframes, and IOs.
 We also want to especially highlight the effort Svetak has put into
 improving Beam's documentation, participating in release validation, and
 evangelizing Beam.

 Considering his contributions to the project over this timeframe, the
 Beam PMC trusts Svetak with the responsibilities of a Beam committer. [1]

 Thank you Svetak! And we are looking to see more of your contributions!

 Kenn, on behalf of the Apache Beam PMC

 [1]

 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer

>>>


Re: Implementing tuple type support for ClickHouse connector

2023-12-04 Thread Byron Ellis via dev
Indeed, hence me cheaping out and just tagging the base type with the one I
want (in my case things like "UUID" which are just byte arrays). :-)

In any case, the (very very very early) draft PR is here:
https://github.com/apache/beam/pull/29569
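
A rough sketch of what that kind of tagging can look like with schema field
options, in case it's useful to anyone following along. The option name and
field layout here are made up for illustration, and it assumes the
Schema.Options builder API rather than anything ClickHouse-specific:

```
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.Field;
import org.apache.beam.sdk.schemas.Schema.FieldType;

public class TaggedFieldExample {
  // Tag a BYTES field with the original ClickHouse type ("UUID") via a field
  // option. The option name "clickhouse:original_type" is hypothetical.
  static Schema schemaWithTaggedUuid() {
    Schema.Options tag =
        Schema.Options.builder()
            .setOption("clickhouse:original_type", FieldType.STRING, "UUID")
            .build();
    return Schema.builder()
        .addField(Field.of("id", FieldType.BYTES).withOptions(tag))
        .build();
  }
}
```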

On Mon, Dec 4, 2023 at 9:26 AM Robert Bradshaw via dev 
wrote:

> Note that Logical types are not as portable (e.g. for cross-language use).
>
> On Mon, Dec 4, 2023 at 9:18 AM Alexey Romanenko
>  wrote:
> >
> > Did you, by chance, take a look at
> org.apache.beam.sdk.schemas.Schema.LogicalType? Could it be helpful for your
> case?
> >
> > On 4 Dec 2023, at 12:02, Mark Zitnik  wrote:
> >
> > Yes, I know it is done in org.apache.beam.sdk.io.clickhouse.TableSchema
> (I did it for several other types), but since Tuple is a nested type that can
> hold any number of other ClickHouse types, I was wondering what the best
> type from the Apache Beam side would be to implement it.
> >
> > Mark
> >
> > On Mon, Dec 4, 2023 at 12:24 PM Alexey Romanenko <
> aromanenko@gmail.com> wrote:
> >>
> >> Hi Mark,
> >>
> >> What do you mean by “support” in this case? To map this ClickHouse data
> type to a Beam Schema data type as it’s done in 
> org.apache.beam.sdk.io.clickhouse.TableSchema
> for other types or something else?
> >>
> >> —
> >> Alexey
> >>
> >> On 3 Dec 2023, at 10:35, Mark Zitnik  wrote:
> >>
> >> Hi Team,
> >>
> >> I am one of the committers of the ClickHouse integration team.
> >> I need to add support for Tuple in the ClickHouse connector for Apache
> Beam. What is the best approach for implementing that? Tuple (
> https://clickhouse.com/docs/en/sql-reference/data-types/tuple) is a
> nested data type (
> https://clickhouse.com/docs/en/sql-reference/data-types#data_types).
> >> If you could point me to a reference in other connectors, that would be helpful.
> >>
> >> Thanks
> >> -MZ
> >>
> >>
> >>
> >>
> >
>


Re: Implementing tuple type support for ClickHouse connector

2023-12-04 Thread Byron Ellis via dev
I'm not all that familiar with ClickHouse, but from the docs it looks like
Tuples are basically a sort of anonymous structure, so I'm guessing you'll
want to encode one as a Row itself... that's a lossy conversion, though, so
probably some sort of LogicalType that wraps the Row object is the "right
answer." I'm looking at something similar right now for adding Iceberg
support, though I may cheap out and just use field options to capture the
original Iceberg type.
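
To make that concrete, here's roughly the shape such a wrapper could take. The
identifier and argument choices are hypothetical; it just fills in the
Schema.LogicalType interface, and the actual ClickHouse mapping would of course
live in TableSchema:

```
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.values.Row;

/** Sketch: a ClickHouse Tuple carried as a Row plus a logical-type tag. */
public class ClickHouseTupleType implements Schema.LogicalType<Row, Row> {
  private final Schema tupleSchema;    // one Beam field per Tuple element
  private final String clickHouseType; // e.g. "Tuple(String, UInt64)"

  public ClickHouseTupleType(Schema tupleSchema, String clickHouseType) {
    this.tupleSchema = tupleSchema;
    this.clickHouseType = clickHouseType;
  }

  @Override
  public String getIdentifier() {
    return "clickhouse:Tuple"; // made-up URN for this sketch
  }

  @Override
  public FieldType getArgumentType() {
    return FieldType.STRING;
  }

  @Override
  public String getArgument() {
    // Keep the original ClickHouse type string so the sink can round-trip it.
    return clickHouseType;
  }

  @Override
  public FieldType getBaseType() {
    return FieldType.row(tupleSchema); // the lossy part: just a Row on the Beam side
  }

  @Override
  public Row toBaseType(Row input) {
    return input;
  }

  @Override
  public Row toInputType(Row base) {
    return base;
  }
}
```

A field would then be declared with something like
FieldType.logicalType(new ClickHouseTupleType(elementSchema, "Tuple(String, UInt64)")),
which is what would let the connector recover the original Tuple definition later.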

On Mon, Dec 4, 2023 at 3:03 AM Mark Zitnik  wrote:

> Yes, I know it is done in org.apache.beam.sdk.io.clickhouse.TableSchema
> (I did it for several other types), but since Tuple is a nested type that can
> hold any number of other ClickHouse types, I was wondering what the best
> type from the Apache Beam side would be to implement it.
>
> Mark
>
> On Mon, Dec 4, 2023 at 12:24 PM Alexey Romanenko 
> wrote:
>
>> Hi Mark,
>>
>> What do you mean by “support” in this case? To map this ClickHouse data
>> type to a Beam Schema data type as it’s done in
>> org.apache.beam.sdk.io.clickhouse.TableSchema for other types or
>> something else?
>>
>> —
>> Alexey
>>
>> On 3 Dec 2023, at 10:35, Mark Zitnik  wrote:
>>
>> Hi Team,
>>
>> I am one of the committers of the ClickHouse integration team.
>> I need to add support for Tuple in the ClickHouse connector for
>> Apache Beam. What is the best approach for implementing that? Tuple (
>> https://clickhouse.com/docs/en/sql-reference/data-types/tuple) is a
>> nested data type (
>> https://clickhouse.com/docs/en/sql-reference/data-types#data_types).
>> If you could point me to a reference in other connectors, that would be helpful.
>>
>> Thanks
>> -MZ
>>
>>
>>
>>
>>


Re: [YAML] Aggregations

2023-10-19 Thread Byron Ellis via dev
On Thu, Oct 19, 2023 at 9:28 AM Robert Bradshaw  wrote:

> On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis  wrote:
> >
> > Rill is definitely SQL-oriented but I think that's going to be the most
> common. Dataframes are explicitly modeled on the relational approach so
> that's going to look a lot like SQL,
>
> I think pretty much any approach that fits here is going to be
> relational, meaning you choose a set of columns to group on, a set of
> columns to aggregate, and how to aggregate. The big open question is
> what syntax to use for the "how."
>
> Dataframe aggregation is probably a good example to look at. Here we
> have panda and R in particular as concrete instances. It should also
> be easy to support different aggregations over different (or the same)
> columns. Pandas can take a list of (or mapping to) functions in its
> groupby().agg(). R doesn't seem to make this very easy...
>

In R it's pretty similar if you're using "the tidyverse" (a colloquial name
for the R packages on which so far as I know pandas is based). Using dplyr
it would be something along the lines of

table %>% group_by(ColumnName) %>% summarize(OtherColumnAvg =
mean(OtherColumn))

(R is much more lazily evaluated than Python so the "mean(OtherColumn)" is
evaluated in the context of "table" by summarize. You are not wrong in
thinking "this is difficult to compile" :-)



>
> > which leaves us with S-style formulas (which I like but are pretty niche)
>
> I'm curious, what are these?
>

The other way of doing the above in R, which comes from S-PLUS is to use
formula constructs. It would look something like this:

aggregate(OtherColumn ~ ColumnName,data=table,mean)

~ is the "formula operator" so it creates something of type formula.

There are even older pivot table approaches as well, but that's probably
the most common pattern outside of the tidyverse. It would be technically
feasible to also do aggregate(mean(OtherColumn) + median(YetAnotherColumn)
~ ColumnName, data=table), though I can't remember if there's a variant that
supports that construct.
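
For comparison with the straw-man earlier in the thread, the schema-aware Java
API linked further down already expresses it fairly directly; a rough sketch
using Robert's field names, assuming "cost" is a double field:

```
import org.apache.beam.sdk.schemas.transforms.Group;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class GroupExample {
  // Roughly "SELECT field1, field2, sum(cost) AS total_cost, max(cost) AS max_cost
  //          FROM table GROUP BY field1, field2"
  static PCollection<Row> aggregate(PCollection<Row> table) {
    return table.apply(
        Group.<Row>byFieldNames("field1", "field2")
            .aggregateField("cost", Sum.ofDoubles(), "total_cost")
            .aggregateField("cost", Max.ofDoubles(), "max_cost"));
  }
}
```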

>
> >  and I guess pivot tables coming from the spreadsheet world. Does make
> me wonder what Rails' ORM looks like these days (I last used v4), it had
> some aggregation support and was pretty declarative...
> >
> > On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw 
> wrote:
> >>
> >> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis 
> wrote:
> >> >
> >> > Is it worth taking a look at similar prior art in the space?
> >>
> >> +1. Pointers welcome.
> >>
> >> > The first one that comes to mind is Transform, but with the dbt labs
> acquisition that spec is a lot harder to find. Rill is pretty similar
> though.
> >>
> >> Rill seems to be very SQL-based.
> >>
> >> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >> >>
> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> >> feature for even writing a WordCount is the ability to do
> Aggregations
> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> >> CombineValues), we're eschewing KVs in the notion of more schema'd
> >> data (which has some precedent in our other languages, see the links
> >> >> below). The key components the user needs to specify are (1) the key
> >> >> fields on which the grouping will take place, (2) the fields
> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> >> fn to use.
> >> >>
> >> >> A straw-man example could be something like
> >> >>
> >> >> type: Aggregating
> >> >> config:
> >> >>   key: [field1, field2]
> >> >>   aggregating:
> >> >> total_cost:
> >> >>   fn: sum
> >> >>   value: cost
> >> >> max_cost:
> >> >>   fn: max
> >> >>   value: cost
> >> >>
> >> >> This would basically correspond to the SQL expression
> >> >>
> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as
> max_cost
> >> >> from table GROUP BY field1, field2"
> >> >>
> >> >> (though I'm not requiring that we use this as an implementation
> >> >> strategy). I do not think we need a separate (non aggregating)
> >> >> Grouping operation, this can be accomplished by having a concat-style
> >> >> combiner.
> >> >>
> >> >> There are still some open questions here, notably around how to
> >> >> specify the aggregation fns themselves. We could of course provide a
> >> >> number of built-ins (like SQL does). This gets into the question of
> >> >> how and where to document this complete set, but some basics should
> >> >> take us pretty far. Many aggregators, however, are parameterized
> (e.g.
> >> >> quantiles); where do we put the parameters? We could go with
> something
> >> >> like
> >> >>
> >> >> fn:
> >> >>   type: ApproximateQuantiles
> >> >>   config:
> >> >> n: 10
> >> >>
> >> >> but others are even configured by functions themselves (e.g. LargestN
> >> >> that wants a comparator Fn). Maybe we decide not to support these
> >> >> (yet?)
> >> >>
> >> >> One thing I think we should 

Re: [YAML] Aggregations

2023-10-19 Thread Byron Ellis via dev
Rill is definitely SQL-oriented but I think that's going to be the most
common. Dataframes are explicitly modeled on the relational approach so
that's going to look a lot like SQL, which leaves us with S-style formulas
(which I like but are pretty niche) and I guess pivot tables coming from
the spreadsheet world. Does make me wonder what Rails' ORM looks like these
days (I last used v4), it had some aggregation support and was pretty
declarative...

On Wed, Oct 18, 2023 at 6:06 PM Robert Bradshaw  wrote:

> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis  wrote:
> >
> > Is it worth taking a look at similar prior art in the space?
>
> +1. Pointers welcome.
>
> > The first one that comes to mind is Transform, but with the dbt labs
> acquisition that spec is a lot harder to find. Rill is pretty similar
> though.
>
> Rill seems to be very SQL-based.
>
> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in the notion of more schema'd
> >> data (which has some precedent in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >> total_cost:
> >>   fn: sum
> >>   value: cost
> >> max_cost:
> >>   fn: max
> >>   value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non aggregating)
> >> Grouping operation, this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does). This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >> n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterization of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python  # like MapToFields (and here it'd be harder to mix
> >> and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
>


Re: [YAML] Aggregations

2023-10-18 Thread Byron Ellis via dev
Is it worth taking a look at similar prior art in the space? The first one
that comes to mind is Transform, but with the dbt labs acquisition that
spec is a lot harder to find. Rill is pretty similar though.

On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev 
wrote:

> Beam Yaml has good support for IOs and mappings, but one key missing
> feature for even writing a WordCount is the ability to do Aggregations
> [1]. While the traditional Beam primitive is GroupByKey (and
> CombineValues), we're eschewing KVs in the notion of more schema'd
> data (which has some precedent in our other languages, see the links
> below). The key components the user needs to specify are (1) the key
> fields on which the grouping will take place, (2) the fields
> (expressions?) involved in the aggregation, and (3) what aggregating
> fn to use.
>
> A straw-man example could be something like
>
> type: Aggregating
> config:
>   key: [field1, field2]
>   aggregating:
> total_cost:
>   fn: sum
>   value: cost
> max_cost:
>   fn: max
>   value: cost
>
> This would basically correspond to the SQL expression
>
> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> from table GROUP BY field1, field2"
>
> (though I'm not requiring that we use this as an implementation
> strategy). I do not think we need a separate (non aggregating)
> Grouping operation, this can be accomplished by having a concat-style
> combiner.
>
> There are still some open questions here, notably around how to
> specify the aggregation fns themselves. We could of course provide a
> number of built-ins (like SQL does). This gets into the question of
> how and where to document this complete set, but some basics should
> take us pretty far. Many aggregators, however, are parameterized (e.g.
> quantiles); where do we put the parameters? We could go with something
> like
>
> fn:
>   type: ApproximateQuantiles
>   config:
> n: 10
>
> but others are even configured by functions themselves (e.g. LargestN
> that wants a comparator Fn). Maybe we decide not to support these
> (yet?)
>
> One thing I think we should support, however, is referencing custom
> CombineFns. We have some precedent for this with our Fns from
> MapToFields, where we accept things like inline lambdas and external
> references. Again the topic of how to configure them comes up, as
> these custom Fns are more likely to be parameterized than Map Fns
> (though, to be clear, perhaps it'd be good to allow parameterization of
> MapFns as well). Maybe we allow
>
> language: python  # like MapToFields (and here it'd be harder to mix
> and match per Fn)
> fn:
>   type: ???
>   # should these be nested as config?
>   name: fully.qualified.name
>   path: /path/to/defining/file
>   args: [...]
>   kwargs: {...}
>
> which would invoke the constructor.
>
> I'm also open to other ways of naming/structuring these essential
> parameters if it makes things more clear.
>
> - Robert
>
>
> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
>
> [1] One can of course use SqlTransform for this, but I'm leaning
> towards offering something more native.
>


Re: [NOTICE] Deprecation of Avro classes in "core" and use of "extensions/avro" instead for Java SDK

2023-10-18 Thread Byron Ellis via dev
Awesome!
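
For anyone updating a pipeline because of this, the change is essentially a new
dependency on the beam-sdks-java-extensions-avro module plus an import swap; a
minimal sketch, with the package name assumed from the extension module's layout:

```
// Old location, deprecated in "core" and since removed:
//   import org.apache.beam.sdk.coders.AvroCoder;
// New location, from the beam-sdks-java-extensions-avro module:
import org.apache.beam.sdk.extensions.avro.coders.AvroCoder;
import org.apache.beam.sdk.values.PCollection;

public class AvroCoderMigration {
  // The API surface stays the same; only the package (and build dependency) changes.
  static <T> PCollection<T> withAvroCoder(PCollection<T> records, Class<T> clazz) {
    return records.setCoder(AvroCoder.of(clazz));
  }
}
```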

On Wed, Oct 18, 2023 at 1:14 PM Alexey Romanenko 
wrote:

> Heads up!
>
> Finally, all of the Avro-related code and the Avro dependency that were deprecated
> before (see the message above) have been removed from the Beam Java SDK "core"
> module [1]. We believe that a sufficient number of Beam releases
> (six!) have passed since this code was deprecated, and users have had an
> opportunity to switch to the new Avro extension as recommended before.
>
> We did our best to make this transition as smooth as possible but, please,
> let me know if you find any failed tests or any other strange behavior because
> of this change.
>
> Thanks,
> Alexey
>
>
> [1] https://github.com/apache/beam/pull/27851/
>
>
> On 22 Feb 2023, at 20:21, Robert Bradshaw via dev 
> wrote:
>
> Thanks for pushing this through!
>
> On Wed, Feb 22, 2023 at 10:38 AM Alexey Romanenko
>  wrote:
>
>
> Hi all,
>
> As part of migrating the Avro-related classes from the Java SDK "core"
> module to a dedicated extension [1] (as discussed here [2] and here
> [3]), two important PRs have been merged [4][5]. Therefore, the old Avro-related
> classes became deprecated in "core" (still possible to use but not
> recommended), and all other Beam modules that depended on them switched to
> using "extensions/avro" instead.
>
> We did our best to make this change smooth, compatible, and non-breaking,
> but since this was one of the oldest parts of "core", anything is,
> unfortunately, possible, and we may have missed something despite
> all our efforts. So, considering that, I'd like to ask the community to run any
> kind of tests or pipelines that use, for example, AvroCoder or
> AvroUtils or any other related Avro classes, and to check that the new changes
> don't break anything and everything works as expected.
>
> —
> Alexey
>
> [1] https://github.com/apache/beam/issues/24292
> [2] https://lists.apache.org/thread/mz8hvz8dwhd0tzmv2lyobhlz7gtg4gq7
> [3] https://lists.apache.org/thread/47oz1mlwj0orvo1458v5pw5c20bwt08q
> [4] https://github.com/apache/beam/pull/24992
> [5] https://github.com/apache/beam/pull/25534
>
>
>
>


Re: [ANNOUNCE] New Committer: Sam Whittle

2023-10-16 Thread Byron Ellis via dev
Congrats Sam!

On Mon, Oct 16, 2023 at 10:32 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Congrats Sam!
>
> On Mon, Oct 16, 2023 at 9:32 AM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new
>> committer: Sam Whittle (scwhit...@apache.org).
>>
>> Sam has been contributing to Beam since 2016! In particular, he
>> specializes in streaming and the Dataflow Java worker but his contributions
>> expand naturally from there to the Java SDK, IOs, and even a bit of Python
>> :-). Sam has contributed a ton of code over the years and is generous in
>> code review and sharing his expertise.
>>
>> Considering his contributions to the project over this timeframe, the
>> Beam PMC trusts Sam with the responsibilities of a Beam committer. [1]
>>
>> Thank you Sam! And we are looking to see more of your contributions!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1]
>>
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Byron Ellis via dev
That's a good point--- in the dbt case they're almost always treating that
as a precomputation. I suppose a JinjaTransform isn't totally insane, but
not sure I'd want to introduce Yet Another Way Of Writing A Lambda :-)

On Tue, Oct 10, 2023 at 3:22 PM Robert Bradshaw  wrote:

> On Tue, Oct 10, 2023 at 7:22 AM Byron Ellis via dev 
> wrote:
>
>> FWIW dbt (which is also YAML and has this problem for other reasons) does
>> something like this. It also chooses to assume that everything is a string
>> but allows users to use the Jinja templating language to make those strings
>> dynamic where needed.
>>
>
> Are these only for values that are filled in at runtime (i.e. jinja is a
> pre-processor used before the yaml file is passed to dbt) or can they be
> plugged in (possibly on a per-record basis) from the data itself? (FWIW, I
> think we also want to allow some kind of templating like this to allow for
> parameterized composite PTransforms to be defined in YAML and additionally
> we'll need it for YAML-defined templates (not to be confused with the YAML
> template which is a single flex template whose single parameter is the
> YAML file itself).
>
>
>> Syntactically I think that's a bit nicer to look at than the shell script
>> style and saves having to remember the difference between $() and ${}
>>
>
> +1
>
>
>> On Tue, Oct 10, 2023 at 7:10 AM Kenneth Knowles  wrote:
>>
>>> Since I've been in GHA files lately...
>>>
>>> I think they have a very useful pattern which we could borrow from or
>>> learn from, where setting up the variables happens separately, like
>>> https://github.com/apache/beam/blob/57821c191d322f9f21c01a34c55e0c40eda44f1e/.github/workflows/build_release_candidate.yml#L270
>>>
>>> If we called the section "vars" and then the config could use the vars
>>> in the destination. I'm making this example deliberately a little gross:
>>>
>>>  - vars:
>>> - USER_REGION: $.user.metadata.region
>>> - USER_GROUP: $.user.groups[0].name
>>>  - config:
>>> - path:
>>> gs://output-bucket-${vars.USER_REGION}/files/${vars.USER_GROUP}-${fileio.SHARD_NUM}-${fileio.WINDOW}
>>>
>>> I think it strikes a good balance between arbitrary lambdas and just a
>>> prefix/suffix control, giving a really easy place where we can say "the
>>> whole value of this YAML field is a path expression into the structured
>>> data"
>>>
>>> Kenn
>>>
>>> On Mon, Oct 9, 2023 at 6:09 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> I would say:
>>>>
>>>> sink:
>>>>   type: WriteToParquet
>>>>   config:
>> path: /beam/filesystem/dest
>>>> prefix: 
>>>> suffix: 
>>>>
>>>> Underlying SDK will add the middle part of the file names to make sure
>>>> that files generated by various bundles/windows/shards do not conflict.
>>>>
>>>> This will satisfy the vast majority of use-cases I believe. Fully
>>>> customizing the file pattern sounds like a more advanced use case that can
>>>> be left for "real" SDKs.
>>>>
>>>> For dynamic destinations, I think just making the "path" component
>>>> support  a lambda that is parameterized by the input should be adequate
>>>> since this allows customers to direct files written to different
>>>> destination directories.
>>>>
>>>> sink:
>>>>   type: WriteToParquet
>>>>   config:
>>>> path: 
>>>> prefix: 
>>>> suffix: 
>>>>
>>>> I'm not sure what would be the best way to specify a lambda here
>>>> though. Maybe a regex or the name of a Python callable ?
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> .On Mon, Oct 9, 2023 at 1:49 PM Reuven Lax  wrote:
>>>>>
>>>>>> Just FYI - the reason why names (including prefixes) in
>>>>>> DynamicDestinations were parameterized via a lambda instead of just 
>>>>>> having
>>>>>> the user add it via MapE

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Byron Ellis via dev
FWIW dbt (which is also YAML and has this problem for other reasons) does
something like this. It also chooses to assume that everything is a string
but allows users to use the Jinja templating language to make those strings
dynamic where needed. Syntactically I think that's a bit nicer to look at
than the shell script style and saves having to remember the difference
between $() and ${}

On Tue, Oct 10, 2023 at 7:10 AM Kenneth Knowles  wrote:

> Since I've been in GHA files lately...
>
> I think they have a very useful pattern which we could borrow from or
> learn from, where setting up the variables happens separately, like
> https://github.com/apache/beam/blob/57821c191d322f9f21c01a34c55e0c40eda44f1e/.github/workflows/build_release_candidate.yml#L270
>
> If we called the section "vars" and then the config could use the vars in
> the destination. I'm making this example deliberately a little gross:
>
>  - vars:
> - USER_REGION: $.user.metadata.region
> - USER_GROUP: $.user.groups[0].name
>  - config:
> - path:
> gs://output-bucket-${vars.USER_REGION}/files/${vars.USER_GROUP}-${fileio.SHARD_NUM}-${fileio.WINDOW}
>
> I think it strikes a good balance between arbitrary lambdas and just a
> prefix/suffix control, giving a really easy place where we can say "the
> whole value of this YAML field is a path expression into the structured
> data"
>
> Kenn
>
> On Mon, Oct 9, 2023 at 6:09 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>> I would say:
>>
>> sink:
>>   type: WriteToParquet
>>   config:
>> path: /beam/filesystem/dest
>> prefix: 
>> suffix: 
>>
>> Underlying SDK will add the middle part of the file names to make sure
>> that files generated by various bundles/windows/shards do not conflict.
>>
>> This will satisfy the vast majority of use-cases I believe. Fully
>> customizing the file pattern sounds like a more advanced use case that can
>> be left for "real" SDKs.
>>
>> For dynamic destinations, I think just making the "path" component
>> support  a lambda that is parameterized by the input should be adequate
>> since this allows customers to direct files written to different
>> destination directories.
>>
>> sink:
>>   type: WriteToParquet
>>   config:
>> path: 
>> prefix: 
>> suffix: 
>>
>> I'm not sure what would be the best way to specify a lambda here though.
>> Maybe a regex or the name of a Python callable ?
>>
>> Thanks,
>> Cham
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Oct 9, 2023 at 2:06 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> .On Mon, Oct 9, 2023 at 1:49 PM Reuven Lax  wrote:
>>>
 Just FYI - the reason why names (including prefixes) in
 DynamicDestinations were parameterized via a lambda instead of just having
 the user add it via MapElements is performance. We discussed something
 along the lines of what you are suggesting (essentially having the user
 create a KV where the key contained the dynamic information). The problem
 was that often the size of the generated filepath was often much larger
 (sometimes by 2 OOM) than the information in the record, and there was a
 desire to avoid record blowup. e.g. the record might contain a single
 integer userid, and the filepath prefix would then be
 /long/path/to/output/users/. This was especially bad in cases where the
 data had to be shuffled, and the existing dynamic destinations method
 allowed extracting the filepath only _after_  the shuffle.

>>>
>>> That is a consideration I hadn't thought much of, thanks for
>>> bringing this up.
>>>
>>>
 Now there may not be any good way to keep this benefit in a
 declarative approach such as YAML (or at least a good easy way - we could
 always allow the user to pass in a SQL expression to extract the filename
 from the record!), but we should keep in mind that this might mean that
 YAML-generated pipelines will be less efficient for certain use cases.

>>>
>>> Yep, it's not as straightforward to do in a declarative way. I would
>>> like to avoid mixing UDFs (with their associated languages and execution
>>> environments) if possible. Though I'd like the performance of a
>>> "straightforward" YAML pipeline to be that which one can get writing
>>> straight-line Java (and possibly better, if we can leverage the structure
>>> of schemas everywhere) this is not an absolute requirement for all
>>> features.
>>>
>>> I wonder if separating out a constant prefix vs. the dynamic stuff could
>>> be sufficient to mitigate the blow-up of pre-computing this in most cases
>>> (especially in the context of a larger pipeline). Alternatively, rather
>>> than just a sharding pattern, one could have a full filepattern that
>>> includes format parameters for dynamically computed bits as well as the
>>> shard number, windowing info, etc. (There are pros and cons to this.)
>>>
>>>
 On Mon, Oct 9, 2023 at 12:37 PM Robert 

Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Byron Ellis via dev
I'm +1 on standardizing the names, and while I don't have a strong opinion
on which standard (so long as it's only one), using the Gradle name seems
like a perfectly good choice... I don't know the GHA setup well enough, but
would that help maintain those? Presumably the various actions all
eventually call Gradle, so the name of the action to run tells you
which one you are trying to call?

On Tue, Oct 10, 2023 at 6:57 AM Kenneth Knowles  wrote:

> FWIW I am aware of the README in
> https://github.com/apache/beam/tree/master/.test-infra/jenkins that lists
> the phrases alongside the jobs. This is just wasted work to maintain IMO.
>
> Kenn
>
> On Tue, Oct 10, 2023 at 9:46 AM Kenneth Knowles  wrote:
>
>> *Proposal:* make all the job names exactly match the GH comment to run
>> them and make it also as close as possible to how to reproduce locally
>>
>> *Example problems*:
>>
>>  - We have really silly redundant jobs results like 'Chicago Taxi Example
>> on Dataflow ("Run Chicago Taxi on Dataflow")' and 'Python_Xlang_IO_Dataflow
>> ("Run Python_Xlang_IO_Dataflow PostCommit")'
>>
>>  - We have jobs for which there's no way you could guess the command, like 'Google
>> Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)'
>>
>>  - (nit) We are weirdly inconsistent about using spaces vs underscores. I
>> don't think any of our infrastructure cares about this.
>>
>> *Extra proposal*: make the job name also the local command, where
>> possible
>>
>> *Example: *
>> https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml
>>
>>  - This runs :runners:google-cloud-dataflow-java:validatesRunner
>>  - So make the status label
>> ":runners:google-cloud-dataflow-java:validatesRunner"
>>  - "Run :runners:google-cloud-dataflow-java:validatesRunner" as comment
>>
>> If I want to run it locally, yes there are GCP things I have to set up,
>> but I know the gradle command now.
>>
>> *Corollary*: remove "postcommit" and "precommit" from names, because
>> whether a suite runs before merge or after merge is not a property of the
>> suite.
>>
>> *Caveats*: I haven't been that involved. I didn't do this to Jenkins
>> because they are going away. I didn't do anything to GHA because I don't
>> know if they are ready or in flux.
>>
>> I know this is the sort of thing that invites bikeshedding. It just would
>> save me a few minutes when puzzling out what to care about and how to kick
>> jobs on the release branch validation PR.
>>
>> I'm happy to scrape through the existing stuff and align it. Perfect task
>> for when my brain is too tired for other work.
>>
>> Kenn
>>
>


Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Byron Ellis via dev
This feels like something that maybe should be more explicit? Overloading
the transform name to provide a unique stable id feels like perhaps too
much magic... also maybe feels like this is leaking specific runner
behavior? I get that it's convenient

On Wed, Oct 4, 2023 at 9:16 AM Robert Bradshaw via user <
u...@beam.apache.org> wrote:

> BeamJava and BeamPython have the exact same behavior: transform names
> within must be distinct [1]. This is because we do not necessarily know at
> pipeline construction time if the pipeline will be streaming or batch, or
> if it will be updated in the future, so the decision was made to impose
> this restriction up front. Both will auto-generate a name for you if one is
> not given, but will do so deterministically (not depending on some global
> context) to avoid potential update problems.
>
> [1] Note that this applies to the fully qualified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran 
> wrote:
>
>> Cross posting this thread to dev@ to see if this is intentional behavior
>> or if it's something worth changing for the python SDK
>>
>> On Tue, Oct 3, 2023, 10:10 PM XQ Hu via user 
>> wrote:
>>
>>> That suggests the default label is created as that, which indeed causes
>>> the duplication error.
>>>
>>> On Tue, Oct 3, 2023 at 9:15 PM Joey Tran 
>>> wrote:
>>>
 Not sure what that suggests

 On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user 
 wrote:

> Looks like this is the current behaviour. If you have `t =
> beam.Filter(identity_filter)`, `t.label` is defined as
> `Filter(identity_filter)`.
>
> On Mon, Oct 2, 2023 at 9:25 AM Joey Tran 
> wrote:
>
>> You don't have to specify the names if the callable you pass in is
>> /different/ for two `beam.Map`s, but  if the callable is the same you 
>> must
>> specify a label. For example, the below will raise an exception:
>>
>> ```
>> | beam.Filter(identity_filter)
>> | beam.Filter(identity_filter)
>> ```
>>
>> Here's an example on playground that shows the error message you get
>> [1]. I marked every line I added with a "# ++".
>>
>> It's a contrived example, but using a map or filter at the same
>> pipeline level probably comes up often, at least in my inexperience. For
>> example, you. might have a pipeline that partitions a pcoll into three
>> different pcolls, runs some transforms on them, and then runs the same 
>> type
>> of filter on each of them.
>>
>> The case that happens most often for me is using the `assert_that`
>> [2] testing transform. In this case, I think often users will really have
>> no need for a disambiguating label as they're often just writing unit 
>> tests
>> that test a few different properties of their workflow.
>>
>> [1] https://play.beam.apache.org/?sdk=python=hIrm7jvCamW
>> [2]
>> https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.testing.util.html#apache_beam.testing.util.assert_that
>>
>> On Mon, Oct 2, 2023 at 9:08 AM Bruno Volpato via user <
>> u...@beam.apache.org> wrote:
>>
>>> If I understand the question correctly, you don't have to specify
>>> those names.
>>>
>>> As Reuven pointed out, it is probably a good idea so you have a
>>> stable / deterministic graph.
>>> But in the Python SDK, you can simply use pcollection | map_fn,
>>> instead of pcollection | 'Map' >> map_fn.
>>>
>>> See an example here
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/group_with_coder.py#L100-L116
>>>
>>>
>>> On Sun, Oct 1, 2023 at 9:08 PM Joey Tran 
>>> wrote:
>>>
 Hmm, I'm not sure what you mean by "updating pipelines in place".
 Can you elaborate?

 I forgot to mention my question is posed from the context of a
 python SDK user, and afaict, there doesn't seem to be an obvious way to
 autogenerate names/labels. Hearing that the java SDK supports it makes 
 me
 wonder if the python SDK could support it as well though... (If so, 
 I'd be
 happy to do implement it). Currently, it's fairly tedious to have to 
 name
 every instance of a transform that you might reuse in a pipeline, e.g. 
 when
 reapplying the same Map on different pcollections.

 On Sun, Oct 1, 2023 at 8:12 PM Reuven Lax via user <
 u...@beam.apache.org> wrote:

> Are you talking about transform names? The main reason was because
> for runners that support updating pipelines in place, the only way to 
> do so
> safely is if the runner can perfectly identify which transforms in 
> the new

Re: [ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-03 Thread Byron Ellis via dev
Congrats!

On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev 
wrote:

> Congratulations Alex!! Definitely well deserved!
>
> On Tue, Oct 3, 2023 at 2:57 PM Ahmet Altay via dev 
> wrote:
>
>> Congratulations Alex! Well deserved!
>>
>> On Tue, Oct 3, 2023 at 11:54 AM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations Alex!
>>>
>>> On Tue, Oct 3, 2023 at 2:54 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Alex, this is well deserved!

 On Tue, Oct 3, 2023 at 2:50 PM Jack McCluskey via dev <
 dev@beam.apache.org> wrote:

> Congrats, Alex!
>
> On Tue, Oct 3, 2023 at 2:49 PM XQ Hu via dev 
> wrote:
>
>> Congratulations, Alex!
>>
>> On Tue, Oct 3, 2023 at 2:40 PM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Alex Van
>>> Boxel  as our newest PMC member.
>>>
>>> Alex has been with Beam since 2016, very early in the life of the
>>> project. Alex has contributed code, design ideas, and perhaps most
>>> importantly been a huge part of organizing Beam Summits, and of course
>>> presenting at them as well. Alex really brings the ASF community spirit 
>>> to
>>> Beam.
>>>
>>> Congratulations Alex and thanks for being a part of Apache Beam!
>>>
>>> Kenn, on behalf of the Beam PMC (which now includes Alex)
>>>
>>


Re: [ANNOUNCE] New PMC Member: Robert Burke

2023-10-03 Thread Byron Ellis via dev
Congrats!

On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev 
wrote:

> Congrats Rebo!!! Can't wait to work more closely with you.
>
> On Tue, Oct 3, 2023 at 2:56 PM Ahmet Altay via dev 
> wrote:
>
>> Congratulations Robert! Well deserved!
>>
>> On Tue, Oct 3, 2023 at 11:54 AM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations Robert!
>>>
>>> On Tue, Oct 3, 2023 at 2:52 PM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Robert, this is very well deserved!

 On Tue, Oct 3, 2023 at 2:50 PM Anand Inguva via dev <
 dev@beam.apache.org> wrote:

> Congratulations!!
>
> On Tue, Oct 3, 2023 at 2:49 PM XQ Hu via dev 
> wrote:
>
>> Congratulations, Robert!
>>
>> On Tue, Oct 3, 2023 at 2:40 PM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Robert
>>> Burke  as our newest PMC member.
>>>
>>> Robert has been a part of the Beam community since 2017. He is our
>>> resident Gopher, producing the Go SDK and most recently the local,
>>> portable, Prism runner. Robert has presented on Beam many times, having
>>> written not just core Beam code but quite interesting pipelines too :-)
>>>
>>> Congratulations Robert and thanks for being a part of Apache Beam!
>>>
>>> Kenn, on behalf of the Beam PMC (which now includes Robert)
>>>
>>


Re: [ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-03 Thread Byron Ellis via dev
Congrats!

On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev 
wrote:

> Congrats Valentyn!!
>
> On Tue, Oct 3, 2023 at 2:59 PM Ahmet Altay via dev 
> wrote:
>
>> Congratulations Valentyn! Well deserved!
>>
>> On Tue, Oct 3, 2023 at 11:54 AM Ritesh Ghorse via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations Valentyn!
>>>
>>> On Tue, Oct 3, 2023 at 2:53 PM Jack McCluskey via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congrats Valentyn!

 On Tue, Oct 3, 2023 at 2:40 PM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming Valentyn
> Tymofieiev  as our newest PMC member.
>
> Valentyn has been contributing to Beam since 2017. Notable highlights
> include his work on the Python SDK and also in our container management.
> Valentyn also is involved in many discussions around Beam's infrastructure
> and community processes. If you look through Valentyn's history, you will
> see an abundance of the most critical maintenance work that is the beating
> heart of any project.
>
> Congratulations Valentyn and thanks for being a part of Apache Beam!
>
> Kenn, on behalf of the Beam PMC (which now includes Valentyn)
>



Re: User-facing website vs. contributor-facing website

2023-09-22 Thread Byron Ellis via dev
I feel like that's actually pretty easy with GitHub Actions? I think maybe
there's even one that exists for GitHub Pages, and probably for any other static
site generator thingy we could care to name. Related, I stumbled across
this the other day: https://github.com/apache/beam-site which appears to be
unused and could probably even have different review and committer sets
if we wanted?

On Thu, Sep 21, 2023 at 3:19 PM Robert Bradshaw via dev 
wrote:

> TBH, I'm not a huge fan of the wikis either. My ideal flow would be
> something like g3doc, and markdown files in github do a reasonable enough
> job emulating that. (I don't think the overhead of having to do a PR for
> small edits like typos is onerous, as those are super easy reviews to do as
> well...) For anything in-depth, a pointer to an "actual" doc with better
> collaborative editing tools is generally in order anyway.
>
> I do feel strongly that https://beam.apache.org/contribute/ should remain
> on the main site, as it's aimed at users (who hopefully want to step up and
> contribute). The top level should probably mostly be a pointer to this, but
> I think it's valuable (for the audience that reaches it from github) to be
> a bit taylored to that audience (e.g. assume they just forked/downloaded
> the repository and want to edit-build-push. Generally a more advanced user
> than would find the page on the website.)
>
> The release guide? Meh. Wherever those doing releases find it most
> convenient. If that was me I'd probably put a markdown file right in the
> release directory next to the relevant scripts... (If not jump to literate
> programming right there :).
>
> On Thu, Sep 21, 2023 at 1:20 PM Kenneth Knowles  wrote:
>
>>
>>
>> On Thu, Sep 21, 2023 at 3:55 PM Danny McCormick <
>> dannymccorm...@google.com> wrote:
>>
>>>  > - reviewed
>>>
>>> Generally, I'm actually probably -0 on this one - it depends on context,
>>> but things that are for other developers only are usually better off
>>> without this requirement IMO since you get more contributions and more
>>> useful/unpolished things. Unfortunately, I'm not sure if confluence
>>> actually meets the bar for easy to update though because getting an
>>> account/initial setup is a pain. So I'm -0 since I don't know of a tool
>>> that both allows people to easily edit and avoids spam, but if such a tool
>>> exists I'd strongly prefer that.
>>>
>>> >  - discoverable/orientable aka top/side nav
>>>
>>> I'm -1 on this requirement. A structured in-repo `docs` folder and/or a
>>> dedicated developer documentation repo have worked well on teams I've been
>>> on in the past and it avoids having to maintain additional infrastructure
>>> for a website. It also brings folks closer to the code, making edits more
>>> likely. These look nice, but I don't know how much value they actually add.
>>>
>>> > I did a quick search to see if there was a standard answer to having
>>> top and side nav for a docs/ folder of markdown in your github repo. I
>>> guess that is GitHub Pages? TBH I have used them happily in the distant
>>> past but somehow I thought they had been deprecated or something.
>>>
>>> I'm probably -1 on pages because at that point we've got 2 different
>>> website setups, one using hugo (https://beam.apache.org/) and one using
>>> Jekyl (pages); at that point, we might as well just move things totally
>>> back into the website and just have it live under a separate section of the
>>> site.
>>>
>>> My vote if we're moving away from confluence (which seems fine) would be
>>> either a dedicated `docs` or `developer-docs` folder or a separate markdown
>>> only repo.
>>>
>>
>> I could go for this. I'm pretty -1 on a soup of files without any
>> information architecture or scattered throughout random folders. But I'm
>> probably -2 on the confluence wiki if such a thing is possible and it would
>> also remove a piece from our infra, so... I think I'd call it an upgrade to
>> have a folder full of docs. If we then make taxonomic subfolders that hide
>> all the information I'll be sad again.
>>
>> Ideally the developer-docs/ folder could be read as text, lightly
>> rendered like GH does, or fully rendered with navs. Yes, I am describing
>> g3doc (which is talked about publicly so I can name it, but I don't know
>> what the publicly-available equivalent is). None of the
>> website-building not-human-readable stuff from jekyll and hugo.
>>
>> Kenn
>>
>>
>>>
>>> On Thu, Sep 21, 2023 at 3:30 PM Kenneth Knowles  wrote:
>>>
 OK so this did turn into a discussion all about the tech/hosting :-).
 It has been 5 years and we have experience of the wiki now so maybe that is
 fair anyhow. And perhaps the preference of where to put information cannot
 be separated from it.

 Top posting because there was so much in common across the responses
 and I agree mostly too so I'll merge & paraphrase.

 > Focusing the main website primarily toward users is good

 Seems everyone still 

Re: Runner Bundling Strategies

2023-09-22 Thread Byron Ellis via dev
I've actually wondered about this specifically for streaming... if you're
writing a pipeline there, it seems like you're often going to want to put
high-fixed-cost things like database connections even outside of the bundle
setup. You really only want to do that once in the lifetime of the worker
itself, not the bundle. Seems like having that boundary be somewhere other
than an arbitrary (and, in streaming, probably small to avoid latency)
group of elements might be more useful? I suppose this depends heavily on
the object lifecycle in the SDK worker though.
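
For the Java SDK at least, that worker-vs-bundle boundary is already expressible:
@Setup/@Teardown run once per DoFn instance (which runners typically keep alive
across many bundles), while @StartBundle/@FinishBundle are per bundle. A minimal
sketch using a plain JDBC connection, just to show where each cost would go:

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.apache.beam.sdk.transforms.DoFn;

class WriteRowsFn extends DoFn<String, Void> {
  private final String jdbcUrl;
  private transient Connection connection;

  WriteRowsFn(String jdbcUrl) {
    this.jdbcUrl = jdbcUrl;
  }

  @Setup
  public void setup() throws Exception {
    // High fixed cost: once per DoFn instance on a worker, reused across bundles.
    connection = DriverManager.getConnection(jdbcUrl);
    connection.setAutoCommit(false);
  }

  @StartBundle
  public void startBundle() {
    // Per-bundle work only; in streaming these bundles may be very small.
  }

  @ProcessElement
  public void processElement(@Element String sql) throws Exception {
    try (Statement stmt = connection.createStatement()) {
      stmt.execute(sql); // treating each element as a SQL statement for the sketch
    }
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    connection.commit(); // flush whatever this bundle accumulated
  }

  @Teardown
  public void teardown() throws Exception {
    if (connection != null) {
      connection.close();
    }
  }
}
```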

Best,
B

On Fri, Sep 22, 2023 at 7:03 AM Kenneth Knowles  wrote:

> (I notice that you replied only to yourself, but there has been a whole
> thread of discussion on this - are you subscribed to dev@beam?
> https://lists.apache.org/thread/k81fq301ypwmjowknzyqq2qc63844rbd)
>
> It sounds like you want what everyone wants: to have the biggest bundles
> possible.
>
> So for bounded data, basically you make even splits of the data and each
> split is one bundle. And then dynamic splitting to redistribute work to
> eliminate stragglers, if your engine has that capability.
>
> For unbounded data, you more-or-less bundle as much as you can without
> waiting too long, like Jan described.
>
> Users know to put their high fixed costs in @StartBundle and then it is
> the runner's job to put as many calls to @ProcessElement as possible to
> amortize.
>
> Kenn
>
> On Fri, Sep 22, 2023 at 9:39 AM Joey Tran 
> wrote:
>
>> Whoops, I typoed my last email. I meant to write "this isn't the
>> greatest strategy for high *fixed* cost transforms", e.g. a transform
>> that takes 5 minutes to get set up and then maybe a microsecond per input
>>
>> I suppose one solution is to move the responsibility for handling this
>> kind of situation to the user and expect users to use a bundling transform
>> (e.g. BatchElements [1]) followed by a Reshuffle+FlatMap. Is this what
>> other runners expect? Just want to make sure I'm not missing some smart
>> generic bundling strategy that might handle this for users.
>>
>> [1]
>> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
>>
>>
>> On Thu, Sep 21, 2023 at 7:23 PM Joey Tran 
>> wrote:
>>
>>> I'm writing a runner, and the first strategy for determining bundle size
>>> was to just start with a bundle size of one and double it until we reach a
>>> size that we expect to take some target per-bundle runtime (e.g. maybe 10
>>> minutes). I realize that this isn't the greatest strategy for high sized
>>> cost transforms. I'm curious what kind of strategies other runners take?
>>>
>>


Re: [Request for Feedback] Swift SDK Prototype

2023-09-20 Thread Byron Ellis via dev
>>>>>>>>> rather than dealing with a single element like you do in the 
>>>>>>>>>>>>>> other SDKs
>>>>>>>>>>>>>> you're dealing with a stream of elements (which of course will 
>>>>>>>>>>>>>> often be of
>>>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift world 
>>>>>>>>>>>>>> especially
>>>>>>>>>>>>>> with the async / await structures. So when you see something 
>>>>>>>>>>>>>> like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for try await (filename,_,_) in filenames {
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>   output.emit(data)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> filenames is the input stream and then output and errors are
>>>>>>>>>>>>>> both output streams. In theory you can have as many output 
>>>>>>>>>>>>>> streams as you
>>>>>>>>>>>>>> like though at the moment there's a compiler bug in the new type 
>>>>>>>>>>>>>> pack
>>>>>>>>>>>>>> feature that limits it to "as many as I felt like supporting". 
>>>>>>>>>>>>>> Presumably
>>>>>>>>>>>>>> this will get fixed before the official 5.9 release which will 
>>>>>>>>>>>>>> probably be
>>>>>>>>>>>>>> in the October timeframe if history is any guide)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you had parameterization you wanted to send that would
>>>>>>>>>>>>>> look like pardo("Parameter") { param,filenames,output,error in 
>>>>>>>>>>>>>> ... } where
>>>>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is 
>>>>>>>>>>>>>> being
>>>>>>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like you
>>>>>>>>>>>>>> have in ES6 and other things where "_" is Swift for "ignore." In 
>>>>>>>>>>>>>> this case
>>>>>>>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) 
>>>>>>>>>>>>>> so you can
>>>>>>>>>>>>>> optionally extract the timestamp and the window if you want to 
>>>>>>>>>>>>>> manipulate
>>>>>>>>>>>>>> it somehow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>>>>>>> pardos--- that would probably mean having explicit type 
>>>>>>>>>>>>>> signatures in the
>>>>>>>>>>>>>> closure. I had that at one point, but it felt less natural the 
>>>>>>>>>>>>>> more I used
>>>>>>>>>>>>>> it. I'm also slowly working towards adding a more "traditional" 
>>>>>>>>>>>>>> DoFn
>>>>>>>>>>>>>> implementation approach where you implement the DoFn as an 
>>>>>>>>>>>>>> object type. In
>>>>>>>>>>>>>> that case it would be very very easy to support both by having a 
>>>>>>>>

Re: [Request for Feedback] Swift SDK Prototype

2023-09-20 Thread Byron Ellis via dev
a couple of places to do a true wordcount 
>>>>>>>>>>> using good
>>>>>>>>>>> ol' Shakespeare and very very primitive GCS IO.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> B
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis <
>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo syntax
>>>>>>>>>>>> quite a bit before settling on where I ended up. Ultimately I 
>>>>>>>>>>>> decided to go
>>>>>>>>>>>> with something that felt more Swift-y than anything else which 
>>>>>>>>>>>> means that
>>>>>>>>>>>> rather than dealing with a single element like you do in the other 
>>>>>>>>>>>> SDKs
>>>>>>>>>>>> you're dealing with a stream of elements (which of course will 
>>>>>>>>>>>> often be of
>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift world 
>>>>>>>>>>>> especially
>>>>>>>>>>>> with the async / await structures. So when you see something like:
>>>>>>>>>>>>
>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>>>>
>>>>>>>>>>>> for try await (filename,_,_) in filenames {
>>>>>>>>>>>>   ...
>>>>>>>>>>>>   output.emit(data)
>>>>>>>>>>>>
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> filenames is the input stream and then output and errors are
>>>>>>>>>>>> both output streams. In theory you can have as many output streams 
>>>>>>>>>>>> as you
>>>>>>>>>>>> like though at the moment there's a compiler bug in the new type 
>>>>>>>>>>>> pack
>>>>>>>>>>>> feature that limits it to "as many as I felt like supporting". 
>>>>>>>>>>>> Presumably
>>>>>>>>>>>> this will get fixed before the official 5.9 release which will 
>>>>>>>>>>>> probably be
>>>>>>>>>>>> in the October timeframe if history is any guide)
>>>>>>>>>>>>
>>>>>>>>>>>> If you had parameterization you wanted to send that would look
>>>>>>>>>>>> like pardo("Parameter") { param,filenames,output,error in ... } 
>>>>>>>>>>>> where
>>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is 
>>>>>>>>>>>> being
>>>>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like you have
>>>>>>>>>>>> in ES6 and other things where "_" is Swift for "ignore." In this 
>>>>>>>>>>>> case
>>>>>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) 
>>>>>>>>>>>> so you can
>>>>>>>>>>>> optionally extract the timestamp and the window if you want to 
>>>>>>>>>>>> manipulate
>>>>>>>>>>>> it somehow.
>>>>>>>>>>>>
>>>>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>>>>> pardos--- that would probably mean having explicit type signatures 
>>>>>>>>>>>> in the
>>>>>>>>>>>> closure. I had that at one point, but it felt le

Re: [Request for Feedback] Swift SDK Prototype

2023-09-14 Thread Byron Ellis via dev
 / await structures. So when you see something like:
>>>>>>>>>>>
>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>>>
>>>>>>>>>>> for try await (filename,_,_) in filenames {
>>>>>>>>>>>   ...
>>>>>>>>>>>   output.emit(data)
>>>>>>>>>>>
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> filenames is the input stream and then output and errors are
>>>>>>>>>>> both output streams. In theory you can have as many output streams 
>>>>>>>>>>> as you
>>>>>>>>>>> like though at the moment there's a compiler bug in the new type 
>>>>>>>>>>> pack
>>>>>>>>>>> feature that limits it to "as many as I felt like supporting". 
>>>>>>>>>>> Presumably
>>>>>>>>>>> this will get fixed before the official 5.9 release which will 
>>>>>>>>>>> probably be
>>>>>>>>>>> in the October timeframe if history is any guide)
>>>>>>>>>>>
>>>>>>>>>>> If you had parameterization you wanted to send that would look
>>>>>>>>>>> like pardo("Parameter") { param,filenames,output,error in ... } 
>>>>>>>>>>> where
>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is being
>>>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like you have
>>>>>>>>>>> in ES6 and other things where "_" is Swift for "ignore." In this 
>>>>>>>>>>> case
>>>>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) so 
>>>>>>>>>>> you can
>>>>>>>>>>> optionally extract the timestamp and the window if you want to 
>>>>>>>>>>> manipulate
>>>>>>>>>>> it somehow.
>>>>>>>>>>>
>>>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>>>> pardos--- that would probably mean having explicit type signatures 
>>>>>>>>>>> in the
>>>>>>>>>>> closure. I had that at one point, but it felt less natural the more 
>>>>>>>>>>> I used
>>>>>>>>>>> it. I'm also slowly working towards adding a more "traditional" DoFn
>>>>>>>>>>> implementation approach where you implement the DoFn as an object 
>>>>>>>>>>> type. In
>>>>>>>>>>> that case it would be very very easy to support both by having a 
>>>>>>>>>>> default
>>>>>>>>>>> stream implementation call the equivalent of processElement. To 
>>>>>>>>>>> make that
>>>>>>>>>>> performant I need to implement an @DoFn macro and I just haven't 
>>>>>>>>>>> gotten to
>>>>>>>>>>> it yet.
>>>>>>>>>>>
>>>>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>>>>> composite and external transforms for the reasons you suggest. :-) 
>>>>>>>>>>> I've got
>>>>>>>>>>> the basics of a composite transform (there's an equivalent wordcount
>>>>>>>>>>> example) and am hooking it into the pipeline generation, which 
>>>>>>>>>>> should also
>>>>>>>>>>> give me everything I need to successfully hook in external 
>>>>>>>>>>> transforms as
>>>>>>>>>>> well. That will give me the jump on IOs as you say. I can also treat
>>>>>>>>>>> the pipeline itself as a composite transform which lets me get rid of the
>>>>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach themselves
>>>>>>>>>>> to the pipeline implicitly.
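
For readers following along, here is a minimal sketch that stitches together the
pieces described above: a parameterized pardo, the async stream input, tuple
destructuring of (element, timestamp, window), and emitting to output and error
streams. How the pardo attaches to its input PCollection is elided, and the
element type, parameter value, and closure body are invented for illustration;
the prototype's actual signatures may differ.

// Illustrative sketch only, following the shapes quoted above.
pardo("words.txt") { filename, lines, output, errors in
    // "filename" takes on the parameter value ("words.txt" here).
    for try await (line, timestamp, _) in lines {
        // Each element arrives as an (element, timestamp, window) tuple;
        // "_" ignores the window since it isn't needed in this example.
        if line.isEmpty {
            errors.emit("empty line seen at \(timestamp)")
        } else {
            output.emit(line.uppercased())
        }
    }
}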

Re: [Request for Feedback] Swift SDK Prototype

2023-08-29 Thread Byron Ellis via dev
>>>>>>>>>> In that case it would be very very easy to support both by having a default
>>>>>>>>>> stream implementation call the equivalent of processElement. To make
>>>>>>>>>> that
>>>>>>>>>> performant I need to implement an @DoFn macro and I just haven't 
>>>>>>>>>> gotten to
>>>>>>>>>> it yet.
>>>>>>>>>>
>>>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>>>> composite and external transforms for the reasons you suggest. :-) 
>>>>>>>>>> I've got
>>>>>>>>>> the basics of a composite transform (there's an equivalent wordcount
>>>>>>>>>> example) and am hooking it into the pipeline generation, which 
>>>>>>>>>> should also
>>>>>>>>>> give me everything I need to successfully hook in external 
>>>>>>>>>> transforms as
>>>>>>>>>> well. That will give me the jump on IOs as you say. I can also treat 
>>>>>>>>>> the
>>>>>>>>>> pipeline itself as a composite transform which lets me get rid of the
>>>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach 
>>>>>>>>>> themselves
>>>>>>>>>> to the pipeline implicitly.
>>>>>>>>>>
>>>>>>>>>> That said, there are some interesting IO possibilities that would
>>>>>>>>>> be Swift native. In particular, I've been looking at the native 
>>>>>>>>>> Swift
>>>>>>>>>> binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>>>>>>>>> distributed in the same way as, say, Beam SQL... but it would allow 
>>>>>>>>>> for SQL
>>>>>>>>>> statements on individual files with projection pushdown supported for
>>>>>>>>>> things like Parquet which could have some cool and performant data 
>>>>>>>>>> lake
>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs as
>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that would 
>>>>>>>>>> give
>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes it pretty 
>>>>>>>>>> easy to
>>>>>>>>>> work with GCS.
>>>>>>>>>>
>>>>>>>>>> In any case, I'm updating the branch as I find a minute here and
>>>>>>>>>> there.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> B
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Neat.
>>>>>>>>>>>
>>>>>>>>>>>> Nothing like writing an SDK to actually understand how the
>>>>>>>>>>> FnAPI works :). I like the use of groupBy. I have to admit I'm a bit
>>>>>>>>>>> mystified by the syntax for parDo (I don't know swift at all which 
>>>>>>>>>>> is
>>>>>>>>>>> probably tripping me up). The addition of external (cross-language)
>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) pretty quickly 
>>>>>>>>>>> from
>>>>>>>>>>> other SDKs.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For everyone who is interested, here's the draft PR:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/beam/pull/28062
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet though
>>>>>>>>>>>> (there's a good chance there are a few places that need to 
>>>>>>>>>>>> properly address
>>>>>>>>>>>> endianness. Specifically timestamps in windowed values and length 
>>>>>>>>>>>> in
>>>>>>>>>>>> iterable coders as those both use specifically bigendian 
>>>>>>>>>>>> representations)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis <
>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Cham,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can
>>>>>>>>>>>>> comment---there's not as much code as it looks like since most of 
>>>>>>>>>>>>> the LOC
>>>>>>>>>>>>> is just generated protobuf. As for the support, I definitely want 
>>>>>>>>>>>>> to add
>>>>>>>>>>>>> external transforms and may actually add that support before 
>>>>>>>>>>>>> adding the
>>>>>>>>>>>>> ability to make composites in the language itself. With the way 
>>>>>>>>>>>>> the SDK is
>>>>>>>>>>>>> laid out adding composites to the pipeline graph is a separate 
>>>>>>>>>>>>> operation
>>>>>>>>>>>>> than defining a composite.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is
>>>>>>>>>>>>>> interest in Swift SDK from folks currently subscribed to the
>>>>>>>>>>>>>> +user  list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to really
>>>>>>>>>>>>>>> understand how the Beam FnApi works and how it interacts with 
>>>>>>>>>>>>>>> the Portable
>>>>>>>>>>>>>>> Runner. For me at least that usually means I need to write some 
>>>>>>>>>>>>>>> code so I
>>>>>>>>>>>>>>> can see things happening in a debugger and to really prove to 
>>>>>>>>>>>>>>> myself I
>>>>>>>>>>>>>>> understood what was going on I decided I couldn't use an 
>>>>>>>>>>>>>>> existing SDK
>>>>>>>>>>>>>>> language to do it since there would be the temptation to read 
>>>>>>>>>>>>>>> some code and
>>>>>>>>>>>>>>> convince myself that I actually understood what was going on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One thing led to another and it turns out that to get a
>>>>>>>>>>>>>>> minimal FnApi integration going you end up writing a fair bit 
>>>>>>>>>>>>>>> of an SDK. So
>>>>>>>>>>>>>>> I decided to take things to a point where I had an SDK that 
>>>>>>>>>>>>>>> could execute a
>>>>>>>>>>>>>>> word count example via a portable runner backend. I've now 
>>>>>>>>>>>>>>> reached that
>>>>>>>>>>>>>>> point and would like to submit my prototype SDK to the list for 
>>>>>>>>>>>>>>> feedback.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's currently living in a branch on my fork here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> At the moment it runs via the most recent XCode Beta using
>>>>>>>>>>>>>>> Swift 5.9 on Intel Macs, but should also work using beta builds 
>>>>>>>>>>>>>>> of 5.9 for
>>>>>>>>>>>>>>> Linux running on Intel hardware. I haven't had a chance to try 
>>>>>>>>>>>>>>> it on ARM
>>>>>>>>>>>>>>> hardware and make sure all of the endian checks are complete. 
>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count example 
>>>>>>>>>>>>>>> that reads some
>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ 
>>>>>>>>>>>>>>> functionality) and
>>>>>>>>>>>>>>> output counts through two separate group by operations to get 
>>>>>>>>>>>>>>> it past the
>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against the 
>>>>>>>>>>>>>>> Python Portable
>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no Direct 
>>>>>>>>>>>>>>> Runner at this
>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've shown it to a couple of folks already and incorporated
>>>>>>>>>>>>>>> some of that feedback already (for example pardo was originally 
>>>>>>>>>>>>>>> called dofn
>>>>>>>>>>>>>>> when defining pipelines). In general I've tried to make the API 
>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>> "Swift-y" as possible, hence the heavy reliance on closures and 
>>>>>>>>>>>>>>> while there
>>>>>>>>>>>>>>> aren't yet composite PTransforms there's the beginnings of what 
>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>> needed for a SwiftUI-like declarative API for creating them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are of course a ton of missing bits still to be
>>>>>>>>>>>>>>> implemented, like counters, metrics, windowing, state, timers, 
>>>>>>>>>>>>>>> etc.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This should be fine and we can get the code documented
>>>>>>>>>>>>>> without these features. I think support for composites and 
>>>>>>>>>>>>>> adding an
>>>>>>>>>>>>>> external transform (see, Java
>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>>>>>>>>>>>> Python
>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>>>>>>>>>>>> Go
>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>>>>>>>>>>>> TypeScript
>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of features 
>>>>>>>>>>>>>> (for example,
>>>>>>>>>>>>>> I/O connectors) for free.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a PR if
>>>>>>>>>>>>>>> folks are interested, though the "Swift Way" would be to have 
>>>>>>>>>>>>>>> it in its own
>>>>>>>>>>>>>>> repo so that it can easily be used from the Swift Package 
>>>>>>>>>>>>>>> Manager.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1 for creating a PR (may be as a draft initially). Also
>>>>>>>>>>>>>> it'll be easier to comment on a PR :)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Cham
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>


Re: [Request for Feedback] Swift SDK Prototype

2023-08-25 Thread Byron Ellis via dev
>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>
>>>>>>>>> for try await (filename,_,_) in filenames {
>>>>>>>>>   ...
>>>>>>>>>   output.emit(data)
>>>>>>>>>
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> filenames is the input stream and then output and errors are both
>>>>>>>>> output streams. In theory you can have as many output streams as you 
>>>>>>>>> like
>>>>>>>>> though at the moment there's a compiler bug in the new type pack 
>>>>>>>>> feature
>>>>>>>>> that limits it to "as many as I felt like supporting". Presumably 
>>>>>>>>> this will
>>>>>>>>> get fixed before the official 5.9 release which will probably be in 
>>>>>>>>> the
>>>>>>>>> October timeframe if history is any guide)
>>>>>>>>>
>>>>>>>>> If you had parameterization you wanted to send that would look
>>>>>>>>> like pardo("Parameter") { param,filenames,output,error in ... } where
>>>>>>>>> "param" would take on the value of "Parameter." All of this is being
>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> the (filename,_,_) is a tuple spreading construct like you have in
>>>>>>>>> ES6 and other things where "_" is Swift for "ignore." In this case
>>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) so 
>>>>>>>>> you can
>>>>>>>>> optionally extract the timestamp and the window if you want to 
>>>>>>>>> manipulate
>>>>>>>>> it somehow.
>>>>>>>>>
>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>> pardos--- that would probably mean having explicit type signatures in 
>>>>>>>>> the
>>>>>>>>> closure. I had that at one point, but it felt less natural the more I 
>>>>>>>>> used
>>>>>>>>> it. I'm also slowly working towards adding a more "traditional" DoFn
>>>>>>>>> implementation approach where you implement the DoFn as an object 
>>>>>>>>> type. In
>>>>>>>>> that case it would be very very easy to support both by having a 
>>>>>>>>> default
>>>>>>>>> stream implementation call the equivalent of processElement. To make 
>>>>>>>>> that
>>>>>>>>> performant I need to implement an @DoFn macro and I just haven't 
>>>>>>>>> gotten to
>>>>>>>>> it yet.
>>>>>>>>>
>>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>>> composite and external transforms for the reasons you suggest. :-) 
>>>>>>>>> I've got
>>>>>>>>> the basics of a composite transform (there's an equivalent wordcount
>>>>>>>>> example) and am hooking it into the pipeline generation, which should 
>>>>>>>>> also
>>>>>>>>> give me everything I need to successfully hook in external transforms 
>>>>>>>>> as
>>>>>>>>> well. That will give me the jump on IOs as you say. I can also treat 
>>>>>>>>> the
>>>>>>>>> pipeline itself as a composite transform which lets me get rid of the
>>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach 
>>>>>>>>> themselves
>>>>>>>>> to the pipeline implicitly.
>>>>>>>>>
>>>>>>>>> That said, there are some interesting IO possibilities that would
>>>>>>>>> be Swift native. In particular, I've been looking at the native 
>>>>>>>>> Swift
>>>>>>>>> binding for DuckDB (which is C++ based).

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Byron Ellis via dev
>>>>>>>> "param" would take on the value of "Parameter." All of this is being 
>>>>>>>> typechecked at
>>>>>>>> compile time BTW.
>>>>>>>>
>>>>>>>>
>>>>>>>> the (filename,_,_) is a tuple spreading construct like you have in
>>>>>>>> ES6 and other things where "_" is Swift for "ignore." In this case
>>>>>>>> PCollectionStreams have an element signature of (Of,Date,Window) so 
>>>>>>>> you can
>>>>>>>> optionally extract the timestamp and the window if you want to 
>>>>>>>> manipulate
>>>>>>>> it somehow.
>>>>>>>>
>>>>>>>> That said it would also be natural to provide elementwise pardos---
>>>>>>>> that would probably mean having explicit type signatures in the 
>>>>>>>> closure. I
>>>>>>>> had that at one point, but it felt less natural the more I used it. I'm
>>>>>>>> also slowly working towards adding a more "traditional" DoFn 
>>>>>>>> implementation
>>>>>>>> approach where you implement the DoFn as an object type. In that case 
>>>>>>>> it
>>>>>>>> would be very very easy to support both by having a default stream
>>>>>>>> implementation call the equivalent of processElement. To make that
>>>>>>>> performant I need to implement an @DoFn macro and I just haven't 
>>>>>>>> gotten to
>>>>>>>> it yet.
>>>>>>>>
>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>> composite and external transforms for the reasons you suggest. :-) 
>>>>>>>> I've got
>>>>>>>> the basics of a composite transform (there's an equivalent wordcount
>>>>>>>> example) and am hooking it into the pipeline generation, which should 
>>>>>>>> also
>>>>>>>> give me everything I need to successfully hook in external transforms 
>>>>>>>> as
>>>>>>>> well. That will give me the jump on IOs as you say. I can also treat 
>>>>>>>> the
>>>>>>>> pipeline itself as a composite transform which lets me get rid of the
>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach 
>>>>>>>> themselves
>>>>>>>> to the pipeline implicitly.
>>>>>>>>
>>>>>>>> That said, there are some interesting IO possibilities that would
>>>>>>>> be Swift native. In particular, I've been looking at the native Swift
>>>>>>>> binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>>>>>>> distributed in the same way as, say, Beam SQL... but it would allow 
>>>>>>>> for SQL
>>>>>>>> statements on individual files with projection pushdown supported for
>>>>>>>> things like Parquet which could have some cool and performant data lake
>>>>>>>> applications. I'll probably do a couple of the simpler IOs as
>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that would 
>>>>>>>> give
>>>>>>>> me S3 and there's a Cloud auth library as well that makes it pretty 
>>>>>>>> easy to
>>>>>>>> work with GCS.
>>>>>>>>
>>>>>>>> In any case, I'm updating the branch as I find a minute here and
>>>>>>>> there.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> B
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <
>>>>>>>> rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Neat.
>>>>>>>>>
>>>>>>>>> Nothing like writing an SDK to actually understand how the FnAPI works :)

Re: [ANNOUNCE] New committer: Ahmed Abualsaud

2023-08-24 Thread Byron Ellis via dev
Congratulations!

On Thu, Aug 24, 2023 at 5:34 PM Robert Burke  wrote:

> Congratulations Ahmed!!
>
> On Thu, Aug 24, 2023, 4:08 PM Chamikara Jayalath via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats Ahmed!!
>>
>> On Thu, Aug 24, 2023 at 4:06 PM Bruno Volpato via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations, Ahmed!
>>>
>>> Very well deserved!
>>>
>>>
>>> On Thu, Aug 24, 2023 at 6:09 PM XQ Hu via dev 
>>> wrote:
>>>
 Congratulations, Ahmed!

 On Thu, Aug 24, 2023, 5:49 PM Ahmet Altay via dev 
 wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Ahmed Abualsaud (ahmedabuals...@apache.org).
>
> Ahmed has been part of the Beam community since January 2022, working
> mostly on IO connectors, made a large amount of contributions to make Beam
> IOs more usable, performant, and reliable. And at the same time Ahmed was
> active in the user list and at the Beam summit helping users by sharing 
> his
> knowledge.
>
> Considering their contributions to the project over this timeframe,
> the Beam PMC trusts Ahmed with the responsibilities of a Beam committer. 
> [1]
>
> Thank you Ahmed! And we are looking to see more of your contributions!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>


Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Byron Ellis via dev
>>>>>>> to the pipeline implicitly.
>>>>>>>
>>>>>>> That said, there are some interesting IO possibilities that would be
>>>>>>> Swift native. In particular, I've been looking at the native Swift
>>>>>>> binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>>>>>> distributed in the same way as, say, Beam SQL... but it would allow for 
>>>>>>> SQL
>>>>>>> statements on individual files with projection pushdown supported for
>>>>>>> things like Parquet which could have some cool and performant data lake
>>>>>>> applications. I'll probably do a couple of the simpler IOs as
>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that would 
>>>>>>> give
>>>>>>> me S3 and there's a Cloud auth library as well that makes it pretty 
>>>>>>> easy to
>>>>>>> work with GCS.
>>>>>>>
>>>>>>> In any case, I'm updating the branch as I find a minute here and
>>>>>>> there.
>>>>>>>
>>>>>>> Best,
>>>>>>> B
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Neat.
>>>>>>>>
>>>>>>>> Nothing like writing an SDK to actually understand how the FnAPI
>>>>>>>> works :). I like the use of groupBy. I have to admit I'm a bit 
>>>>>>>> mystified by
>>>>>>>> the syntax for parDo (I don't know swift at all which is probably 
>>>>>>>> tripping
>>>>>>>> me up). The addition of external (cross-language) transforms could let 
>>>>>>>> you
>>>>>>>> steal everything (e.g. IOs) pretty quickly from other SDKs.
>>>>>>>>
>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>>>>>> u...@beam.apache.org> wrote:
>>>>>>>>
>>>>>>>>> For everyone who is interested, here's the draft PR:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/beam/pull/28062
>>>>>>>>>
>>>>>>>>> I haven't had a chance to test it on my M1 machine yet though
>>>>>>>>> (there's a good chance there are a few places that need to properly 
>>>>>>>>> address
>>>>>>>>> endianness. Specifically timestamps in windowed values and length in
>>>>>>>>> iterable coders as those both use specifically bigendian 
>>>>>>>>> representations)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Cham,
>>>>>>>>>>
>>>>>>>>>> Definitely happy to open a draft PR so folks can
>>>>>>>>>> comment---there's not as much code as it looks like since most of 
>>>>>>>>>> the LOC
>>>>>>>>>> is just generated protobuf. As for the support, I definitely want to 
>>>>>>>>>> add
>>>>>>>>>> external transforms and may actually add that support before adding 
>>>>>>>>>> the
>>>>>>>>>> ability to make composites in the language itself. With the way the 
>>>>>>>>>> SDK is
>>>>>>>>>> laid out adding composites to the pipeline graph is a separate 
>>>>>>>>>> operation
>>>>>>>>>> than defining a composite.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is interest in
>>>>>>>>>>> Swift SDK from folks currently subscribed to the +user list.

Re: [Request for Feedback] Swift SDK Prototype

2023-08-24 Thread Byron Ellis via dev
>>> I've got the
>>> basics of a composite transform (there's an equivalent wordcount example)
>>> and am hooking it into the pipeline generation, which should also give me
>>> everything I need to successfully hook in external transforms as well. That
>>> will give me the jump on IOs as you say. I can also treat the pipeline
>>> itself as a composite transform which lets me get rid of the Pipeline {
>>> pipeline in ... } and just instead have things attach themselves to the
>>> pipeline implicitly.
>>>
>>> That said, there are some interesting IO possibilities that would be
>>> Swift native. In particular, I've been looking at the native Swift
>>> binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>> distributed in the same way as, say, Beam SQL... but it would allow for SQL
>>> statements on individual files with projection pushdown supported for
>>> things like Parquet which could have some cool and performant data lake
>>> applications. I'll probably do a couple of the simpler IOs as
>>> well---there's a Swift AWS SDK binding that's pretty good that would give
>>> me S3 and there's a Cloud auth library as well that makes it pretty easy to
>>> work with GCS.
>>>
>>> In any case, I'm updating the branch as I find a minute here and there.
>>>
>>> Best,
>>> B
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> Neat.
>>>>
>>>> Nothing like writing an SDK to actually understand how the FnAPI works
>>>> :). I like the use of groupBy. I have to admit I'm a bit mystified by the
>>>> syntax for parDo (I don't know swift at all which is probably tripping me
>>>> up). The addition of external (cross-language) transforms could let you
>>>> steal everything (e.g. IOs) pretty quickly from other SDKs.
>>>>
>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>> u...@beam.apache.org> wrote:
>>>>
>>>>> For everyone who is interested, here's the draft PR:
>>>>>
>>>>> https://github.com/apache/beam/pull/28062
>>>>>
>>>>> I haven't had a chance to test it on my M1 machine yet though (there's
>>>>> a good chance there are a few places that need to properly address
>>>>> endianness. Specifically timestamps in windowed values and length in
>>>>> iterable coders as those both use specifically bigendian representations)
>>>>>
>>>>>
>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis 
>>>>> wrote:
>>>>>
>>>>>> Thanks Cham,
>>>>>>
>>>>>> Definitely happy to open a draft PR so folks can comment---there's
>>>>>> not as much code as it looks like since most of the LOC is just generated
>>>>>> protobuf. As for the support, I definitely want to add external 
>>>>>> transforms
>>>>>> and may actually add that support before adding the ability to make
>>>>>> composites in the language itself. With the way the SDK is laid out 
>>>>>> adding
>>>>>> composites to the pipeline graph is a separate operation than defining a
>>>>>> composite.
>>>>>>
>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> Thanks Byron. This sounds great. I wonder if there is interest in
>>>>>>> Swift SDK from folks currently subscribed to the +user
>>>>>>>  list.
>>>>>>>
>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> A couple of months ago I decided that I wanted to really understand
>>>>>>>> how the Beam FnApi works and how it interacts with the Portable 
>>>>>>>> Runner. For
>>>>>>>> me at least that usually means I need to write some code so I can see
>>>>>>>> things happening in a debugger and to really prove to myself I
>>>>>>>> understood what was going on I decided I couldn't use an existing SDK
>>>>>>>> language to do it since there would be the temptation to read some code and
>>>>>>>> convince myself that I actually understood what was going on.

Re: [Request for Feedback] Swift SDK Prototype

2023-08-23 Thread Byron Ellis via dev
>> (I don't know swift at all which is probably tripping me
>> up). The addition of external (cross-language) transforms could let you
>> steal everything (e.g. IOs) pretty quickly from other SDKs.
>>
>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>> u...@beam.apache.org> wrote:
>>
>>> For everyone who is interested, here's the draft PR:
>>>
>>> https://github.com/apache/beam/pull/28062
>>>
>>> I haven't had a chance to test it on my M1 machine yet though (there's a
>>> good chance there are a few places that need to properly address
>>> endianness. Specifically timestamps in windowed values and length in
>>> iterable coders as those both use specifically bigendian representations)
>>>
>>>
>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis 
>>> wrote:
>>>
>>>> Thanks Cham,
>>>>
>>>> Definitely happy to open a draft PR so folks can comment---there's not
>>>> as much code as it looks like since most of the LOC is just generated
>>>> protobuf. As for the support, I definitely want to add external transforms
>>>> and may actually add that support before adding the ability to make
>>>> composites in the language itself. With the way the SDK is laid out adding
>>>> composites to the pipeline graph is a separate operation than defining a
>>>> composite.
>>>>
>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>> chamik...@google.com> wrote:
>>>>
>>>>> Thanks Byron. This sounds great. I wonder if there is interest in
>>>>> Swift SDK from folks currently subscribed to the +user
>>>>>  list.
>>>>>
>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> A couple of months ago I decided that I wanted to really understand
>>>>>> how the Beam FnApi works and how it interacts with the Portable Runner. 
>>>>>> For
>>>>>> me at least that usually means I need to write some code so I can see
>>>>>> things happening in a debugger and to really prove to myself I
>>>>>> understood what was going on I decided I couldn't use an existing SDK
>>>>>> language to do it since there would be the temptation to read some code 
>>>>>> and
>>>>>> convince myself that I actually understood what was going on.
>>>>>>
>>>>>> One thing led to another and it turns out that to get a minimal FnApi
>>>>>> integration going you end up writing a fair bit of an SDK. So I decided 
>>>>>> to
>>>>>> take things to a point where I had an SDK that could execute a word count
>>>>>> example via a portable runner backend. I've now reached that point and
>>>>>> would like to submit my prototype SDK to the list for feedback.
>>>>>>
>>>>>> It's currently living in a branch on my fork here:
>>>>>>
>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>
>>>>>> At the moment it runs via the most recent XCode Beta using Swift 5.9
>>>>>> on Intel Macs, but should also work using beta builds of 5.9 for Linux
>>>>>> running on Intel hardware. I haven't had a chance to try it on ARM 
>>>>>> hardware
>>>>>> and make sure all of the endian checks are complete. The
>>>>>> "IntegrationTests.swift" file contains a word count example that reads 
>>>>>> some
>>>>>> local files (as well as a missing file to exercise DLQ functionality) and
>>>>>> output counts through two separate group by operations to get it past the
>>>>>> "map reduce" size of pipeline. I've tested it against the Python Portable
>>>>>> Runner. Since my goal was to learn FnApi there is no Direct Runner at 
>>>>>> this
>>>>>> time.
>>>>>>
>>>>>> I've shown it to a couple of folks already and incorporated some of
>>>>>> that feedback already (for example pardo was originally called dofn when
>>>>>> defining pipelines). In general I've tried to make the API as "Swift-y" 
>>>>>> as
>>>>>> possible, hence the heavy reliance on closures and while there aren't yet
>>>>>> composite PTransforms there's the beginnings of what would be needed for a
>>>>>> SwiftUI-like declarative API for creating them.

Re: [Request for Feedback] Swift SDK Prototype

2023-08-23 Thread Byron Ellis via dev
>>> and may actually add that support before adding the ability to make
>>> composites in the language itself. With the way the SDK is laid out adding
>>> composites to the pipeline graph is a separate operation than defining a
>>> composite.
>>>
>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath 
>>> wrote:
>>>
>>>> Thanks Byron. This sounds great. I wonder if there is interest in Swift
>>>> SDK from folks currently subscribed to the +user 
>>>>  list.
>>>>
>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> A couple of months ago I decided that I wanted to really understand
>>>>> how the Beam FnApi works and how it interacts with the Portable Runner. 
>>>>> For
>>>>> me at least that usually means I need to write some code so I can see
>>>>> things happening in a debugger and to really prove to myself I
>>>>> understood what was going on I decided I couldn't use an existing SDK
>>>>> language to do it since there would be the temptation to read some code 
>>>>> and
>>>>> convince myself that I actually understood what was going on.
>>>>>
>>>>> One thing led to another and it turns out that to get a minimal FnApi
>>>>> integration going you end up writing a fair bit of an SDK. So I decided to
>>>>> take things to a point where I had an SDK that could execute a word count
>>>>> example via a portable runner backend. I've now reached that point and
>>>>> would like to submit my prototype SDK to the list for feedback.
>>>>>
>>>>> It's currently living in a branch on my fork here:
>>>>>
>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>
>>>>> At the moment it runs via the most recent XCode Beta using Swift 5.9
>>>>> on Intel Macs, but should also work using beta builds of 5.9 for Linux
>>>>> running on Intel hardware. I haven't had a chance to try it on ARM 
>>>>> hardware
>>>>> and make sure all of the endian checks are complete. The
>>>>> "IntegrationTests.swift" file contains a word count example that reads 
>>>>> some
>>>>> local files (as well as a missing file to exercise DLQ functionality) and
>>>>> output counts through two separate group by operations to get it past the
>>>>> "map reduce" size of pipeline. I've tested it against the Python Portable
>>>>> Runner. Since my goal was to learn FnApi there is no Direct Runner at this
>>>>> time.
>>>>>
>>>>> I've shown it to a couple of folks already and incorporated some of
>>>>> that feedback already (for example pardo was originally called dofn when
>>>>> defining pipelines). In general I've tried to make the API as "Swift-y" as
>>>>> possible, hence the heavy reliance on closures and while there aren't yet
>>>>> composite PTransforms there's the beginnings of what would be needed for a
>>>>> SwiftUI-like declarative API for creating them.
>>>>>
>>>>> There are of course a ton of missing bits still to be implemented,
>>>>> like counters, metrics, windowing, state, timers, etc.
>>>>>
>>>>
>>>> This should be fine and we can get the code documented without these
>>>> features. I think support for composites and adding an external transform
>>>> (see, Java
>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>> Python
>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>> Go
>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>> TypeScript
>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>> to add support for multi-lang will bring in a lot of features (for example,
>>>> I/O connectors) for free.
>>>>
>>>>
>>>>>
>>>>> Any and all feedback welcome and happy to submit a PR if folks are
>>>>> interested, though the "Swift Way" would be to have it in its own repo so
>>>>> that it can easily be used from the Swift Package Manager.
>>>>>
>>>>
>>>> +1 for creating a PR (may be as a draft initially). Also it'll be
>>>> easier to comment on a PR :)
>>>>
>>>> - Cham
>>>>
>>>> [1]
>>>> [2]
>>>> [3]
>>>>
>>>>
>>>>>
>>>>> Best,
>>>>> B
>>>>>
>>>>>
>>>>>


Re: [Request for Feedback] Swift SDK Prototype

2023-08-18 Thread Byron Ellis via dev
For everyone who is interested, here's the draft PR:

https://github.com/apache/beam/pull/28062

I haven't had a chance to test it on my M1 machine yet though (there's a
good chance there are a few places that need to properly address
endianness. Specifically timestamps in windowed values and length in
iterable coders as those both use specifically bigendian representations)


On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis  wrote:

> Thanks Cham,
>
> Definitely happy to open a draft PR so folks can comment---there's not as
> much code as it looks like since most of the LOC is just generated
> protobuf. As for the support, I definitely want to add external transforms
> and may actually add that support before adding the ability to make
> composites in the language itself. With the way the SDK is laid out adding
> composites to the pipeline graph is a separate operation than defining a
> composite.
>
> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath 
> wrote:
>
>> Thanks Byron. This sounds great. I wonder if there is interest in Swift
>> SDK from folks currently subscribed to the +user 
>>  list.
>>
>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> A couple of months ago I decided that I wanted to really understand how
>>> the Beam FnApi works and how it interacts with the Portable Runner. For me
>>> at least that usually means I need to write some code so I can see things
>>> happening in a debugger and to really prove to myself I understood what was
>>> going on I decided I couldn't use an existing SDK language to do it since
>>> there would be the temptation to read some code and convince myself that I
>>> actually understood what was going on.
>>>
>>> One thing led to another and it turns out that to get a minimal FnApi
>>> integration going you end up writing a fair bit of an SDK. So I decided to
>>> take things to a point where I had an SDK that could execute a word count
>>> example via a portable runner backend. I've now reached that point and
>>> would like to submit my prototype SDK to the list for feedback.
>>>
>>> It's currently living in a branch on my fork here:
>>>
>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>
>>> At the moment it runs via the most recent XCode Beta using Swift 5.9 on
>>> Intel Macs, but should also work using beta builds of 5.9 for Linux running
>>> on Intel hardware. I haven't had a chance to try it on ARM hardware and
>>> make sure all of the endian checks are complete. The
>>> "IntegrationTests.swift" file contains a word count example that reads some
>>> local files (as well as a missing file to exercise DLQ functionality) and
>>> output counts through two separate group by operations to get it past the
>>> "map reduce" size of pipeline. I've tested it against the Python Portable
>>> Runner. Since my goal was to learn FnApi there is no Direct Runner at this
>>> time.
>>>
>>> I've shown it to a couple of folks already and incorporated some of that
>>> feedback already (for example pardo was originally called dofn when
>>> defining pipelines). In general I've tried to make the API as "Swift-y" as
>>> possible, hence the heavy reliance on closures and while there aren't yet
>>> composite PTransforms there's the beginnings of what would be needed for a
>>> SwiftUI-like declarative API for creating them.
>>>
>>> There are of course a ton of missing bits still to be implemented, like
>>> counters, metrics, windowing, state, timers, etc.
>>>
>>
>> This should be fine and we can get the code documented without these
>> features. I think support for composites and adding an external transform
>> (see, Java
>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>> Python
>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>> Go
>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>> TypeScript
>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>> to add support for multi-lang will bring in a lot of features (for example,
>> I/O connectors) for free.
>>
>>
>>>
>>> Any and all feedback welcome and happy to submit a PR if folks are
>>> interested, though the "Swift Way" would be to have it in its own repo so
>>> that it can easily be used from the Swift Package Manager.
>>>
>>
>> +1 for creating a PR (may be as a draft initially). Also it'll be easier
>> to comment on a PR :)
>>
>> - Cham
>>
>> [1]
>> [2]
>> [3]
>>
>>
>>>
>>> Best,
>>> B
>>>
>>>
>>>


Re: [Request for Feedback] Swift SDK Prototype

2023-08-17 Thread Byron Ellis via dev
Thanks Cham,

Definitely happy to open a draft PR so folks can comment---there's not as
much code as it looks like since most of the LOC is just generated
protobuf. As for the support, I definitely want to add external transforms
and may actually add that support before adding the ability to make
composites in the language itself. With the way the SDK is laid out, adding
composites to the pipeline graph is a separate operation from defining a
composite.

On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath 
wrote:

> Thanks Byron. This sounds great. I wonder if there is interest in Swift
> SDK from folks currently subscribed to the +user 
>  list.
>
> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev 
> wrote:
>
>> Hello everyone,
>>
>> A couple of months ago I decided that I wanted to really understand how
>> the Beam FnApi works and how it interacts with the Portable Runner. For me
>> at least that usually means I need to write some code so I can see things
>> happening in a debugger and to really prove to myself I understood what was
>> going on I decided I couldn't use an existing SDK language to do it since
>> there would be the temptation to read some code and convince myself that I
>> actually understood what was going on.
>>
>> One thing led to another and it turns out that to get a minimal FnApi
>> integration going you end up writing a fair bit of an SDK. So I decided to
>> take things to a point where I had an SDK that could execute a word count
>> example via a portable runner backend. I've now reached that point and
>> would like to submit my prototype SDK to the list for feedback.
>>
>> It's currently living in a branch on my fork here:
>>
>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>
>> At the moment it runs via the most recent XCode Beta using Swift 5.9 on
>> Intel Macs, but should also work using beta builds of 5.9 for Linux running
>> on Intel hardware. I haven't had a chance to try it on ARM hardware and
>> make sure all of the endian checks are complete. The
>> "IntegrationTests.swift" file contains a word count example that reads some
>> local files (as well as a missing file to exercise DLQ functionality) and
>> output counts through two separate group by operations to get it past the
>> "map reduce" size of pipeline. I've tested it against the Python Portable
>> Runner. Since my goal was to learn FnApi there is no Direct Runner at this
>> time.
>>
>> I've shown it to a couple of folks already and incorporated some of that
>> feedback already (for example pardo was originally called dofn when
>> defining pipelines). In general I've tried to make the API as "Swift-y" as
>> possible, hence the heavy reliance on closures and while there aren't yet
>> composite PTransforms there's the beginnings of what would be needed for a
>> SwiftUI-like declarative API for creating them.
>>
>> There are of course a ton of missing bits still to be implemented, like
>> counters, metrics, windowing, state, timers, etc.
>>
>
> This should be fine and we can get the code documented without these
> features. I think support for composites and adding an external transform
> (see, Java
> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
> Python
> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
> Go
> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
> TypeScript
> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
> to add support for multi-lang will bring in a lot of features (for example,
> I/O connectors) for free.
>
>
>>
>> Any and all feedback welcome and happy to submit a PR if folks are
>> interested, though the "Swift Way" would be to have it in its own repo so
>> that it can easily be used from the Swift Package Manager.
>>
>
> +1 for creating a PR (may be as a draft initially). Also it'll be easier
> to comment on a PR :)
>
> - Cham
>
> [1]
> [2]
> [3]
>
>
>>
>> Best,
>> B
>>
>>
>>


The Swift SDK now works on Linux

2023-08-17 Thread Byron Ellis via dev
Hello everyone,

I got a chance to test and fix a small issue that prevented the Swift SDK
from working on Linux boxes due to differences in the way Data is handled
in non-macOS Foundation implementations (which hopefully will stop being an
issue with with the native-Swift Foundation revamp). I also added the
generated protobuf similar to how we do with Go so people don't have to do
that on their own. You should now be able to do a checkout of the branch
and run "swift test" if you have a portable runner in the background on
port 8073 on Linux boxes (which also makes it more plausible this SDK would
work on, say, Dataflow)

This was tested with the August 11 release of the Swift 5.9 compiler.

Best,
B


[Request for Feedback] Swift SDK Prototype

2023-08-16 Thread Byron Ellis via dev
Hello everyone,

A couple of months ago I decided that I wanted to really understand how the
Beam FnApi works and how it interacts with the Portable Runner. For me at
least that usually means I need to write some code so I can see things
happening in a debugger and to really prove to myself I understood what was
going on I decided I couldn't use an existing SDK language to do it since
there would be the temptation to read some code and convince myself that I
actually understood what was going on.

One thing led to another and it turns out that to get a minimal FnApi
integration going you end up writing a fair bit of an SDK. So I decided to
take things to a point where I had an SDK that could execute a word count
example via a portable runner backend. I've now reached that point and
would like to submit my prototype SDK to the list for feedback.

It's currently living in a branch on my fork here:

https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift

At the moment it runs via the most recent XCode Beta using Swift 5.9 on
Intel Macs, but should also work using beta builds of 5.9 for Linux running
on Intel hardware. I haven't had a chance to try it on ARM hardware and
make sure all of the endian checks are complete. The
"IntegrationTests.swift" file contains a word count example that reads some
local files (as well as a missing file to exercise DLQ functionality) and
output counts through two separate group by operations to get it past the
"map reduce" size of pipeline. I've tested it against the Python Portable
Runner. Since my goal was to learn FnApi there is no Direct Runner at this
time.

I've shown it to a couple of folks already and incorporated some of that
feedback already (for example pardo was originally called dofn when
defining pipelines). In general I've tried to make the API as "Swift-y" as
possible, hence the heavy reliance on closures and while there aren't yet
composite PTransforms there's the beginnings of what would be needed for a
SwiftUI-like declarative API for creating them.

There are of course a ton of missing bits still to be implemented, like
counters, metrics, windowing, state, timers, etc.

Any and all feedback welcome and happy to submit a PR if folks are
interested, though the "Swift Way" would be to have it in its own repo so
that it can easily be used from the Swift Package Manager.

Best,
B


Re: [DISCUSS] Upgrade vended guava version

2023-07-31 Thread Byron Ellis via dev
It's not designed to be merged, but there is a PR with this change in it---

https://github.com/apache/beam/pull/27695/files#diff-0435a83a413ec063bf7e682cadcd56776cd18fc878f197cc99a65fc231ef2047



On Mon, Jul 31, 2023 at 10:07 AM Ahmet Altay via dev 
wrote:

> Hi Hong,
>
> Thank you for reaching out and thank you for offering to help. If you can
> start the PR and do the testing, one of the committers could help with the
> process.
>
> Thank you!
> Ahmet
>
> On Mon, Jul 31, 2023 at 9:13 AM Hong Teoh  wrote:
>
>> Hi all,
>>
>> The current version of guava that is vended in Beam is
>> com.google.guava:guava:26.0-jre.
>>
>> This version is really old, and has active vulnerabilities [1] [2]
>> [1] https://mvnrepository.com/artifact/com.google.guava/guava/26.0-jre
>> [2] CVE-2023-2976
>> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-2976
>> [3] CVE-2020-8908
>> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-8908
>>
>> Is there anyone else keen on upgrading the vended guava version to match
>> the guava version of 32.1.1-jre ? [4]
>> [4]
>> https://github.com/apache/beam/blame/df6964aac62a521081481b21c96ecd506ea3c503/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L542
>>
>> I am happy to contribute the PR to upgrade the guava dependencies in the
>> Beam repository, but I would need a committer to drive the release of the
>> vended version first! [5]
>> [5]
>> https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit#heading=h.vhcuqlttpnog
>>
>>
>> Side question: Does anyone know why we have libraries that use the
>> non-vended guava version? [6]
>> [6]
>> https://github.com/search?q=repo%3Aapache%2Fbeam%20library.java.guava=code
>>
>
> @Kenneth Knowles  - might know.
>
>
>>
>>
>>
>> Regards,
>> Hong
>>
>
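
For readers who haven't bumped into the term before, "vended" here means
vendored: guava relocated under Beam's own package and shipped as a separate
artifact (beam-vendor-guava-26_0-jre for the version above), which is why
bumping it needs its own release as mentioned in [5]. A minimal illustration
follows; the class chosen and the example values are arbitrary.

// Internal Beam code imports the relocated copy, so it can never clash with
// whatever guava version a user's pipeline pulls in directly.
import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;

public class VendoredGuavaSketch {
  public static void main(String[] args) {
    // Same API as plain guava (com.google.common.collect.ImmutableList from
    // com.google.guava:guava), just under a relocated package name; upgrading
    // the vendored version therefore means cutting a new relocated artifact.
    ImmutableList<String> guavaVersions = ImmutableList.of("26.0-jre", "32.1.1-jre");
    System.out.println(guavaVersions);
  }
}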


Re: [DISCUSS] Enable Github Discussions?

2023-07-03 Thread Byron Ellis via dev
-1. This just leads to needless fragmentation not to mention being at the
mercy of a specific technology provider.

On Mon, Jul 3, 2023 at 11:39 AM XQ Hu via dev  wrote:

> +1 with GH discussion.
> If Airflow can do this https://github.com/apache/airflow/discussions, I
> think we can do this as well.
>
> On Mon, Jul 3, 2023 at 9:51 AM Alexey Romanenko 
> wrote:
>
>> -1
>> I understand that for some people, who maybe are not very familiar with
>> ASF and its “Apache Way” [1], it may sound a bit obsolete but mailing lists
>> are one of the key things of every ASF project which Apache Beam is. Having
>> user@, dev@ and commits@ lists are required for ASF project to maintain
>> the open discussions that are publicly accessible and archived in the same
>> way for all ASF projects.
>>
>> I just wanted to remind a key motto at Apache Software Foundation is:
>>   *“If it didn't happen on the mailing list, it didn't happen.”*
>>
>> —
>> Alexey
>>
>> [1] https://apache.org/theapacheway/index.html
>>
>> On 1 Jul 2023, at 19:54, Anand Inguva via dev 
>> wrote:
>>
>> +1 for GitHub discussions as well. But I am also a little concerned about
>> multiple places for discussions. As Danny said, if we have a good plan on
>> how to move forward on how/when to archive the current mailing list, that
>> would be great.
>>
>> Thanks,
>> Anand
>>
>> On Sat, Jul 1, 2023, 3:21 AM Damon Douglas 
>> wrote:
>>
>>> I'm very strong +1 for replacing the use of Email with GitHub
>>> Discussions. Thank you for bringing this up.
>>>
>>> On Fri, Jun 30, 2023 at 7:38 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Thanks for starting this discussion!

 I'm a weak -1 for this proposal. While I think that GH Discussions can
 be a good forum, I think most of the things that Discussions do are covered
 by some combination of the dev/user lists and GitHub issues, and the net
 outcome of this will be creating one more forum to pay attention to. I know
 in the past we've had a hard time keeping up with Stack overflow questions
 for a similar reason. With that said, I'm not opposed to trying it out and
 experimenting as long as we have (a) clear criteria for understanding if
 the change is effective or not (can be subjective), (b) a clear idea of
 when we'd revisit the discussion, and (c) a clear path to rollback the
 decision without it being *too *much work (this might mean something
 like disabling future discussions and keeping the history or somehow moving
 the history to the dev or user list). If we do this, I also think we should
 update https://beam.apache.org/community/contact-us/ with a clear
 taxonomy of what goes where (this is what I'm unsure of today).

 FWIW, if we were proposing cutting either the user list or both the
 user and dev list in favor of discussions, I would be +1. I do think the
 advantages of discussions over email are real (threaded, easy to convert
 to/from issues, markdown, one place for all things Beam).

 Thanks,
 Danny

 On Fri, Jun 30, 2023 at 10:23 AM Svetak Sundhar via dev <
 dev@beam.apache.org> wrote:

> Hi all,
>
> I wanted to start a discussion to gauge interest on enabling Github
> Discussions  in
> Apache Beam.
>
> Pros:
> + GH Discussions allows for folks to get unblocked on small/medium
> implementation blocker (Google employees can often get this help by
> scheduling a call with teammates whereas there is a larger barrier for
> non-Google employees to get this help).
> + On the above point, more visibility into the development blockers
> that others have previously faced.
> + GH Discussions is more discoverable and approachable for new users
> and contributors.
> + A centralized place to have discussions. Long term, it makes sense
> to eventually fully migrate to GH Discussions.
>
> Cons:
> - For a period of time when we use both the dev list and GH
> Discussions, context can be confusing.
> - Anything else?
>
> To be clear, I’m not advocating that we move off the dev list
> immediately. I propose that over time we slowly start moving discussions
> over to GH discussions, utilizing things such as the poll feature.
>
> I am aware that the Airflow project [1] uses both GH Discussions today
> and a dev@ list [2] today.
>
> [1] https://github.com/apache/airflow/discussions
> [2] https://lists.apache.org/list.html?d...@airflow.apache.org
>
> Thanks,
>
> Svetak Sundhar
>   Data Engineer
> s vetaksund...@google.com
>
>
>>


Re: Calcite to Beam mapping, possible?

2023-06-18 Thread Byron Ellis via dev
Not knowing what your custom rules are intended to do, my first question
would be “do you need them if you can just write SQL?” My second thought
would be that if they implement some part of SQL that SqlTransform doesn’t,
perhaps you could contribute them. My third thought would be looking at the
SqlTransform code and using it as a basis for a CalciteTransform.

On Sat, Jun 17, 2023 at 11:59 PM Soumyadeep Mukhopadhyay <
soumyamy...@gmail.com> wrote:

> Hey Byron,
>
> Thanks for your response. My apologies for getting back to you so late.
>
> SqlTransform does seem a potent candidate for my use case but I am unsure
> about one thing - what happens to the custom planner rules that we were
> hoping to build? Shall we rely solely on Beam's implementation of the
> planner rules or is there a way to integrate our own rules too?
>
> If the answer to both questions above is no, then the only way I see I
> could use Calcite with Beam would be as a query parsing layer (the parser
> being Calcite) and passing on the SqlString.toString() into SqlTransform
> from the resulting SqlNode. Were you approaching from that point of view?
>
> Thanks again for your help!
>
> With regards,
> Soumyadeep Mukhopadhyay.
>
>
>
>
> On Wed, Jun 14, 2023 at 7:47 PM deepak kumar  wrote:
>
>> I would be really interested to contribute as well on calcite to
>> PTransform kind of conversions
>>
>> On Wed, Jun 14, 2023 at 7:03 PM Byron Ellis via dev 
>> wrote:
>>
>>> Any particular reason you're not using the SqlTransform to do that?
>>> That'd probably be a good place to start code-wise (perhaps with something
>>> a bit newer than 2.29) to see how Beam uses Calcite today to produce a
>>> PTransform from a query plan.
>>>
>>> On Wed, Jun 14, 2023 at 5:32 AM Soumyadeep Mukhopadhyay <
>>> soumyamy...@gmail.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I have a use case where I need Apache Calcite to parse, validate the
>>>> query and then apply some planner rules.
>>>>
>>>> I was hoping to leverage the capabilities of the Apache Beam on top of
>>>> Calcite so that I can build on top of what we have already done, and apply
>>>> it at scale. One approach that came to mind was to either convert the
>>>> RelNode to something equivalent in Beam (like PTransform maybe?
>>>> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/extensions/sql/impl/rel/BeamRelNode.html
>>>> ).
>>>>
>>>> Any opinion/suggestion on this will be really appreciated! Thanks for
>>>> your time and consideration! If you feel this is not intended forum for me
>>>> to ask, please let me know! :)
>>>>
>>>> With regards,
>>>> Soumyadeep Mukhopadhyay.
>>>>
>>>


Re: Calcite to Beam mapping, possible?

2023-06-14 Thread Byron Ellis via dev
Any particular reason you're not using the SqlTransform to do that? That'd
probably be a good place to start code-wise (perhaps with something a bit
newer than 2.29) to see how Beam uses Calcite today to produce a PTransform
from a query plan.
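
To make that concrete, here is a minimal sketch of handing a schema'd
PCollection to SqlTransform and letting Beam SQL do the Calcite parsing,
validation, and planning; the schema, values, and query are invented for
illustration.

// Sketch only. "PCOLLECTION" is the table name Beam SQL gives a single input.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class SqlTransformSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    Schema schema = Schema.builder().addStringField("name").addInt64Field("amount").build();

    PCollection<Row> rows =
        pipeline
            .apply(Create.of(
                Row.withSchema(schema).addValues("a", 10L).build(),
                Row.withSchema(schema).addValues("b", 5L).build()))
            .setRowSchema(schema);

    // Calcite parses, validates, and plans the query; the result is just another
    // PCollection<Row> for downstream transforms to consume.
    PCollection<Row> totals = rows.apply(
        SqlTransform.query("SELECT name, SUM(amount) AS total FROM PCOLLECTION GROUP BY name"));

    pipeline.run().waitUntilFinish();
  }
}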

On Wed, Jun 14, 2023 at 5:32 AM Soumyadeep Mukhopadhyay <
soumyamy...@gmail.com> wrote:

> Hello All,
>
> I have a use case where I need Apache Calcite to parse, validate the query
> and then apply some planner rules.
>
> I was hoping to leverage the capabilities of the Apache Beam on top of
> Calcite so that I can build on top of what we have already done, and apply
> it at scale. One approach that came to mind was to either convert the
> RelNode to something equivalent in Beam (like PTransform maybe?
> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/extensions/sql/impl/rel/BeamRelNode.html
> ).
>
> Any opinion/suggestion on this will be really appreciated! Thanks for your
> time and consideration! If you feel this is not intended forum for me to
> ask, please let me know! :)
>
> With regards,
> Soumyadeep Mukhopadhyay.
>


Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Sure, I get that… though perhaps we should consider just going to something
Avro for portable coding rather than something custom.

On Tue, May 30, 2023 at 2:22 PM Chamikara Jayalath 
wrote:

> Input/output PCollection types at least have to be portable Beam types [1]
> for cross-language to work.
>
> I think we restricted schema-aware transforms to PCollection<Row> since
> Row was expected to be an efficient replacement for arbitrary portable Beam
> types (not sure how true that is in practice currently).
>
> Thanks,
> Cham
>
> [1]
> https://github.com/apache/beam/blob/b9730952a7abf60437ee85ba2df6dd30556d6560/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L829
>
> On Tue, May 30, 2023 at 1:54 PM Byron Ellis  wrote:
>
>> Is it actually necessary for a PTransform that is configured via the
>> Schema mechanism to also be one that uses RowCoder? Those strike me as two
>> separate concerns and unnecessarily limiting.
>>
>> On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
>> wrote:
>>
>>> +1 for the simplification.
>>>
>>> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
>>> wrote:
>>>
 Yeah. Essentially one needs to do (1) name the arguments and (2) implement
 the transform. Hopefully (1) could be done in a concise way that allows for
 easy consumption from both Java and cross-language.

>>>
>>> +1 but I think the hard part today is to convert existing PTransforms to
>>> be schema-aware transform compatible (for example, change input/output
>>> types and make sure parameters take Beam Schema compatible types). But this
>>> makes sense for new transforms.
>>>
>>>
>>>
 On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
 wrote:

> Or perhaps the other way around? If you have a Schema we can
> auto-generate the associated builder on the PTransform? Either way, more
> DRY.
>
> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to this simplification, it's a historical artifact that provides
>> no value.
>>
>> I would love it if we also looked into ways to auto-generate the
>> SchemaTransformProvider (e.g. via introspection if a transform takes a
>> small number of arguments, or uses the standard builder pattern...),
>> ideally with something as simple as adding a decorator to the PTransform
>> itself.
>>
>>
>> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hey everyone,
>>>
>>> I was looking at how we use SchemaTransforms in our expansion
>>> service. From what I see, there may be a redundant step in developing
>>> SchemaTransforms. Currently, we have 3 pieces:
>>> - SchemaTransformProvider [1]
>>> - A configuration object
>>> - SchemaTransform [2]
>>>
>>> The API is generally used like this:
>>> 1. The SchemaTransformProvider takes a configuration object and
>>> returns a SchemaTransform
>>> 2. The SchemaTransform is used to build a PTransform according to
>>> the configuration
>>>
>>> In these steps, the SchemaTransform class seems unnecessary. We can
>>> combine the two steps if we have SchemaTransformProvider return the
>>> PTransform directly.
>>>
>>> We can then remove the SchemaTransform class as it will be obsolete.
>>> This should be safe to do; the only place it's used in our API is here 
>>> [3],
>>> and that can be simplified if we make this change (we'd just trim `
>>> .buildTransform()` off the end as `provider.from(configRow)` will
>>> directly return the PTransform).
>>>
>>> I'd like to first mention that I was not involved in the design
>>> process of this API so I may be missing some information on why it was 
>>> set
>>> up this way.
>>>
>>> A few developers already raised questions about how there's
>>> seemingly unnecessary boilerplate involved in making a Java transform
>>> portable. I wasn't involved in the design process of this API so I may 
>>> be
>>> missing some information, but my assumption is this was designed to 
>>> follow
>>> the pattern of the previous iteration of this API (SchemaIO):
>>> SchemaIOProvider[4] -> SchemaIO[5] -> PTransform. However, with the
>>> newer SchemaTransformProvider API, we dropped a few methods and reduced 
>>> the
>>> SchemaTransform class to have a generic buildTransform() method. See the
>>> example of PubsubReadSchemaTransformProvider [6], where the
>>> SchemaTransform interface and buildTransform method are implemented
>>> just to satisfy the requirement that SchemaTransformProvider::from
>>> return a SchemaTransform.
>>>
>>> I'm bringing this up because if we are looking to encourage
>>> contribution to cross-language use cases, we should make it simpler and
>>> less convoluted to develop portable transforms.
>>>
>>> 

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Is it actually necessary for a PTransform that is configured via the Schema
mechanism to also be one that uses RowCoder? Those strike me as two
separate concerns and unnecessarily limiting.

On Tue, May 30, 2023 at 1:29 PM Chamikara Jayalath 
wrote:

> +1 for the simplification.
>
> On Tue, May 30, 2023 at 12:33 PM Robert Bradshaw 
> wrote:
>
>> Yeah. Essentially one needs to do (1) name the arguments and (2) implement
>> the transform. Hopefully (1) could be done in a concise way that allows for
>> easy consumption from both Java and cross-language.
>>
>
> +1 but I think the hard part today is to convert existing PTransforms to
> be schema-aware transform compatible (for example, change input/output
> types and make sure parameters take Beam Schema compatible types). But this
> makes sense for new transforms.
>
>
>
>> On Tue, May 30, 2023 at 12:25 PM Byron Ellis 
>> wrote:
>>
>>> Or perhaps the other way around? If you have a Schema we can
>>> auto-generate the associated builder on the PTransform? Either way, more
>>> DRY.
>>>
>>> On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 to this simplification, it's a historical artifact that provides no
 value.

 I would love it if we also looked into ways to auto-generate the
 SchemaTransformProvider (e.g. via introspection if a transform takes a
 small number of arguments, or uses the standard builder pattern...),
 ideally with something as simple as adding a decorator to the PTransform
 itself.


 On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
 dev@beam.apache.org> wrote:

> Hey everyone,
>
> I was looking at how we use SchemaTransforms in our expansion service.
> From what I see, there may be a redundant step in developing
> SchemaTransforms. Currently, we have 3 pieces:
> - SchemaTransformProvider [1]
> - A configuration object
> - SchemaTransform [2]
>
> The API is generally used like this:
> 1. The SchemaTransformProvider takes a configuration object and
> returns a SchemaTransform
> 2. The SchemaTransform is used to build a PTransform according to the
> configuration
>
> In these steps, the SchemaTransform class seems unnecessary. We can
> combine the two steps if we have SchemaTransformProvider return the
> PTransform directly.
>
> We can then remove the SchemaTransform class as it will be obsolete.
> This should be safe to do; the only place it's used in our API is here 
> [3],
> and that can be simplified if we make this change (we'd just trim `
> .buildTransform()` off the end as `provider.from(configRow)` will
> directly return the PTransform).
>
> I'd like to first mention that I was not involved in the design
> process of this API so I may be missing some information on why it was set
> up this way.
>
> A few developers already raised questions about how there's seemingly
> unnecessary boilerplate involved in making a Java transform portable. I
> wasn't involved in the design process of this API so I may be missing some
> information, but my assumption is this was designed to follow the pattern
> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
> SchemaIO[5] -> PTransform. However, with the newer
> SchemaTransformProvider API, we dropped a few methods and reduced the
> SchemaTransform class to have a generic buildTransform() method. See the
> example of PubsubReadSchemaTransformProvider [6], where the
> SchemaTransform interface and buildTransform method are implemented
> just to satisfy the requirement that SchemaTransformProvider::from
> return a SchemaTransform.
>
> I'm bringing this up because if we are looking to encourage
> contribution to cross-language use cases, we should make it simpler and
> less convoluted to develop portable transforms.
>
> There are a number of SchemaTransforms already developed, but applying
> these changes to them should be straightforward. If people think this is a
> good idea, I can open a PR and implement them.
>
> Best,
> Ahmed
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
> [3]
> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
> [4]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
> [5]
> 

Re: Proposal to reduce the steps to make a Java transform portable

2023-05-30 Thread Byron Ellis via dev
Or perhaps the other way around? If you have a Schema we can auto-generate
the associated builder on the PTransform? Either way, more DRY.

On Tue, May 30, 2023 at 10:59 AM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:

> +1 to this simplification, it's a historical artifact that provides no
> value.
>
> I would love it if we also looked into ways to auto-generate the
> SchemaTransformProvider (e.g. via introspection if a transform takes a
> small number of arguments, or uses the standard builder pattern...),
> ideally with something as simple as adding a decorator to the PTransform
> itself.
>
>
> On Tue, May 30, 2023 at 7:42 AM Ahmed Abualsaud via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone,
>>
>> I was looking at how we use SchemaTransforms in our expansion service.
>> From what I see, there may be a redundant step in developing
>> SchemaTransforms. Currently, we have 3 pieces:
>> - SchemaTransformProvider [1]
>> - A configuration object
>> - SchemaTransform [2]
>>
>> The API is generally used like this:
>> 1. The SchemaTransformProvider takes a configuration object and returns a
>> SchemaTransform
>> 2. The SchemaTransform is used to build a PTransform according to the
>> configuration
>>
>> In these steps, the SchemaTransform class seems unnecessary. We can
>> combine the two steps if we have SchemaTransformProvider return the
>> PTransform directly.
>>
>> We can then remove the SchemaTransform class as it will be obsolete. This
>> should be safe to do; the only place it's used in our API is here [3], and
>> that can be simplified if we make this change (we'd just trim `
>> .buildTransform()` off the end as `provider.from(configRow)` will
>> directly return the PTransform).
>>
>> I'd like to first mention that I was not involved in the design process
>> of this API so I may be missing some information on why it was set up this
>> way.
>>
>> A few developers already raised questions about how there's seemingly
>> unnecessary boilerplate involved in making a Java transform portable. I
>> wasn't involved in the design process of this API so I may be missing some
>> information, but my assumption is this was designed to follow the pattern
>> of the previous iteration of this API (SchemaIO): SchemaIOProvider[4] ->
>> SchemaIO[5] -> PTransform. However, with the newer
>> SchemaTransformProvider API, we dropped a few methods and reduced the
>> SchemaTransform class to have a generic buildTransform() method. See the
>> example of PubsubReadSchemaTransformProvider [6], where the
>> SchemaTransform interface and buildTransform method are implemented just
>> to satisfy the requirement that SchemaTransformProvider::from return a
>> SchemaTransform.
>>
>> I'm bringing this up because if we are looking to encourage contribution
>> to cross-language use cases, we should make it simpler and less convoluted
>> to develop portable transforms.
>>
>> There are a number of SchemaTransforms already developed, but applying
>> these changes to them should be straightforward. If people think this is a
>> good idea, I can open a PR and implement them.
>>
>> Best,
>> Ahmed
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransformProvider.java
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/SchemaTransform.java
>> [3]
>> https://github.com/apache/beam/blob/d7ded3f07064919c202c81a2c786910e20a834f9/sdks/java/expansion-service/src/main/java/org/apache/beam/sdk/expansion/service/ExpansionServiceSchemaTransformProvider.java#L138
>> [4]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIOProvider.java
>> [5]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/io/SchemaIO.java
>> [6]
>> https://github.com/apache/beam/blob/ed1a297904d5f5c743a6aca1a7648e3fb8f02e18/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubReadSchemaTransformProvider.java#L133-L137
>>
>
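
To make the proposed simplification concrete, here is a rough sketch of what
a provider might look like if from(...) returned the PTransform directly.
This is illustrative only: it reflects the proposal rather than the current
API, and MyConfig plus the expand() body are placeholders.

import org.apache.beam.sdk.schemas.transforms.TypedSchemaTransformProvider;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollectionRowTuple;

public class MyFilterProvider extends TypedSchemaTransformProvider<MyConfig> {

  @Override
  public String identifier() {
    return "beam:schematransform:org.example:my_filter:v1";
  }

  // Today this must return a SchemaTransform whose buildTransform() yields
  // the PTransform; under the proposal it returns the PTransform itself.
  @Override
  public PTransform<PCollectionRowTuple, PCollectionRowTuple> from(MyConfig config) {
    return new PTransform<PCollectionRowTuple, PCollectionRowTuple>() {
      @Override
      public PCollectionRowTuple expand(PCollectionRowTuple input) {
        // Placeholder: apply the configured logic to input.get("input").
        return input;
      }
    };
  }
}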


Re: Thoughts on coder evolution

2023-05-03 Thread Byron Ellis via dev
I think I'm not understanding the use case here. Are we talking about the
encoding of data in motion (e.g. between stages of a streaming pipeline) or
data at rest (e.g. input formats and output formats)? Or maybe something
else?

On Wed, May 3, 2023 at 6:58 AM Jan Lukavský  wrote:

> Hi,
>
> I'd like to discuss a topic that from time to time appears in different
> contexts (e.g. [1]). I'd like to restate the problem in a slightly more
> generic way as: "Should we have a way to completely exchange coder of a
> PCollection/state of a _running_ Pipeline?". First my motivation for
> this question - Beam has an extension called
> beam-sdks-java-extensions-kryo, which contains a KryoCoder. This coder
> uses Kryo [2] to serialize virtually any Java class into binary format.
> Unfortunately, this binary representation differs between Kryo versions
> and it does not contain any way to recognize which version of Kryo was
> used to serialize the data. Attempt to deserialize bytes produced by
> incompatible version of Kryo results in an exception. The current
> version of Kryo that is used by the KryoCoder is already more than 5
> years old and an upgrade to a newer version is needed, because the current
> version does not work with JDK17+ [3]. Thus, the only option seems to be
> the creation of a different Coder (e.g. Kryo5Coder), but then we need
> the ability to transfer Pipelines using the old KryoCoder to the newer
> one. That is, we need to completely switch coder that encodes
> PCollection and/or state.
>
> We have therefore the following options:
>
>   1) Simply ignore this and let users rerun the Pipeline from scratch.
> This is possible and should essentially always be applicable, but if nothing
> else, for some Pipelines it might be costly to reprocess all historical
> data.
>
>   2) We can create the new Coder and let users use a runner-specific way
> to convert the Pipeline. E.g. in case of Flink, this could be done by
> converting savepoint into the new format. This requires knowledge of how
> Beam stores state (namespaces) and is kind of involved on the user side.
> We could probably provide runner-specific tools for this, but some
> runners, in general, might not allow such state manipulation.
>
>   3) We can include the information of a Coder update into the Pipeline
> and resubmit it to the runner and let the runner handle it. Upon
> Pipeline restart, a runner would have to convert all state and all
> inflight data from the old Coder to the new one, before resuming the
> Pipeline.
>
> Option 3) seems like the most natural, but it requires support on the
> runner side.
>
> I leave the details on how a runner would do this open, I'm currently
> interested in knowing what is the community's position on this.
>
>   Jan
>
> [1] https://lists.apache.org/thread/z2m1hg4l5k2kb7nhjkv2lnwf8g4t9wps
>
> [2] https://github.com/EsotericSoftware/kryo
>
> [3] https://github.com/EsotericSoftware/kryo/issues/885
>
>
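
For context, the replacement coder itself is the easy part. A bare-bones
Kryo 5 coder is roughly the sketch below (illustrative only; it ignores the
Kryo instance pooling, class registration, and serializability handling the
real extension would need). The hard part remains migrating state and
in-flight data that was already encoded with the old coder.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CustomCoder;

public class Kryo5Coder<T> extends CustomCoder<T> {
  private transient Kryo kryo;

  private Kryo kryo() {
    if (kryo == null) {
      kryo = new Kryo();
      // Allow unregistered classes, mirroring the old coder's behavior.
      kryo.setRegistrationRequired(false);
    }
    return kryo;
  }

  @Override
  public void encode(T value, OutputStream outStream) {
    Output output = new Output(outStream);
    kryo().writeClassAndObject(output, value);
    output.flush();
  }

  @Override
  @SuppressWarnings("unchecked")
  public T decode(InputStream inStream) {
    return (T) kryo().readClassAndObject(new Input(inStream));
  }
}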


Re: [ANNOUNCE] New committer: Damon Douglas

2023-04-24 Thread Byron Ellis via dev
Congrats Damon!

On Mon, Apr 24, 2023 at 8:57 PM Austin Bennett  wrote:

> thanks for all you do @Damon Douglas  !
>
> On Mon, Apr 24, 2023 at 1:00 PM Robert Burke  wrote:
>
>> Congratulations Damon!!!
>>
>> On Mon, Apr 24, 2023, 12:52 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Damon Douglas (damondoug...@apache.org)
>>>
>>> Damon has contributed widely: Beam Katas, playground, infrastructure,
>>> and many IO connectors. Damon does lots of code review in addition to code.
>>> (yes, you can review code as a non-committer!)
>>>
>>> Considering their contributions to the project over this timeframe, the
>>> Beam PMC trusts Damon with the responsibilities of a Beam committer. [1]
>>>
>>> Thank you Damon! And we are looking to see more of your contributions!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>>
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [ANNOUNCE] New committer: Anand Inguva

2023-04-24 Thread Byron Ellis via dev
Congrats Anand!

On Mon, Apr 24, 2023 at 9:54 AM Ahmet Altay via dev 
wrote:

> Congratulations Anand!
>
> On Mon, Apr 24, 2023 at 8:05 AM Kerry Donny-Clark via dev <
> dev@beam.apache.org> wrote:
>
>> Great work Anand, this is well deserved.
>>
>>
>> On Mon, Apr 24, 2023 at 10:35 AM Yi Hu via dev 
>> wrote:
>>
>>> Congrats Anand!
>>>
>>> On Fri, Apr 21, 2023 at 3:54 PM Danielle Syse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations!

 On Fri, Apr 21, 2023 at 3:53 PM Damon Douglas via dev <
 dev@beam.apache.org> wrote:

> Congratulations Anand!
>
> On Fri, Apr 21, 2023 at 12:28 PM Ritesh Ghorse via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Anand!
>>
>> On Fri, Apr 21, 2023 at 3:24 PM Ahmed Abualsaud via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congrats Anand!
>>>
>>> On Fri, Apr 21, 2023 at 3:18 PM Anand Inguva via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Thanks everyone. Really excited to be a part of Beam Committers.

 On Fri, Apr 21, 2023 at 3:07 PM XQ Hu via dev 
 wrote:

> Congratulations, Anand!!!
>
> On Fri, Apr 21, 2023 at 2:31 PM Jack McCluskey via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations, Anand!
>>
>> On Fri, Apr 21, 2023 at 2:28 PM Valentyn Tymofieiev via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Congratulations!
>>>
>>> On Fri, Apr 21, 2023 at 8:19 PM Jan Lukavský 
>>> wrote:
>>>
 Congrats Anand!
 On 4/21/23 20:05, Robert Burke wrote:

 Congratulations Anand!

 On Fri, Apr 21, 2023, 10:55 AM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> Woohoo, congrats Anand! This is very well deserved!
>
> On Fri, Apr 21, 2023 at 1:54 PM Chamikara Jayalath <
> chamik...@apache.org> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a
>> new committer: Anand Inguva (ananding...@apache.org)
>>
>> Anand has been contributing to Apache Beam for more than a
>> year and  authored and reviewed more than 100 PRs. Anand has 
>> been a core
>> contributor to Beam Python SDK and drove the efforts to support 
>> Python 3.10
>> and Python 3.11.
>>
>> Considering their contributions to the project over this
>> timeframe, the Beam PMC trusts Anand with the responsibilities 
>> of a Beam
>> committer. [1]
>>
>> Thank you Anand! And we are looking to see more of your
>> contributions!
>>
>> Cham, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer
>> /#an-apache-beam-committer
>>
>


Re: [DISCUSS] @Experimental, @Internal, @Stable, etc annotations

2023-04-03 Thread Byron Ellis via dev
Honestly, I think APIs could be pretty simply defined if you think of it in
terms of the user:

@Deprecated = this was either stable or evolving, but the
functionality/interface will go away at a future date

@Stable = the user of this API is opting out of changes to functionality and
interface. For example, default options don't change for a transform
annotated this way.

Evolving (No Annotation) = the user is opting in to changes to
functionality but not to interface. We should generally try to write
backwards compatible code, but on the other hand the release model does not
force users into an upgrade

@Experimental = this functionality / interface might be a bad idea and
could go away at any time
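
As a purely illustrative sketch, the mechanical side of an explicit marker is
trivial (the policy above is the real question):

import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Marks an API whose behavior and signature we commit to keeping stable.
@Documented
@Retention(RetentionPolicy.CLASS)
public @interface Stable {}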


On Mon, Apr 3, 2023 at 7:22 AM Danny McCormick via dev 
wrote:

> *tl;dr - I'd like "evolving" to be further defined, specifically around
> how we will make decisions about breaking behavior and API changes*
>
> I don't particularly care what tags we use as long as they're well
> documented. With that said, I think the following framing needs to be
> documented with more definition to flesh out the underlying philosophy:
>
> *>  - new code is changeable/evolving by default (so we don't have to
> always remember to annotate it) but users have confidence they can use it
> in production (because we have good software engineering practices)*
>
> * > - Experimental would be reserved for more risky things*
> * > - after we are confident an API is stable, because it has been the
> same across a couple releases, we mark it*
>
> Here, we have 3 classes of APIs - "experimental", "stable", and "evolving"
> (or alternately "undefined").
>
> "Experimental" seems clear - we can make any changes we want. "Stable" is
> reasonably straightforward as well - we will only make non-breaking changes
> except in exceptional cases (e.g. security hole, total failure of
> functionality, etc...)
>
> With "evolving" is the idea that we can still make any changes we want,
> but we think it's less likely we'll need to? Are silent behavior changes
> acceptable here (my vote would be no)? What about breaking API changes (my
> vote would be rarely)?
>
> I think being able to change our APIs is an ok goal, but outside of a true
> experimental context we should still be weighing the cost of API changes
> against the benefit; we have a problem of people not updating to newer
> SDKs, and introducing more breaking changes will just exacerbate that
> problem. Maybe my concerns are just a consequence of me not really seeing
> the same things that you're seeing, specifically: "*I'm seeing a culture
> of being afraid to change things, even when it would be good for users,
> because our API surface area is far too large and not explicitly chosen.*"
> Mostly what I've seen is a healthy concern about making it hard for users
> to upgrade versions, but my view is probably just limited here.
>
> My ideal framing for "evolving" is: an *evolving* API can make breaking
> API changes between versions, but this will be rare and weighed against the
> cost of slowing users' upgrade process. All breaking changes will be
> communicated in our change log. An *evolving* API will not make silent
> behavior changes except in exceptional cases (e.g. patching a security gap,
> fixing total failures of functionality).
>
> Thanks,
> Danny
>
> On Mon, Apr 3, 2023 at 9:02 AM Jan Lukavský  wrote:
>
>> Hi,
>>
>> removing @Experimental and adding explicit @Stable annotation makes
>> sense to me. FWIW, when we were designing Euphoria API, we adopted the
>> following convention:
>>
>>   - the default stability of "evolving", @Experimental for really
>> experimental code [1]
>>
>>   - target @Audience of API [2] (pipeline author, runner, internal, test)
>>
>>   - and @StateComplexity of operators (PTransforms) [3]
>>
>> The last part is something that was planned to be used by tools that can
>> analyze the Pipeline for performance or visualize which transform(s) are
>> most state-consuming. But this ended only as plans. :)
>>
>>   Jan
>>
>> [1]
>>
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/euphoria/src/main/java/org/apache/beam/sdk/extensions/euphoria/core/annotation/stability/Experimental.java
>>
>> [2]
>>
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/euphoria/src/main/java/org/apache/beam/sdk/extensions/euphoria/core/annotation/audience/Audience.java
>>
>> [3]
>>
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/euphoria/src/main/java/org/apache/beam/sdk/extensions/euphoria/core/annotation/operator/StateComplexity.java
>>
>>
>> On 3/31/23 23:05, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Long ago, we adopted two annotations in Beam to communicate to users:
>> >
>> >  - `@Experimental` indicates that an API might change
>> >  - `@Internal` indicates that an API is not meant for users.
>> >
>> > I've seen some real problems with this approach:
>> >
>> >  - Users are afraid to use `@Experimental` APIs, because they are

Re: OpenJDK8 / OpenJDK11 container deprecation

2023-02-14 Thread Byron Ellis via dev
FWIW I am Team Upgrade Docker :-)

On Tue, Feb 14, 2023 at 2:53 PM Luke Cwik via user 
wrote:

> I made some progress in testing the container and did hit an issue where
> Ubuntu 22.04 "Jammy" is dependent on the version of Docker installed. It
> turns out that our boot.go crashes with "runtime/cgo: pthread_create
> failed: Operation not permitted" because the Ubuntu 22.04 is using new
> syscalls that Docker 18.09.4 doesn't have a seccomp policy for (and uses a
> default of deny). We have a couple of choices here:
> 1) upgrade the version of docker on Jenkins and require users to similarly
> use a new enough version of Docker so that this isn't an issue for them
> 2) use Ubuntu 20.04 "Focal" as the docker container
>
> I was using Docker 20.10.21 which is why I didn't hit this issue when
> testing the change locally.
>
> We could also do these but they seem strictly worse than either of the two
> options discussed above:
> A) disable the seccomp policy on Jenkins
> B) use a custom seccomp policy on Jenkins
>
> My suggestion is to upgrade Docker versions on Jenkins and use Ubuntu
> 22.04 as it will have LTS releases till 2027 and then security patches till
> 2032 which gives everyone the longest runway till we need to swap OS
> versions again for users of Apache Beam. Any concerns or ideas?
>
>
>
> On Thu, Feb 9, 2023 at 10:20 AM Luke Cwik  wrote:
>
>> Our current Java 8 container is 262 MiB and layers on top of
>> openjdk:8-bullseye, which is 226 MiB compressed, while eclipse-temurin:8 is
>> 92 MiB compressed and eclipse-temurin:8-alpine is 65 MiB compressed.
>>
>> I would rather not get into issues with C library differences caused by
>> the alpine project so I would stick with the safer option and let users
>> choose alpine when building their custom container if they feel it provides
>> a large win for them. We can always swap to alpine in the future as well if
>> the C library differences become a non-issue.
>>
>> So swapping to eclipse-temurin will save us a bunch on the container size
>> which should help with container transfer and hopefully for startup times
>> as well.
>>
>> On Tue, Feb 7, 2023 at 5:41 PM Andrew Pilloud 
>> wrote:
>>
>>> This sounds reasonable to me as well.
>>>
>>> I've made swaps like this in the past, the base image of each is
>>> probably a bigger factor than the JDK. The openjdk images were based on
>>> Debian 11. The default eclipse-temurin images are based on Ubuntu 22.04
>>> with an alpine option. Ubuntu is a Debian derivative but the versions and
>>> package names aren't exact matches and Ubuntu tends to update a little
>>> faster. For most users I don't think this will matter but users building
>>> custom containers may need to make minor changes. The alpine option will be
>>> much smaller (which could be a significant improvement) but would be a more
>>> significant change to the environment.
>>>
>>> On Tue, Feb 7, 2023 at 5:18 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Seams reasonable to me.

 On Tue, Feb 7, 2023 at 4:19 PM Luke Cwik via user 
 wrote:
 >
 > As per [1], the JDK8 and JDK11 containers that Apache Beam uses have
 stopped being built and supported since July 2022. I have filed [2] to
 track the resolution of this issue.
 >
 > Based upon [1], almost everyone is swapping to the eclipse-temurin
 container[3] as their base based upon the linked issues from the
 deprecation notice[1]. The eclipse-temurin container is released under
 these licenses:
 > Apache License, Version 2.0
 > Eclipse Distribution License 1.0 (BSD)
 > Eclipse Public License 2.0
 > 一 (Secondary) GNU General Public License, version 2 with OpenJDK
 Assembly Exception
 > 一 (Secondary) GNU General Public License, version 2 with the GNU
 Classpath Exception
 >
 > I propose that we swap all our containers to the eclipse-temurin
 containers[3].
 >
 > Open to other ideas and also would be great to hear about your
 experience in any other projects that you have had to make a similar
 decision.
 >
 > 1: https://github.com/docker-library/openjdk/issues/505
 > 2: https://github.com/apache/beam/issues/25371
 > 3: https://hub.docker.com/_/eclipse-temurin

>>>


Convenience methods for Schema definition in Java

2023-01-26 Thread Byron Ellis via dev
Hi all,

I've been spending a lot of time recently working with Schemas in Java on
Beam and I noticed that there aren't convenience methods for creating
nullable versions of a field like there are for the NOT NULL versions. If
folks are interested I made a small quality-of-life PR
 to add those to the Schema
Builder which makes defining schemas with lots of nullable fields somewhat
more readable and with a bit less typing to boot.


Best,
B


Re: Thoughts on extensions/datasketches vs adding to the existing sketching library?

2023-01-20 Thread Byron Ellis via dev
Okay, that makes sense to me. In the case of Datasketches it's perhaps more
necessary, as it turns out it has its own concept of Coders, so there's some
"infrastructure" work if you want to ensure binary representation
compatibility. So I can do that for Datasketches, and if I'm inspired to
bring in the ones that wouldn't have a dependency I can put those in the
"generic" sketching extension, if that makes sense for folks.

On Fri, Jan 20, 2023 at 9:57 AM Kenneth Knowles  wrote:

> Oh and also I want to say that this is awesome and I've wanted to
> integrate with Datasketches for a long time but was saving it for a
> newcomer since it is (hopefully) mostly wrapping them in CombineFns. Thanks
> for doing this! I don't see the different HLL implementations are redundant
> at all - I view each of them not so much as functionality but as a linkage
> with another project / maintainership. So the one that we maintain is the
> least good, the Zetasketch one is about linking to Google/GCP/BigQuery, and
> the Apache Datasketches one is about linking with that very active project.
>
> Kenn
>
> On Fri, Jan 20, 2023 at 9:54 AM Kenneth Knowles  wrote:
>
>> My take: it is useful to isolate dependencies. So, packages that are
>> based on specific other projects like Apache Datasketches benefit from
>> being in their own isolated module in Beam, separate from the
>> Zetasketch-based package.
>>
>> Having a generalized "sketching" package that abstracts away the details
>> so that we can swap out implementation should be a third thing independent
>> of the others IMO and could have some sort of plugin architecture. It is
>> overengineering to do so at this point. And like Byron brought up, a key
>> aspect of sketches is their serialized form being compatible, so the user
>> really needs to know exactly what implementation they are using.
>>
>> Kenn
>>
>> On Wed, Jan 18, 2023 at 12:22 PM Byron Ellis via dev 
>> wrote:
>>
>>> Another enhancement/modification to the sketching library might be to
>>> introduce generic encodings for at least the major sketches (HLL, Bloom,
>>> Count-Min) that can translate into the major implementations. Talking with
>>> Kenn it sounds like zetasketch has the side benefit of using an encoding
>>> compatible with BigQuery, but in general I think it would be a nice thing
>>> to let users store the sketch payload in, say, files that they could then
>>> be confident would still be mergeable even if the underlying implementation
>>> of that sketch changed.
>>>
>>> On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis 
>>> wrote:
>>>
>>>> Thanks Luke, my plan was to mostly add ones that didn't already exist.
>>>> I'd also add that there are other techniques (Max-Gumbel Reservoir Sampling
>>>> for example) that aren't in any common library so far as I know that I
>>>> happen to know how to implement which might bias towards the general
>>>> "sketching" library as you say. I generally agree that implementation used
>>>> should be a detail and not something relevant to users.
>>>>
>>>> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik  wrote:
>>>>
>>>>> I would suggest adding it to the existing package(s) (either
>>>>> sdks/java/extensions or sdks/java/zetasketch or both depending on if 
>>>>> you're
>>>>> replacing existing sketches or adding new ones) since we shouldn't expose
>>>>> sketching libraries API surface. We should make the API take all the
>>>>> relevant parameters since this allows us to move between variants and
>>>>> choose the best sketching library.
>>>>>
>>>>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> I believe that when zetasketch was added, it was also noticeably more
>>>>>> efficient than other sketch implementations. However this was a number of
>>>>>> years ago, and I don't know whether it still has an advantage or not.
>>>>>>
>>>>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I was looking at adding at least a couple of the sketches from the
>>>>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>>>>> folks had a preference f

Re: Thoughts on extensions/datasketches vs adding to the existing sketching library?

2023-01-18 Thread Byron Ellis via dev
Another enhancement/modification to the sketching library might be to
introduce generic encodings for at least the major sketches (HLL, Bloom,
Count-Min) that can translate into the major implementations. Talking with
Kenn it sounds like zetasketch has the side benefit of using an encoding
compatible with BigQuery, but in general I think it would be a nice thing
to let users store the sketch payload in, say, files that they could then
be confident would still be mergeable even if the underlying implementation
of that sketch changed.

On Wed, Jan 18, 2023 at 11:50 AM Byron Ellis  wrote:

> Thanks Luke, my plan was to mostly add ones that didn't already exist. I'd
> also add that there are other techniques (Max-Gumbel Reservoir Sampling for
> example) that aren't in any common library so far as I know that I happen
> to know how to implement which might bias towards the general "sketching"
> library as you say. I generally agree that implementation used should be a
> detail and not something relevant to users.
>
> On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik  wrote:
>
>> I would suggest adding it to the existing package(s) (either
>> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
>> replacing existing sketches or adding new ones) since we shouldn't expose
>> sketching libraries API surface. We should make the API take all the
>> relevant parameters since this allows us to move between variants and
>> choose the best sketching library.
>>
>> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev 
>> wrote:
>>
>>> I believe that when zetasketch was added, it was also noticeably more
>>> efficient than other sketch implementations. However this was a number of
>>> years ago, and I don't know whether it still has an advantage or not.
>>>
>>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I was looking at adding at least a couple of the sketches from the
>>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>>> folks had a preference for adding to the existing "sketching" extension vs
>>>> splitting it out into its own extension?
>>>>
>>>> The reason I ask is that there's some overlap (which already exists in
>>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>>> particularly HyperLogLog which would have 3 implementations if we were to
>>>> add all of them.
>>>>
>>>> I don't really have a strong opinion, though personally I'd probably
>>>> lean towards a single sketching extension (zetasketch being something of a
>>>> special case as it exists for format compatibility as far as I can tell).
>>>> But I could see how that could be confusing if you had the Apache
>>>> Datasketch implementation and the existing implementation derived from the
>>>> clearspring implementations.
>>>>
>>>> Any thoughts?
>>>>
>>>> Best,
>>>> B
>>>>
>>>


Re: Thoughts on extensions/datasketches vs adding to the existing sketching library?

2023-01-18 Thread Byron Ellis via dev
Thanks Luke, my plan was to mostly add ones that didn't already exist. I'd
also add that there are other techniques (Max-Gumbel Reservoir Sampling, for
example) that aren't in any common library so far as I know, but that I
happen to know how to implement, which might bias towards the general
"sketching" library as you say. I generally agree that the implementation
used should be a detail and not something relevant to users.

On Wed, Jan 18, 2023 at 11:43 AM Luke Cwik  wrote:

> I would suggest adding it to the existing package(s) (either
> sdks/java/extensions or sdks/java/zetasketch or both depending on if you're
> replacing existing sketches or adding new ones) since we shouldn't expose
> sketching libraries API surface. We should make the API take all the
> relevant parameters since this allows us to move between variants and
> choose the best sketching library.
>
> On Wed, Jan 18, 2023 at 11:24 AM Reuven Lax via dev 
> wrote:
>
>> I believe that when zetasketch was added, it was also noticeably more
>> efficient than other sketch implementations. However this was a number of
>> years ago, and I don't know whether it still has an advantage or not.
>>
>> On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I was looking at adding at least a couple of the sketches from the
>>> Apache Datasketches library to the Beam Java SDK and I was wondering if
>>> folks had a preference for adding to the existing "sketching" extension vs
>>> splitting it out into its own extension?
>>>
>>> The reason I ask is that there's some overlap (which already exists in
>>> zetasketch) between the sketches available in Datasketches vs Beam today,
>>> particularly HyperLogLog which would have 3 implementations if we were to
>>> add all of them.
>>>
>>> I don't really have a strong opinion, though personally I'd probably
>>> lean towards a single sketching extension (zetasketch being something of a
>>> special case as it exists for format compatibility as far as I can tell).
>>> But I could see how that could be confusing if you had the Apache
>>> Datasketch implementation and the existing implementation derived from the
>>> clearspring implementations.
>>>
>>> Any thoughts?
>>>
>>> Best,
>>> B
>>>
>>


Thoughts on extensions/datasketches vs adding to the existing sketching library?

2023-01-18 Thread Byron Ellis via dev
Hi everyone,

I was looking at adding at least a couple of the sketches from the Apache
Datasketches library to the Beam Java SDK and I was wondering if folks had
a preference for adding to the existing "sketching" extension vs splitting
it out into its own extension?

The reason I ask is that there's some overlap (which already exists in
zetasketch) between the sketches available in Datasketches vs Beam today,
particularly HyperLogLog which would have 3 implementations if we were to
add all of them.

I don't really have a strong opinion, though personally I'd probably lean
towards a single sketching extension (zetasketch being something of a
special case as it exists for format compatibility as far as I can tell).
But I could see how that could be confusing if you had the Apache
Datasketch implementation and the existing implementation derived from the
clearspring implementations.

Any thoughts?

Best,
B


Re: Testing Multilanguage Pipelines?

2022-12-28 Thread Byron Ellis via dev
"apache_beam.runners.portability.expansion_service_main",
> ["--fully_qualified_name_glob=*", "--port", "{{PORT}}"])
>
> Here {{PORT}} is auto-populated and can be retrieved to instantiate
> the GRPC connection (or, in your case, passed to the portable runner
> as the endpoint).
>
> Java is very similar, one does
>
> PythonService service =
> new
> PythonService("apache_beam.runners.portability.expansion_service_main",
> ...);
> AutoCloseable running = service.start();
> ...
>
> which should be easily adaptable to starting up a runner. On first use
> this service automatically creates a virtual environment with Beam
> (and other dependencies) installed. (I don't know what the analogue is
> for Go, but it shouldn't be that different...)
>
> The one difficulty with auto-started service is that the release
> artifacts are not necessarily available for a dev repo the same way
> they are with a released version. IIRC, we fall back to the previous
> release in that case. To compensate, and have faster iteration/easier
> testing, one can set an environment variable BEAM_SERVICE_OVERRIDES
> where one specifies an existing venv, jar, or address to use for a
> specific service, see
>
> https://github.com/apache/beam/blob/release-2.43.0/sdks/typescript/src/apache_beam/utils/service.ts#L432
> . This works for Python and typescript; I don't remember if I
> implemented it for Java and I don't think it's yet in Go.
>
> Hopefully this is enough pointers to get started. It'd be great to get
> Java up to snuff.
>
> References:
>
> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/runners/universal.ts#L34
>
> https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/python.ts#L60
>
> https://github.com/apache/beam/blob/release-2.43.0/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/PythonExternalTransform.java#L466
>
> >> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
> >> >>
> >> >> Given the increasing importance of multi language pipelines, it does
> seem that we should expand the capabilities of the DirectRunner or just go
> all in on FlinkRunner for testing and local / small scale development
> >> >>
> >> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke 
> wrote:
> >> >>>
> >> >>> Probably either on Flink, or the Python Portable runner at this
> juncture.
> >> >>>
> >> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev <
> dev@beam.apache.org> wrote:
> >> >>>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I spent some more time adding things to my dbt-for-Beam clone (
> https://github.com/apache/beam/pull/24670) and actually made a fair
> amount of progress, including starting to add in the profile support so I
> can start to run it against real workloads (though at the moment only the
> "test" connector is properly configured). More interestingly, though, is
> adding in support for Python Dataframe external transforms... which expands
> properly, but then (unsurprisingly) hangs if you try to actually run the
> pipeline with Java's TestPipeline.
> >> >>>>
> >> >>>> I was wondering how people go about testing Java/Python hybrid
> pipelines locally? The Java<->Python tests don't seem to actually execute a
> pipeline, but I was hoping that maybe the direct runner could be set up
> properly to do that?
> >> >>>>
> >> >>>> Best,
> >> >>>> B
>


Re: Testing Multilanguage Pipelines?

2022-12-28 Thread Byron Ellis via dev
On Wed, Dec 28, 2022 at 9:49 AM Robert Bradshaw  wrote:

> On Wed, Dec 28, 2022 at 4:56 AM Danny McCormick via dev
>  wrote:
> >
> > > Given the increasing importance of multi language pipelines, it does
> seem that we should expand the capabilities of the DirectRunner or just go
> all in on FlinkRunner for testing and local / small scale development
> >
> > +1 - anecdotally I've found local testing of multi-language pipelines
> to be tricky, and have had multiple conversations with others who have run
> into similar challenges in multiple contexts (both users and people working
> on the project).
>
> I generally do all my testing against the Python runner which works
> well. This is, of course, more natural for Python pipelines using
> other languages, but when I was working on typescript which uses
> cross-language even more heavily I just made it auto-start the python
> runner just like the expansion services are auto-started which works
> quite well. (The auto-started runner is just a plain-old portable
> runner speaking the runner API, so no additional support is required
> on the source side once it's started. And if you're already trying to
> use dataframes and/or ML, you need to have Python available anyway.)
>
> We could consider bundling it as a docker image to reduce the required
> dependency set, but we'd have to solve the docker-in-docker issue to
> do that.
>
> I really think it's important to make cross-language a first-class
> citizen--the end use should not care most of the time whether the
> pipelines they use are native or not.
>

Thanks! That's helpful. In this case getting the Python runner to
auto-start sounds like the most straightforward option for testing. After
all, the point here is explicitly to provide Python initiated from Java, so
Python is already going to be around and running (in fact the test already
auto-starts the Python expansion service to get the graph in the first
place) and the deps are already going to be there. I'm personally on the fence
about Docker in these sorts of situations. Yes, it makes life easier for
the most part but gets complicated quickly. It's also not an option for
everyone. I'll give things a shot and report back (if you have an example
of auto-starting the Python runner that'd be cool too---if I get inspired I
might try to add that to the Python extensions in Java since right now they
don't actually appear to be exercising the runner itself based on the TODOs)

Best,
B




>
> > On Wed, Dec 28, 2022 at 7:50 AM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Given the increasing importance of multi language pipelines, it does
> seem that we should expand the capabilities of the DirectRunner or just go
> all in on FlinkRunner for testing and local / small scale development
> >>
> >> On Wed, Dec 28, 2022 at 12:47 AM Robert Burke 
> wrote:
> >>>
> >>> Probably either on Flink, or the Python Portable runner at this
> juncture.
> >>>
> >>> On Tue, Dec 27, 2022, 8:40 PM Byron Ellis via dev 
> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I spent some more time adding things to my dbt-for-Beam clone (
> https://github.com/apache/beam/pull/24670) and actually made a fair
> amount of progress, including starting to add in the profile support so I
> can start to run it against real workloads (though at the moment only the
> "test" connector is properly configured). More interestingly, though, is
> adding in support for Python Dataframe external transforms... which expands
> properly, but then (unsurprisingly) hangs if you try to actually run the
> pipeline with Java's TestPipeline.
> >>>>
> >>>> I was wondering how people go about testing Java/Python hybrid
> pipelines locally? The Java<->Python tests don't seem to actually execute a
> pipeline, but I was hoping that maybe the direct runner could be set up
> properly to do that?
> >>>>
> >>>> Best,
> >>>> B
>
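
For anyone trying this, pointing a Java pipeline at a separately started
Python portable runner is roughly the following sketch (option names and the
job-service module path are from memory, so treat them as assumptions to
verify):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Assumes a local job service is already running, e.g. started with
// "python -m apache_beam.runners.portability.local_job_service_main --port 8099".
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(
            "--runner=PortableRunner",
            "--jobEndpoint=localhost:8099",
            "--defaultEnvironmentType=LOOPBACK")
        .create();
Pipeline pipeline = Pipeline.create(options);
// ... build the pipeline, including any external/Python transforms ...
pipeline.run().waitUntilFinish();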


Testing Multilanguage Pipelines?

2022-12-27 Thread Byron Ellis via dev
Hi all,

I spent some more time adding things to my dbt-for-Beam clone (
https://github.com/apache/beam/pull/24670) and actually made a fair amount
of progress, including starting to add in the profile support so I can
start to run it against real workloads (though at the moment only the
"test" connector is properly configured). More interestingly, though, is
adding in support for Python Dataframe external transforms... which expands
properly, but then (unsurprisingly) hangs if you try to actually run the
pipeline with Java's TestPipeline.

I was wondering how people go about testing Java/Python hybrid pipelines
locally? The Java<->Python tests don't seem to actually execute a pipeline,
but I was hoping that maybe the direct runner could be set up properly to
do that?

Best,
B


Re: A Declarative API for Apache Beam

2022-12-16 Thread Byron Ellis via dev
 step: *message_xml
>>>> > tag: success
>>>> >   transform:
>>>> > !PTransform:utils.transforms.ValidateXmlDocumentWithXmlSchema
>>>> > schema: /path/to/file.xsd
>>>> > - _messages
>>>> >   label: Convert XMLs
>>>> >   inputs:
>>>> >   - step: *validated_messages
>>>> >   transform:
>>>> > !PTransform:utils.transforms.ConvertXmlDocumentToDictionary
>>>> > schema: /path/to/file.xsd
>>>> > - label: Print XMLs
>>>> >   inputs:
>>>> >   - step: *converted_messages
>>>> >   transform:
>>>> > !PTransform:utils.transforms.Print {}
>>>> >
>>>> > Highlights:
>>>> > Pipeline options are supplied under an options property.
>>>>
>>>> Yep, I was thinking exactly the same:
>>>>
>>>> https://github.com/apache/beam/blob/c5518014d47a42651df94419e3ccbc79eaf96cb3/sdks/python/apache_beam/yaml/main.py#L51
>>>>
>>>> > A pipeline is a flat set of all transforms in the pipeline.
>>>>
>>>> One can certainly enumerate the transforms as a flat set, but I do
>>>> think being able to define a composite structure is nice. In addition,
>>>> the "chain" composite allows one to automatically infer the
>>>> input-output relation rather than having to spell it out (much as one
>>>> can chain multiple transforms in the various SDKs rather than have to
>>>> assign each result to a intermediate).
>>>>
>>>> > Transforms are defined using a YAML tag and named properties and can
>>>> be used by constructing a YAML reference.
>>>>
>>>> That's an interesting idea. Can it be done inline as well?
>>>>
>>>> > DAG construction is done using a simple topological sort of
>>>> transforms and their dependencies.
>>>>
>>>> Same.
>>>>
>>>> > Named side outputs can be referenced using a tag field.
>>>>
>>>> I didn't put this in any of the examples, but I do the same. If a
>>>> transform Foo produces multiple outputs, one can (in fact must)
>>>> reference the various outputs by Foo.output1, Foo.output2, etc.
>>>>
>>>> > Multiple inputs are merged with a Flatten transform.
>>>>
>>>> PTransforms can have named inputs as well (they're not always
>>>> symmetric), so I let inputs be a map if they care to distinguish them.
>>>>
>>>> > Not sure if there's any inspiration left to take from this, but I
>>>> figured I'd throw it up here to share.
>>>>
>>>> Thanks. It's neat to see others coming up with the same idea, with
>>>> very similar conventions, and validates that it'd be both natural and
>>>> useful.
>>>>
>>>>
>>>> > On Thu, Dec 15, 2022 at 12:48 AM Chamikara Jayalath via dev <
>>>> dev@beam.apache.org> wrote:
>>>> >>
>>>> >> +1 for these proposals and agree that these will simplify and
>>>> demystify Beam for many new users. I think when combined with the
>>>> x-lang/Schema-Aware transform binding, these might end up being adequate
>>>> solutions for many production use-cases as well (unless users need to
>>>> define custom composites, I/O connectors, etc.).
>>>> >>
>>>> >> Also, thanks for providing prototype implementations with examples.
>>>> >>
>>>> >> - Cham
>>>> >>
>>>> >>
>>>> >> On Wed, Dec 14, 2022 at 3:01 PM Sachin Agarwal via dev <
>>>> dev@beam.apache.org> wrote:
>>>> >>>
>>>> >>> To build on Kenn's point, if we leverage existing stuff like dbt we
>>>> get access to a ready made community which can help drive both adoption and
>>>> incremental innovation by bringing more folks to Beam
>>>> >>>
>>>> >>> On Wed, Dec 14, 2022 at 2:57 PM Kenneth Knowles 
>>>> wrote:
>>>> >>>>
>>>> >>>> 1. I love the idea. Back in the early days people talked about an
>>>> "XML SDK" or "JSON SDK" or "YAML SDK" and it didn't really make sense at
>>>> the time. Portability and specifically cross-language schema transforms
>>>> gives the right infrastructure 

Re: A Declarative API for Apache Beam

2022-12-14 Thread Byron Ellis via dev
And I guess also a PR for completeness to make it easier to find going
forward instead of my random repo: https://github.com/apache/beam/pull/24670

On Wed, Dec 14, 2022 at 2:37 PM Byron Ellis  wrote:

> Since Robert opened that can of worms (and we happened to talk about it
> yesterday)... :-)
>
> I figured I'd also share my start on a "port" of dbt to the Beam SDK. This
> would be complementary as it doesn't really provide a way of specifying a
> pipeline, more orchestrating and packaging a complex pipeline---dbt itself
> supports SQL and Python Dataframes, which both seem like reasonable things
> for Beam and it wouldn't be a stretch to include something like the format
> above. Though in my head I had imagined people would tend to write
> composite transforms in the SDK of their choosing that are then exposed at
> this layer. I decided to go with dbt as it also provides a number of nice
> "quality of life" features for its users like documentation, validation,
> environments and so on.
>
> I did a really quick proof-of-viability implementation here:
> https://github.com/byronellis/beam/tree/structured-pipeline-definitions
>
> And you can see a really simple pipeline that reads a seed file (TextIO),
> runs it through a couple of SQLTransforms and then drops it out to a logger
> via a simple DoFn here:
> https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
>
> I've also heard a rumor there might also be a textproto-based
> representation floating around too :-)
>
> Best,
> B
>
>
>
>
>
> On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev 
> wrote:
>
>> Hello Robert,
>>
>> I'm replying to say that I've been waiting for something like this ever
>> since I started learning Beam and I'm grateful you are pushing this forward.
>>
>> Best,
>>
>> Damon
>>
>> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw 
>> wrote:
>>
>>> While Beam provides powerful APIs for authoring sophisticated data
>>> processing pipelines, it often still has too high a barrier for
>>> getting started and authoring simple pipelines. Even setting up the
>>> environment, installing the dependencies, and setting up the project
>>> can be an overwhelming amount of boilerplate for some (though
>>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>>> way in making this easier). At the other extreme, the Dataflow project
>>> has the notion of templates which are pre-built Beam pipelines that
>>> can be easily launched from the command line, or even from your
>>> browser, but they are fairly restrictive, limited to pre-assembled
>>> pipelines taking a small number of parameters.
>>>
>>> The idea of creating a yaml-based description of pipelines has come up
>>> several times in several contexts and this last week I decided to code
>>> up what it could look like. Here's a proposal.
>>>
>>> pipeline:
>>>   - type: chain
>>> transforms:
>>>   - type: ReadFromText
>>> args:
>>>  file_pattern: "wordcount.yaml"
>>>   - type: PyMap
>>> fn: "str.lower"
>>>   - type: PyFlatMap
>>> fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>>   - type: PyTransform
>>> name: Count
>>> constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>>   - type: PyMap
>>> fn: str
>>>   - type: WriteToText
>>> file_path_prefix: "counts.txt"
>>>
>>> Some more examples at
>>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>>
>>> A prototype (feedback welcome) can be found at
>>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>>
>>> python -m apache_beam.yaml.main --pipeline_spec_file
>>> [path/to/file.yaml] [other_pipeline_args]
>>>
>>> or
>>>
>>> python -m apache_beam.yaml.main --pipeline_spec [yaml_contents]
>>> [other_pipeline_args]
>>>
>>> For example, to play around with this one could do
>>>
>>> python -m apache_beam.yaml.main  \
>>> --pipeline_spec "$(curl
>>>
>>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>>> )"
>>> \
>>> --runner=apache_beam.runners.render.RenderRunner \
>>> --render_out=out.png
>>>
>>> Alternatively one can run it as a docker container with no need to
>>> install any SDK
>>>
>>> docker run --rm \
>>> --entrypoint /usr/local/bin/python \
>>> gcr.io/apache-beam-testing/yaml_template:dev
>>> /dataflow/template/main.py \
>>> --pipeline_spec="$(curl
>>>
>>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>>> )"
>>>
>>> Though of course one would have to set up the appropriate mount points
>>> to do any local filesystem io and/or credentials.
>>>
>>> This is also available as a Dataflow template and can be invoked as
>>>
>>> gcloud dataflow flex-template run \
>>> 

Re: A Declarative API for Apache Beam

2022-12-14 Thread Byron Ellis via dev
Since Robert opened that can of worms (and we happened to talk about it
yesterday)... :-)

I figured I'd also share my start on a "port" of dbt to the Beam SDK. This
would be complementary, as it isn't really a way of specifying a pipeline
so much as a way of orchestrating and packaging a complex one. dbt itself
supports SQL and Python dataframes, which both seem like reasonable things
for Beam, and it wouldn't be a stretch to include something like the format
above. In my head I had imagined people would tend to write composite
transforms in the SDK of their choosing, which are then exposed at this
layer. I decided to go with dbt because it also provides a number of nice
"quality of life" features for its users, like documentation, validation,
environments, and so on.

I did a really quick proof-of-viability implementation here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions

And you can see a really simple pipeline that reads a seed file (TextIO),
runs it through a couple of SQLTransforms and then drops it out to a logger
via a simple DoFn here:
https://github.com/byronellis/beam/tree/structured-pipeline-definitions/sdks/java/extensions/spd/src/test/resources/simple_pipeline
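
If you'd rather see that as plain Java, a rough hand-written sketch of the
same shape of pipeline might look like the following. This is illustrative
only: the seed schema, file name, and queries are made up for the example
and aren't what the prototype actually generates.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SimpleSeedPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(SimpleSeedPipeline.class);

  public static void main(String[] args) {
    // Hypothetical two-column seed file; a real project defines its own schema.
    Schema seedSchema =
        Schema.builder()
            .addStringField("id")
            .addNullableField("name", Schema.FieldType.STRING)
            .build();

    Pipeline p = Pipeline.create();
    p.apply("ReadSeed", TextIO.read().from("seed.csv"))
        // Turn each CSV line into a schema'd Row so SqlTransform can query it.
        .apply("ToRows",
            MapElements.into(TypeDescriptors.rows())
                .via((String line) -> {
                  String[] parts = line.split(",", 2);
                  return Row.withSchema(seedSchema)
                      .addValues(parts[0], parts.length > 1 ? parts[1] : null)
                      .build();
                }))
        .setRowSchema(seedSchema)
        // A couple of SQL "models" chained together, standing in for the SPD steps.
        .apply("Clean", SqlTransform.query(
            "SELECT id, name FROM PCOLLECTION WHERE name IS NOT NULL"))
        .apply("Upper", SqlTransform.query(
            "SELECT id, UPPER(name) AS name FROM PCOLLECTION"))
        // The final step just logs each row via a simple DoFn instead of a real sink.
        .apply("Log", ParDo.of(new DoFn<Row, Void>() {
          @ProcessElement
          public void processElement(@Element Row row) {
            LOG.info("row: {}", row);
          }
        }));
    p.run().waitUntilFinish();
  }
}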

I've also heard a rumor that there might be a textproto-based
representation floating around too :-)

Best,
B





On Wed, Dec 14, 2022 at 2:21 PM Damon Douglas via dev 
wrote:

> Hello Robert,
>
> I'm replying to say that I've been waiting for something like this ever
> since I started learning Beam and I'm grateful you are pushing this forward.
>
> Best,
>
> Damon
>
> On Wed, Dec 14, 2022 at 2:05 PM Robert Bradshaw 
> wrote:
>
>> While Beam provides powerful APIs for authoring sophisticated data
>> processing pipelines, it often still has too high a barrier for
>> getting started and authoring simple pipelines. Even setting up the
>> environment, installing the dependencies, and setting up the project
>> can be an overwhelming amount of boilerplate for some (though
>> https://beam.apache.org/blog/beam-starter-projects/ has gone a long
>> way in making this easier). At the other extreme, the Dataflow project
>> has the notion of templates which are pre-built Beam pipelines that
>> can be easily launched from the command line, or even from your
>> browser, but they are fairly restrictive, limited to pre-assembled
>> pipelines taking a small number of parameters.
>>
>> The idea of creating a yaml-based description of pipelines has come up
>> several times in several contexts and this last week I decided to code
>> up what it could look like. Here's a proposal.
>>
>> pipeline:
>>   - type: chain
>> transforms:
>>   - type: ReadFromText
>> args:
>>  file_pattern: "wordcount.yaml"
>>   - type: PyMap
>> fn: "str.lower"
>>   - type: PyFlatMap
>> fn: "import re\nlambda line: re.findall('[a-z]+', line)"
>>   - type: PyTransform
>> name: Count
>> constructor: "apache_beam.transforms.combiners.Count.PerElement"
>>   - type: PyMap
>> fn: str
>>   - type: WriteToText
>> file_path_prefix: "counts.txt"
>>
>> Some more examples at
>> https://gist.github.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a
>>
>> A prototype (feedback welcome) can be found at
>> https://github.com/apache/beam/pull/24667. It can be invoked as
>>
>> python -m apache_beam.yaml.main --pipeline_spec_file
>> [path/to/file.yaml] [other_pipeline_args]
>>
>> or
>>
>> python -m apache_beam.yaml.main --pipeline_spec [yaml_contents]
>> [other_pipeline_args]
>>
>> For example, to play around with this one could do
>>
>> python -m apache_beam.yaml.main  \
>> --pipeline_spec "$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> \
>> --runner=apache_beam.runners.render.RenderRunner \
>> --render_out=out.png
>>
>> Alternatively one can run it as a docker container with no need to
>> install any SDK
>>
>> docker run --rm \
>> --entrypoint /usr/local/bin/python \
>> gcr.io/apache-beam-testing/yaml_template:dev
>> /dataflow/template/main.py \
>> --pipeline_spec="$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>>
>> Though of course one would have to set up the appropriate mount points
>> to do any local filesystem io and/or credentials.
>>
>> This is also available as a Dataflow template and can be invoked as
>>
>> gcloud dataflow flex-template run \
>> "yaml-template-job" \
>>  --template-file-gcs-location
>> gs://apache-beam-testing-robertwb/yaml_template.json \
>> --parameters ^~^pipeline_spec="$(curl
>>
>> https://gist.githubusercontent.com/robertwb/0bab10a4ebf1001e187bbe3f5241023a/raw/e08dc4ccdf7c7ec9ea607e530ce6fd8f40109d3a/math.yaml
>> )"
>> \
>> --parameters 

Re: [Proposal] | Move FileIO and TextIO from :sdks:java:core to :sdks:java:io:file

2022-12-14 Thread Byron Ellis via dev
Take it with a grain of salt since I'm not even a committer, but isn't the
reorganization of Beam into smaller components perhaps the real work of a
3.0 effort? Splitting Beam into smaller, more independently managed
components would be a pretty huge breaking change from a dependency
management perspective, and one that is potentially largely separate from
any code changes.

Best,
B

On Wed, Dec 14, 2022 at 9:23 AM Alexey Romanenko 
wrote:

> On 12 Dec 2022, at 22:23, Robert Bradshaw via dev 
> wrote:
>
>
> Saving up all the breaking changes until a major release definitely
> has its downsides (look at Python 3). The migration path is often as
> important as (if not more so than) the final destination.
>
>
> Actually, it proves that the major releases *should not* be delayed for a
> long period of time and *should* be issued more often to reduce the
> number of breaking changes (which, of course, may well happen). That will
> help users do much smoother and less risky upgrades, and help developers
> avoid carrying that burden forever. Beam 2.0.0 was released back in May
> 2017 and we've almost never talked about Beam 3.0 and what the criteria
> for it would be.
> I understand that it’s a completely different discussion but seems that
> this time has come =)
>
> As for this particular change, I would question how the benefit (it's
> unclear what the exact benefit is--better internal organization?)
> exceeds the pain of making every user refactor their code. I think a
> stronger case can be made for things like the Avro dependency that
> cause real pain.
>
>
> Agree. I think that if it doesn’t bring any pain with additional external
> dependencies and this code is used in almost every other SDK module, then
> there are no reasons for such breaking changes. On the other hand, Avro
> case, that you mentioned above, is a good example why sometimes it would be
> better to keep such code outside of “core”.
>
> As for the pipeline update feature, we've long discussed having
> "pick-your-implementation" transforms that specify alternative,
> equivalent implementations. Upgrades can choose the old one whereas
> new pipelines can get the latest and greatest. It won't solve all
> issues, and requires keeping old codepaths around, but could be an
> important step forward.
>
> On Mon, Dec 12, 2022 at 10:20 AM Kenneth Knowles  wrote:
>
>
> I agree with Mortiz. To answer a few specifics in my own words:
>
> - It is a perfectly sensible refactor, but as a counterpoint, without
> file-based IO the SDK isn't functional, so it is also a reasonable design
> point to have this included. There are other things in the core SDK that
> are far less "core" and could be moved out with greater benefit. The main
> goal for any separation of modules would be lighter weight transitive
> dependencies, IMO.
>
> - No, Beam has not made any deliberate breaking changes of this nature.
> Hence we are still on major version 2. We have made some bugfixes for data
> loss risks that could be called "breaking changes" but since the feature
> was unsafe to use in the first place we did not bump the major version.
>
> - It is sometimes possible to do such a refactor and have the deprecated
> location proxy to the new location. In this case that seems hard to achieve.
>
> - It is not actually necessary to maintain both locations, as we can
> declare the old location will be unmaintained (but left alone) and all new
> development goes to the new location. That isn't a great choice for users
> who may simply upgrade their SDK version and not notice that their old code
> is now pointing at a version that will not receive e.g. security updates.
>
> - I like the style where if/when we transition from Beam 2 to Beam 3 we
> should have the exact functionality of Beam 3 available as an opt-in flag
> first. So if a user passes --beam-3 they get exactly what will be the
> default functionality when we bump the major version. It really is a
> problem to do a whole bunch of stuff feverishly before a major version
> bump. The other style that I think works well is the Linux kernel style
> where major versions alternate between stable and unstable (in other words,
> returning to the 0.x style with every alternating version).
>
> - I do think Beam suffers from fear and inability to do significant code
> gardening. I don't think backwards compatibility in the code sense is the
> biggest blocker. I think the "pipeline update" feature is perhaps the thing
> most holding Beam back from making radical rapid forward progress.
>
> Kenn
>
> On Mon, Dec 12, 2022 at 2:25 AM Moritz Mack  wrote:
>
>
> Hi Damon,
>
>
>
> I fear the current release / versioning strategy of Beam doesn’t lend
> itself well for such breaking changes. Alexey and I have spent quite some
> time discussing how to proceed with the problematic Avro dependency in core
> (and respectively AvroIO, of course).
>
> Such changes essentially always require duplicating code to continue
> supporting a deprecated legacy code path to not 

Re: Credentials Rotation Failure on IO-Datastores cluster

2022-12-01 Thread Byron Ellis via dev
Is there something we need to do here?

On Thu, Dec 1, 2022 at 10:10 AM Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> Something went wrong during the automatic credentials rotation for
> IO-Datastores Cluster, performed at Thu Dec 01 15:00:47 UTC 2022. It may be
> necessary to check the state of the cluster certificates. For further
> details refer to the following links:
>  * https://ci-beam.apache.org/job/beam_SeedJob_Standalone/
>  * https://ci-beam.apache.org/.


Re: [ANNOUNCE] New committer: Yi Hu

2022-11-09 Thread Byron Ellis via dev
Congratulations!

On Wed, Nov 9, 2022 at 11:00 AM Pablo Estrada via dev 
wrote:

> +1 thanks Yi : D
>
> On Wed, Nov 9, 2022 at 10:47 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Congrats Yi! I've really appreciated the ways you've consistently taken
>> responsibility for improving our team's infra and working through sharp
>> edges in the codebase that others have ignored. This is definitely well
>> deserved!
>>
>> Thanks,
>> Danny
>>
>> On Wed, Nov 9, 2022 at 1:37 PM Anand Inguva via dev 
>> wrote:
>>
>>> Congratulations Yi!
>>>
>>> On Wed, Nov 9, 2022 at 1:35 PM Ritesh Ghorse via dev <
>>> dev@beam.apache.org> wrote:
>>>
 Congratulations Yi!

 On Wed, Nov 9, 2022 at 1:34 PM Ahmed Abualsaud via dev <
 dev@beam.apache.org> wrote:

> Congrats Yi!
>
> On Wed, Nov 9, 2022 at 1:33 PM Sachin Agarwal via dev <
> dev@beam.apache.org> wrote:
>
>> Congratulations Yi!
>>
>> On Wed, Nov 9, 2022 at 10:32 AM Kenneth Knowles 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Yi Hu (y...@apache.org)
>>>
>>> Yi started contributing to Beam in early 2022. Yi's contributions
>>> are very diverse! I/Os, performance tests, Jenkins, support for Schema
>>> logical types. Not only code but a very large amount of code review. Yi 
>>> is
>>> also noted for picking up smaller issues that normally would be left on 
>>> the
>>> backburner and filing issues that he finds rather than ignoring them.
>>>
>>> Considering their contributions to the project over this timeframe,
>>> the Beam PMC trusts Yi with the responsibilities of a Beam committer. 
>>> [1]
>>>
>>> Thank you Yi! And we are looking to see more of your contributions!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>>
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [ANNOUNCE] New committer: Ritesh Ghorse

2022-11-03 Thread Byron Ellis via dev
Congratulations!

On Thu, Nov 3, 2022 at 4:21 PM Austin Bennett 
wrote:

> Congratulations, and Thanks @riteshgho...@apache.org!
>
> On Thu, Nov 3, 2022 at 4:17 PM Sachin Agarwal via dev 
> wrote:
>
>> Congrats Ritesh!
>>
>> On Thu, Nov 3, 2022 at 4:16 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new
>>> committer: Ritesh Ghorse (riteshgho...@apache.org)
>>>
>>> Ritesh started contributing to Beam in mid-2021 and has contributed
>>> immensely to bringing the Go SDK to fruition, in addition to contributions
>>> to Java and Python and release validation.
>>>
>>> Considering their contributions to the project over this timeframe, the
>>> Beam PMC trusts Ritesh with the responsibilities of a Beam committer. [1]
>>>
>>> Thank you Ritesh! And we are looking to see more of your contributions!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>>
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Updated invitation: Beam Community Meetup @ Thu Sep 22, 2022 9:30am - 10:30am (PDT) (dev@beam.apache.org)

2022-09-22 Thread Byron Ellis via dev
[Calendar attachment: invite.ics]

Event: Beam Community Meetup
When: Thu Sep 22, 2022, 16:30 - 17:30 UTC (9:30am - 10:30am PDT)
Where: https://www.crowdcast.io/e/beam-community-meetup
(also US-RDU-MOR200-6-E-Gingham (3) [GVC])

"A Practitioner's View of Beam" by Byron Ellis. Register via Crowdcast:
https://www.crowdcast.io/e/beam-community-meetup

In this talk, Byron will discuss what motivated him to join the Beam team
and where the model benefits data practitioners (data engineering, machine
learning, analysts, etc.). He'll share some thoughts on how we can use
these concepts to build more scalable and manageable data pipelines.

Learn more about this meetup by consulting our agenda:
https://docs.google.com/document/d/1ATyPO0CaDOQ4m6drCT64m9vkbRo0eU4dZkxzIY7JB0Y/edit?usp=sharing

Remember we'll have a Q&A session at the end of the meetup!


Re: [ANNOUNCE] Apache Beam 2.41.0 Released

2022-08-26 Thread Byron Ellis via dev
Thanks Kiley!

On Fri, Aug 26, 2022 at 9:37 AM Ahmet Altay via dev 
wrote:

> Thank you Kiley!
>
> On Fri, Aug 26, 2022 at 6:38 AM P Singh 
> wrote:
>
>> Hi Team,
>>
>> Looking forward to trying and testing the new version. It's always
>> fascinating.
>>
>> On Fri, 26 Aug 2022 at 03:45, Pablo Estrada via user <
>> u...@beam.apache.org> wrote:
>>
>>> Thank you Kiley!
>>>
>>> On Thu, Aug 25, 2022 at 10:55 AM Kiley Sok  wrote:
>>>
 The Apache Beam team is pleased to announce the release of version
 2.41.0.

 Apache Beam is an open source unified programming model to define and
 execute data processing pipelines, including ETL, batch and stream
 (continuous) processing. See https://beam.apache.org

 You can download the release here:

 https://beam.apache.org/get-started/downloads/

 This release includes bug fixes, features, and improvements detailed on
 the Beam blog: https://beam.apache.org/blog/beam-2.41.0/

 Thanks to everyone who contributed to this release, and we hope you
 enjoy using Beam 2.41.0.

 -- Kiley, on behalf of The Apache Beam team

>>>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-08 Thread Byron Ellis via dev
I like that idea, sort of like Kafka’s Schema Service but for transforms?
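
To make that concrete, and purely as an illustration rather than the
proposed API, the thing I like is that a transform's construction surface
boils down to a Beam Schema plus a configuration Row, which any SDK (or a
service in front of one) can produce. Roughly:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Hypothetical provider shape sketched from the proposal; all names here are
// illustrative, not the real Beam interfaces.
interface RegisteredSchemaTransform {
  // Unique URN the expansion service can list during discovery.
  String identifier();

  // The configuration the transform accepts, described as a Beam Schema.
  Schema configurationSchema();

  // Build the transform from a configuration Row produced by any SDK.
  PTransform<PCollection<Row>, PCollection<Row>> buildTransform(Row configuration);
}

// A caller (or a generated stub) then only needs the schema and a Row, e.g.:
//
//   Schema config = Schema.builder()
//       .addStringField("topic")
//       .addInt32Field("partitions")
//       .build();
//   Row configRow = Row.withSchema(config)
//       .withFieldValue("topic", "events")
//       .withFieldValue("partitions", 4)
//       .build();

The registry side is then mostly bookkeeping: map identifiers to providers,
hand back the configuration schema on a discovery call, and invoke the
builder on expansion.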

On Mon, Aug 8, 2022 at 2:45 PM Robert Bradshaw via dev 
wrote:

> This is a great idea. I would like to approach this from the
> perspective of making it easy to provide a catalog of well-defined
> transforms for use in expansion services from typical SDKs and also
> elsewhere (e.g. for documentation purposes, GUIs, etc.) Ideally
> everything about what a transform is (its config, documentation,
> expectations on inputs, etc.) can be specified programmatically in a
> way that's much easier to both author and consume than it is now.
>
> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev
>  wrote:
> >
> > Hi All,
> >
> > I believe we can make the multi-language pipelines offering [1] much
> easier to use by updating the expansion service to be fully aware of
> SchemaTransforms. Additionally this will make it easy to
> register/discover/use transforms defined in one SDK from all other SDKs.
> Specifically we could add the following features.
> >
> > Expansion service can be used to easily initialize and expand transforms
> without need for additional code.
> > Expansion service can be used to easily discover already registered
> transforms.
> > Pipeline SDKs can generate user-friendly stub-APIs based on transforms
> registered with an expansion service, eliminating the need to develop
> language-specific wrappers.
> >
> > Please see here for my proposal:
> https://s.apache.org/easy-multi-language
> >
> > Lemme know if you have any comments/questions/suggestions :)
> >
> > Thanks,
> > Cham
> >
> > [1]
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
> >
>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Byron Ellis via dev
Indeed, there's nothing stopping you from doing codegen where it's useful
but I think it's probably easier to implement codegen from dynamic than it
is to go the other way around (Avro vs Proto)
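
(For what it's worth, the two styles end up pretty close in Beam's schema
world anyway. A rough sketch, with made-up config field names:)

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.values.Row;

public class ConfigStyles {
  // Dynamic: read fields by name off a Row whose Schema was discovered at
  // pipeline-construction time; works against whatever the service describes.
  static String topicFrom(Row configRow) {
    return configRow.getString("topic");
  }

  // Static / codegen-ish: a plain POJO that Beam maps onto the same schema,
  // which is roughly what a generated stub would hand you instead of a Row.
  @DefaultSchema(JavaFieldSchema.class)
  public static class KafkaReadConfig {
    public String topic;
    public int partitions;
  }
}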

On Fri, Aug 5, 2022 at 1:15 PM Chamikara Jayalath 
wrote:

>
>
> On Fri, Aug 5, 2022 at 12:00 PM Byron Ellis  wrote:
>
>> I think there are some practical advantages to having the ability to
>> support a dynamic version---at previous places where I've worked having
>> Kafka's Schema Service was incredibly useful for data processing (it was a
>> Java/Scala shop and we mostly used a "decode to POJO" approach rather than
>> codegen.)
>>
>
> Yeah, that's my thought as well. I think it will be pretty useful during
> development/testing cycles, especially if we push code generation to the
> release time. Also, it will be useful for trying out any SchemaTransforms
> developed/released by third parties where generated stubs might not be
> available.
>
>
>>
>> On Fri, Aug 5, 2022 at 10:08 AM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>>
>>>
>>>
>>> On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette 
>>> wrote:
>>>
 Thanks Cham! I really like the proposal, I left a few comments. I also
 had one higher-level point I wanted to elevate here:

 > Pipeline SDKs can generate user-friendly stub-APIs based on
 transforms registered with an expansion service, eliminating the need to
 develop language-specific wrappers.
 This would be great! I think one point to consider is whether we can do
 this statically. We could package up these stubs with releases and include
 them in API docs for each language, making them much more discoverable.
 That could be an extension on top of your proposal (e.g. as part of its
 build, each SDK spins up other known expansion services and generates code
 based on the discovery responses), but maybe it could be cleaner if we
 don't really need the dynamic version?

>>>
>>> So my proposal suggested two solutions for wrappers.
>>> * A higher level (dynamic) API (SchemaAwareExternalTransform) that can
>>> be used to discover/initialize/use any SchemaTransform.
>>> * Developing tooling to generate stubs for each language. This is
>>> possible since SchemaTransform gives a cleaner way to define/interpret the
>>> construction API of a transform.
>>>
>>> I think both can be useful. For example, the prior might be useful to
>>> quickly test/try out new SchemaTransforms without going through code
>>> generation.
>>>
>>> Also, I agree with you that it might be good to generate such stubs (and
>>> corresponding docs) during release time instead of generating and
>>> committing stubs to the repo.
>>>
>>> Thanks,
>>> Cham
>>>
>>>

 Brian


 On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
 dev@beam.apache.org> wrote:

> Hi All,
>
> I believe we can make the multi-language pipelines offering [1] much
> easier to use by updating the expansion service to be fully aware of
> SchemaTransforms. Additionally this will make it easy to
> register/discover/use transforms defined in one SDK from all other SDKs.
> Specifically we could add the following features.
>
>- Expansion service can be used to easily initialize and expand
>transforms without need for additional code.
>- Expansion service can be used to easily discover already
>registered transforms.
>- Pipeline SDKs can generate user-friendly stub-APIs based on
>transforms registered with an expansion service, eliminating the need 
> to
>develop language-specific wrappers.
>
> Please see here for my proposal:
> https://s.apache.org/easy-multi-language
>
> Lemme know if you have any comments/questions/suggestions :)
>
> Thanks,
> Cham
>
> [1]
> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>
>


Re: Easy Multi-language via a SchemaTransform-aware Expansion Service

2022-08-05 Thread Byron Ellis via dev
I think there are some practical advantages to having the ability to
support a dynamic version---at previous places where I've worked having
Kafka's Schema Service was incredibly useful for data processing (it was a
Java/Scala shop and we mostly used a "decode to POJO" approach rather than
codegen.)

On Fri, Aug 5, 2022 at 10:08 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

>
>
> On Fri, Aug 5, 2022 at 9:44 AM Brian Hulette  wrote:
>
>> Thanks Cham! I really like the proposal, I left a few comments. I also
>> had one higher-level point I wanted to elevate here:
>>
>> > Pipeline SDKs can generate user-friendly stub-APIs based on transforms
>> registered with an expansion service, eliminating the need to develop
>> language-specific wrappers.
>> This would be great! I think one point to consider is whether we can do
>> this statically. We could package up these stubs with releases and include
>> them in API docs for each language, making them much more discoverable.
>> That could be an extension on top of your proposal (e.g. as part of its
>> build, each SDK spins up other known expansion services and generates code
>> based on the discovery responses), but maybe it could be cleaner if we
>> don't really need the dynamic version?
>>
>
> So my proposal suggested two solutions for wrappers.
> * A higher level (dynamic) API (SchemaAwareExternalTransform) that can be
> used to discover/initialize/use any SchemaTransform.
> * Developing tooling to generate stubs for each language. This is possible
> since SchemaTransform gives a cleaner way to define/interpret the
> construction API of a transform.
>
> I think both can be useful. For example, the prior might be useful to
> quickly test/try out new SchemaTransforms without going through code
> generation.
>
> Also, I agree with you that it might be good to generate such stubs (and
> corresponding docs) during release time instead of generating and
> committing stubs to the repo.
>
> Thanks,
> Cham
>
>
>>
>> Brian
>>
>>
>> On Thu, Aug 4, 2022 at 6:51 PM Chamikara Jayalath via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi All,
>>>
>>> I believe we can make the multi-language pipelines offering [1] much
>>> easier to use by updating the expansion service to be fully aware of
>>> SchemaTransforms. Additionally this will make it easy to
>>> register/discover/use transforms defined in one SDK from all other SDKs.
>>> Specifically we could add the following features.
>>>
>>>- Expansion service can be used to easily initialize and expand
>>>transforms without need for additional code.
>>>- Expansion service can be used to easily discover already
>>>registered transforms.
>>>- Pipeline SDKs can generate user-friendly stub-APIs based on
>>>transforms registered with an expansion service, eliminating the need to
>>>develop language-specific wrappers.
>>>
>>> Please see here for my proposal:
>>> https://s.apache.org/easy-multi-language
>>>
>>> Lemme know if you have any comments/questions/suggestions :)
>>>
>>> Thanks,
>>> Cham
>>>
>>> [1]
>>> https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines
>>>
>>>


Re: [ANNOUNCE] New committer: John Casey

2022-07-29 Thread Byron Ellis via dev
Congratulations John!

On Fri, Jul 29, 2022 at 1:09 PM Danny McCormick via dev 
wrote:

> Congrats John and welcome! This is well deserved!
>
> On Fri, Jul 29, 2022 at 4:07 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> John Casey (johnca...@apache.org)
>>
>> John started contributing to Beam in late 2021. John has quickly become
>> our resident expert on KafkaIO - identifying bugs, making enhancements,
>> helping users - in addition to a variety of other contributions.
>>
>> Considering his contributions to the project over this timeframe, the
>> Beam PMC trusts John with the responsibilities of a Beam committer. [1]
>>
>> Thank you John! And we are looking to see more of your contributions!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: Extending 2.41.0 Java snapshot TTL

2022-07-25 Thread Byron Ellis via dev
er/1.12.0-SNAPSHOT/
>>> [5]
>>> https://repository.apache.org/content/groups/snapshots/org/apache/flink/flink-architecture-tests-base/1.15-SNAPSHOT/
>>> [6]
>>> https://repository.apache.org/content/groups/snapshots/org/apache/iotdb/client-cpp-example/0.14.0-SNAPSHOT/
>>>
>>> On Thu, Jul 21, 2022 at 8:54 PM Evan Galpin  wrote:
>>>
>>>> Admittedly this is potentially self-serving, but I feel there could be
>>>> mutual benefit.
>>>>
>>>> I have a similar situation where I want to use pre-release version of
>>>> beam-sdks-java-io-google-cloud-platform. Though I’ve been having
>>>> trouble doing so, a possible alternative solution to using the nightly
>>>> snapshots might be building beam-sdks-java-io-google-cloud-platform
>>>> from source and including the resulting jar as part of the pipeline
>>>> deployment. I’ve successfully done this for direct runner, but not Dataflow
>>>> runner.
>>>>
>>>> Perhaps some others on the thread might be able to shed light on this
>>>> technique (only if applicable to solving the original problem, as I don’t
>>>> intend to thread-hijack).
>>>>
>>>> - Evan
>>>>
>>>> On Thu, Jul 21, 2022 at 19:45 Byron Ellis via dev 
>>>> wrote:
>>>>
>>>>> I think you could change the TTL on the Jenkins side (That sound right
>>>>> to you Danny?) but I'm not sure we could preserve a specific snapshot
>>>>> without keeping all of them...
>>>>>
>>>>> On Thu, Jul 21, 2022 at 4:16 PM Ahmet Altay  wrote:
>>>>>
>>>>>> Thank you for the email Daniel.
>>>>>>
>>>>>> Adding people who could help: @Kenneth Knowles  @Danny
>>>>>> McCormick  @Chamikara Jayalath
>>>>>>  @John Casey  @Byron
>>>>>> Ellis 
>>>>>>
>>>>>> On Thu, Jul 21, 2022, 4:14 PM Daniel Thevessen via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> The Java Firestore connector had a bug
>>>>>>> <https://github.com/apache/beam/issues/22089> recently that needed
>>>>>>> to be fixed. Until the fix can be released as part of 2.41.0, we need 
>>>>>>> one
>>>>>>> of the daily snapshot builds
>>>>>>> <https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-sdks-java-io-google-cloud-platform/2.41.0-SNAPSHOT/>
>>>>>>> as a safe version to use. This is the one from Jul 15
>>>>>>> (2.41.0-20220715.201105-31), which we have checked to be working 
>>>>>>> correctly
>>>>>>> and some have already switched to.
>>>>>>> Unfortunately it looks like these builds get cleared out after a
>>>>>>> while for storage reasons. Would it be possible to extend the TTL on 
>>>>>>> just
>>>>>>> that build, at least until 2.41.0 is released? I'm guessing this would 
>>>>>>> just
>>>>>>> be a settings change for whoever owns the Snapshots repository.
>>>>>>> The change is in beam-sdks-java-io-google-cloud-platform, but I'm
>>>>>>> fairly certain the other Java packages need to be kept as well for
>>>>>>> compatibility.
>>>>>>>
>>>>>>> This is relatively urgent, it looks like the TTL might be weekly so
>>>>>>> that build will be deleted on Saturday.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Daniel Thevessen
>>>>>>>
>>>>>>>
>>>>>>>
>>
>> --
>>
>> Daniel Thevessen  |  Site Reliability Engineer
>>
>> Firestore SRE
>> San Francisco, CA, USA |  +1 (415) 373-7762 <(415)%20373-7762>
>>
>


Re: Extending 2.41.0 Java snapshot TTL

2022-07-21 Thread Byron Ellis via dev
I think you could change the TTL on the Jenkins side (That sound right to
you Danny?) but I'm not sure we could preserve a specific snapshot without
keeping all of them...

On Thu, Jul 21, 2022 at 4:16 PM Ahmet Altay  wrote:

> Thank you for the email Daniel.
>
> Adding people who could help: @Kenneth Knowles  @Danny
> McCormick  @Chamikara Jayalath
>  @John Casey  @Byron Ellis
> 
>
> On Thu, Jul 21, 2022, 4:14 PM Daniel Thevessen via dev <
> dev@beam.apache.org> wrote:
>
>> Hi all,
>>
>> The Java Firestore connector had a bug
>>  recently that needed to be
>> fixed. Until the fix can be released as part of 2.41.0, we need one of the 
>> daily
>> snapshot builds
>> 
>> as a safe version to use. This is the one from Jul 15
>> (2.41.0-20220715.201105-31), which we have checked to be working correctly
>> and some have already switched to.
>> Unfortunately it looks like these builds get cleared out after a while
>> for storage reasons. Would it be possible to extend the TTL on just that
>> build, at least until 2.41.0 is released? I'm guessing this would just be a
>> settings change for whoever owns the Snapshots repository.
>> The change is in beam-sdks-java-io-google-cloud-platform, but I'm fairly
>> certain the other Java packages need to be kept as well for compatibility.
>>
>> This is relatively urgent, it looks like the TTL might be weekly so that
>> build will be deleted on Saturday.
>>
>> Thanks,
>> Daniel Thevessen
>>
>>
>>