[VOTE][BIP-1] Beam Schema Options

2020-02-19 Thread Alex Van Boxel
Hi all,

let's do a vote on the very first Beam Improvement Proposal. If you have a
-1 or -1 (binding), please add your concern to the open issues section of
the wiki. Thanks.

This is the proposal:
https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options

Can I have your votes?

 _/
_/ Alex Van Boxel


Re: Cross-language pipelines status

2020-02-19 Thread Chamikara Jayalath
To clarify my previous point, I think the transform
KafkaIO.Read.TypedWithoutMetadata [1], which produces a KV (for
example KV<byte[], byte[]> if we use ByteArraySerializer for keys and
values), should work in the current form if we don't have a runner-specific
override for the source (hence allowing the source and the subsequent DoFn
to fuse).

I think the issues Chad is running into are due to the Flink runner having a
transform override for the source part, preventing fusion between the source
and the subsequent DoFn and hence having to serialize data across the Fn API
boundary. An SDF-based Kafka source that can work on the Fn API without
runner overrides will not run into this issue.

And we already have a Python wrapper [2] for the
KafkaIO.Read.TypedWithoutMetadata transform, which should work out of the
box once we have the automatic SDF converter [3], as long as the runner
supports SDF.

We might still want to fix the coder issue so that the KafkaIO.Read
transform can be used directly as a cross-language transform instead
of KafkaIO.Read.TypedWithoutMetadata.

Thanks,
Cham

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1105
[2]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py#L60
[3] https://github.com/apache/beam/pull/10897


Re: Cross-language pipelines status

2020-02-19 Thread Robert Bradshaw
Ah, yes, registering a RowCoder seems like a fine solution here.
(Either that or have a wrapping PTransform that explicitly converts to
Rows or similar.)



Re: Cross-language pipelines status

2020-02-19 Thread Chad Dombrova
The Java deps are only half of the problem. The other half is that PubsubIO
and KafkaIO are using classes that do not have a Python equivalent and thus
no universal coder. The solution discussed in the issue I linked above was
to use row coder registries in Java to convert from these types to
rows/schemas.

Any thoughts on that?

-chad




Re: [PROPOSAL] Preparing for Beam 2.20.0 release

2020-02-19 Thread Ahmet Altay
Curious: was there a resolution on BEAM-9252? Would it be a release blocker?

On Fri, Feb 14, 2020 at 12:42 AM Ismaël Mejía  wrote:

> Thanks Rui for volunteering and for keeping the release pace!
>
> Since we are discussing the next release, I would like to highlight that
> nobody
> apparently is working on this blocker issue:
>
> BEAM-9252 Problem shading Beam pipeline with Beam 2.20.0-SNAPSHOT
> https://issues.apache.org/jira/browse/BEAM-9252
>
> This is a regression introduced by the move to vendored gRPC 1.26.0, and it
> will probably require an extra vendored gRPC release, so it's better to give
> it some priority.
>
>
> On Wed, Feb 12, 2020 at 6:48 PM Ahmet Altay  wrote:
>
>> +1. Thank you.
>>
>> On Tue, Feb 11, 2020 at 11:01 PM Rui Wang  wrote:
>>
>>> Hi all,
>>>
> >>> The next (2.20.0) release branch cut is scheduled for 02/26, according
> >>> to the calendar.
>>> I would like to volunteer myself to do this release.
>>> The plan is to cut the branch on that date, and cherrypick release-blocking
>>> fixes afterwards if any.
>>>
>>> Any unresolved release blocking JIRA issues for 2.20.0 should have their
>>> "Fix Version/s" marked as "2.20.0".
>>>
>>> Any comments or objections?
>>>
>>>
>>> -Rui
>>>
>>


Re: Cross-language pipelines status

2020-02-19 Thread Robert Bradshaw
Hopefully this will be resolved by
https://issues.apache.org/jira/browse/BEAM-9229



Re: Cross-language pipelines status

2020-02-19 Thread Chamikara Jayalath
On Wed, Feb 19, 2020 at 5:52 PM Chad Dombrova  wrote:

> We are using external transforms to get access to PubSubIO within python.
> It works well, but there is one major issue remaining to fix:  we have to
> build a custom beam with a hack to add the PubSubIO java deps and fix up
> the coders.  This affects KafkaIO as well.  There's an issue here:
> https://issues.apache.org/jira/browse/BEAM-7870
>

Seems like this is due to the Flink runner translating sources to native
Flink sources in the portable path; it should not be an issue once we have
SDF (coming very soon) and Flink starts supporting it.

Thanks,
Cham




Re: Cross-language pipelines status

2020-02-19 Thread Chad Dombrova
We are using external transforms to get access to PubSubIO within Python.
It works well, but there is one major issue remaining to fix: we have to
build a custom Beam with a hack to add the PubSubIO Java deps and fix up
the coders. This affects KafkaIO as well. There's an issue here:
https://issues.apache.org/jira/browse/BEAM-7870

I consider this to be the most pressing problem with external transforms
right now.

-chad



On Wed, Feb 12, 2020 at 9:28 AM Chamikara Jayalath 
wrote:

>
>
> On Wed, Feb 12, 2020 at 8:10 AM Alexey Romanenko 
> wrote:
>
>>
>> AFAIK, there's no official guide for cross-language pipelines. But there
>>> are examples and test cases you can use as reference such as:
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py
>>>
>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOExternalTest.java
>>>
>>> https://github.com/apache/beam/blob/master/runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/ValidateRunnerXlangTest.java
>>>
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_test.py
>>>
>>
>> I'm trying to work with tech writers to add more documentation related to
>> cross-language (in a few months). But any help related to documenting what
>> we have now is greatly appreciated.
>>
>>
>> That would be great since now the information is a bit scattered over
>> different places. I’d be happy to help with any examples and their testing
>> that I hope I’ll have after a while.
>>
>
> Great.
>
>
>>
>> The runner and SDK support is in a working state, I would say, but not many
>>> IOs expose their cross-language interface yet (you can easily write
>>> cross-language configuration for any Python transform yourself, though).
>>>
>>
>> Should mention here the test suites for portable Flink and Spark Heejong
>> added recently :)
>>
>>
>> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/
>>
>> https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/
>>
>>
>> Nice! Looks like my question above about cross-language support in Spark
>> runner was redundant.
>>
>>
>>
>>>
>>>
 - Is the information here
 https://beam.apache.org/roadmap/connectors-multi-sdk/ up-to-date? Are
 there any other entry points you can recommend?

>>>
>>> I think it's up-to-date.
>>>
>>
>> Mostly up to date.  Testing status is more complete now and we are
>> actively working on getting the dependencies story correct and adding
>> support for DataflowRunner.
>>
>>
>> Are there any "umbrella" JIRAs regarding cross-language support that I
>> can track?
>>
>
> I don't think we have an umbrella JIRA currently. I can create one and
> mention it in the roadmap.
>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Chamikara Jayalath
Congrats Alex!

On Wed, Feb 19, 2020 at 7:21 AM Ryan Skraba  wrote:

> Congratulations Alex!
>
> On Wed, Feb 19, 2020 at 9:52 AM Katarzyna Kucharczyk <
> ka.kucharc...@gmail.com> wrote:
>
>> Great news! Congratulations, Alex! 
>>
>> On Wed, Feb 19, 2020 at 9:14 AM Reza Rokni  wrote:
>>
>>> Fantastic news! Congratulations :-)
>>>
>>> On Wed, 19 Feb 2020 at 07:54, jincheng sun 
>>> wrote:
>>>
 Congratulations!
 Best,
 Jincheng


 Robin Qiu wrote on Wed, Feb 19, 2020, at 05:52:

> Congratulations, Alex!
>
> On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congratulations!
>>
>> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
>> wrote:
>>
>>> Thank you everyone!
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Tue, Feb 18, 2020 at 7:05 PM  wrote:
>>>
 Congrats Alex!
 Jan


 On 18 Feb 2020 at 18:46, Thomas Weise wrote:

 Congratulations!


 On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
 wrote:

 Congrats Alex! Well done!

 On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
 wrote:

 Congratulations!

 On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
 wrote:

 Congratulations Alex! Well deserved!

 On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
 wrote:

 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming
 a new committer: Alex Van Boxel

 Alex has contributed to Beam in many ways - as an organizer for
 Beam Summit, and meetups - and also with the Protobuf extensions for
 schemas.

 In consideration of his contributions, the Beam PMC trusts him with
 the responsibilities of a Beam committer[1].

 Thanks for your contributions Alex!

 Pablo, on behalf of the Apache Beam PMC.

 [1]
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


 --

 Best,
 Jincheng
 -
 Twitter: https://twitter.com/sunjincheng121
 -

>>>


Re: Help needed on a problem statement

2020-02-19 Thread Austin Bennett
I'd disentangle Dataflow from Beam.  Beam can help you.  Dataflow might be
useful too, though, yes, for batch jobs the spin-up cost might be a lot for
small file sizes.

There are potentially lots of ways to do this.

An idea (that I haven't seen used anywhere): have a streaming Beam
pipeline (which can autoscale if needed) persistently running. Have another
process that takes each record from the files as they are dropped and puts
it in a message queue for Beam to process (you'd have both the data
'record' and metadata about the source file).
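To make the record-plus-metadata idea concrete, here is a stdlib-only sketch (the message shape and field names are made up) of what each queued message could carry:

```python
import csv
import io
import json

def records_with_metadata(filename, contents):
    """Yield (metadata, record) pairs so the downstream pipeline can route
    each record to an output derived from its source file."""
    if filename.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(contents))
    elif filename.endswith(".json"):
        rows = json.loads(contents)  # assumes a JSON array of objects
    else:
        raise ValueError("unsupported file type: " + filename)
    for row in rows:
        yield {"source_file": filename}, dict(row)
```

Keying the stream by `source_file` downstream preserves the per-file processing boundary.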

* The Dataflow spin-up is heavy: I'm wondering about the suitability of
running something like this using the Direct runner (or even single-node
Flink) on Cloud Run (with an event notification coming from GCS to kick off
the job): https://cloud.google.com/run/quotas <-- looks like it can handle
up to 2 GB of memory. If not, have some logic for when to launch Dataflow
vs. when to do a lighter-weight Beam job.
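That launch logic could be as simple as a size threshold; a toy sketch (the 512 MB cutoff is an arbitrary assumption, not a recommendation):

```python
def choose_runner(file_size_bytes, threshold_bytes=512 * 1024 * 1024):
    """Pick a heavier or lighter execution path based on input size."""
    if file_size_bytes >= threshold_bytes:
        return "DataflowRunner"  # big file: worth the spin-up cost
    return "DirectRunner"        # small file: run it in-process
```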

I have not faced your problem myself; I'm merely making up what might be
interesting solutions :-)  Good luck!





Help needed on a problem statement

2020-02-19 Thread subham agarwal
Hi Team,

I was working on a problem statement and came across Beam. Being very new
to Beam, I am not sure if my use case can be solved with it. Can you please
help me here?

Use case:

I have a list of CSV and JSON files arriving every minute in Google Cloud
Storage. Files can range from KBs to GBs. I need to parse each file and
process its records independently: file 1's records should be parsed,
enriched, and stored in one output location, while file 2's records go to a
different location.

I started by launching a separate Dataflow job for each file, but that is
overkill for small files. So I thought I could batch files every 15 minutes
and process them together in a single job, while still maintaining the
per-file processing boundary described above.

Can anyone please tell me whether there is a solution to my problem, or
whether Beam is not meant for this problem statement?

Thanks in advance.

Looking forward to a reply.

Regards,
Subham Agarwal


Re: DynamicMessage (proto) schema support PR review please

2020-02-19 Thread Reuven Lax
I will review again, hopefully soon.

On Wed, Feb 19, 2020 at 1:44 AM Alex Van Boxel  wrote:

> Hi all,
>
> can someone give me an LGTM for the DynamicMessage protobuf schema support?
> We've been testing this internally on Dataflow and the Direct runner, and it
> works.
>
> https://github.com/apache/beam/pull/10502
>
> It uses the same ProtoDomain to get descriptors serialized in the graph
> (as it's needed for going back from Row to Proto).
>
> I'd like to make it into 2.20. Don't mind the failed test; it's again a
> flaky test. Thanks.
>
>  _/
> _/ Alex Van Boxel
>


Re: Custom 2.20 failing on Dataflow: what am I doing wrong?

2020-02-19 Thread Alex Van Boxel
That's a great idea. I'll do that, as some changes needed to be made to the
Gradle files as well.

On Wed, Feb 19, 2020, 17:52 Ismaël Mejía  wrote:

> Alex/Gleb, can one of you please add the detailed instructions that
> worked for you in some section of the cwiki?
> I have the impression that this will benefit us all at some point.
>
> Thanks,
>
>
>
> On Tue, Feb 18, 2020 at 9:46 AM Alex Van Boxel  wrote:
>
>> Thanks everyone. This really helped a lot. I used Gleb's tip to make it
>> work. Successfully validated my Pull Requests against Dataflow!
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Mon, Feb 17, 2020 at 11:55 PM Brian Hulette 
>> wrote:
>>
>>> I think if you update past [1] this will go away. We had to build a new
>>> worker to use with builds on master after [2]. You should be fine running
>>> from master as long as you aren't using a commit between those two (merge
>>> commits are 6818560 and bde3031, respectively). Setting the Dataflow worker
>>> jar would work too.
>>>
>>> [1] https://github.com/apache/beam/pull/10861
>>> [2] https://github.com/apache/beam/pull/10790
>>>
>>> On Mon, Feb 17, 2020 at 2:14 AM Gleb Kanterov  wrote:
>>>
 You need to pass a custom Dataflow worker jar. One way of doing
 that is adding it as a dependency and using the following code snippet:

 opts.setDataflowWorkerJar(
   BatchDataflowWorker.class
   .getProtectionDomain()
   .getCodeSource()
   .getLocation()
   .toString());
 opts.setWorkerHarnessContainerImage("");

 Coming with the disclaimer that it isn't for production :)

 On Mon, Feb 17, 2020 at 8:34 AM Alex Van Boxel 
 wrote:

> Yes, running it manually with the normal parameters as I do for
> production Dataflow. I'm probably a bit ignorant on that, and I
> probably need to provide my own worker.
>
> Thanks for the hint... I'll dive into that.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Feb 17, 2020 at 8:16 AM Reuven Lax  wrote:
>
>> Are you running things manually? This probably means you are using an
>> out-of-date Dataflow worker. I believe that all tests on Jenkins will 
>> build
>> the Dataflow worker from head to prevent exactly this problem.
>>
>> On Sun, Feb 16, 2020 at 11:10 PM Alex Van Boxel 
>> wrote:
>>
>>> Digging further in the traces, it seems like a result of changes to
>>> the model:
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> org.apache.beam.model.pipeline.v1.StandardWindowFns$SessionsPayload$Enum
>>>
>>> I see changes by Lukasz Cwik. Will this be a problem for the release?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>>
>>> On Sun, Feb 16, 2020 at 12:11 PM Alex Van Boxel 
>>> wrote:
>>>
 Hey,

I'm testing my own PRs against Dataflow, something I've done in
 the past with success but which now fails. I get this error:

 java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation

 Am I doing something wrong?

  _/
 _/ Alex Van Boxel

>>>


Re: Custom 2.20 failing on Dataflow: what am I doing wrong?

2020-02-19 Thread Ismaël Mejía
Alex/Gleb, can one of you please add the detailed instructions that
worked for you to a section of the cwiki?
I have the impression that this will benefit us all at some point.

Thanks,



On Tue, Feb 18, 2020 at 9:46 AM Alex Van Boxel  wrote:

> Thanks everyone. This really helped a lot. I used Gleb's tip to make it
> work. Successfully validated my Pull Requests against Dataflow!
>
>  _/
> _/ Alex Van Boxel
>
>
> On Mon, Feb 17, 2020 at 11:55 PM Brian Hulette 
> wrote:
>
>> I think if you update past [1] this will go away. We had to build a new
>> worker to use with builds on master after [2]. You should be fine running
>> from master as long as you aren't using a commit between those two (merge
>> commits are 6818560 and bde3031, respectively). Setting the Dataflow worker
>> jar would work too.
>>
>> [1] https://github.com/apache/beam/pull/10861
>> [2] https://github.com/apache/beam/pull/10790
>>
>> On Mon, Feb 17, 2020 at 2:14 AM Gleb Kanterov  wrote:
>>
>>> You need to pass a custom Dataflow worker jar. One of the ways of doing
>>> that is adding it as a dependency and using the following code snippet:
>>>
>>> opts.setDataflowWorkerJar(
>>>   BatchDataflowWorker.class
>>>   .getProtectionDomain()
>>>   .getCodeSource()
>>>   .getLocation()
>>>   .toString());
>>> opts.setWorkerHarnessContainerImage("");
>>>
>>> Coming with the disclaimer that it isn't for production :)
>>>
>>> On Mon, Feb 17, 2020 at 8:34 AM Alex Van Boxel  wrote:
>>>
 Yes, running it manually with the normal parameters as I do for
 production Dataflow. I'm probably a bit ignorant on that, and I
 probably need to provide my own worker.

 Thanks for the hint... I'll dive into that.

  _/
 _/ Alex Van Boxel


 On Mon, Feb 17, 2020 at 8:16 AM Reuven Lax  wrote:

> Are you running things manually? This probably means you are using an
> out-of-date Dataflow worker. I believe that all tests on Jenkins will 
> build
> the Dataflow worker from head to prevent exactly this problem.
>
> On Sun, Feb 16, 2020 at 11:10 PM Alex Van Boxel 
> wrote:
>
>> Digging further in the traces, it seems like a result of changes to
>> the model:
>>
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.beam.model.pipeline.v1.StandardWindowFns$SessionsPayload$Enum
>>
>> I see changes by Lukasz Cwik. Will this be a problem for the release?
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Sun, Feb 16, 2020 at 12:11 PM Alex Van Boxel 
>> wrote:
>>
>>> Hey,
>>>
>>> I'm testing my own PRs against Dataflow, something I've done in the
>>> past with success but which now fails. I get this error:
>>>
>>> java.lang.NoClassDefFoundError: Could not initialize class
>>> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.construction.WindowingStrategyTranslation
>>>
>>> Am I doing something wrong?
>>>
>>>  _/
>>> _/ Alex Van Boxel
>>>
>>

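Gleb's snippet above resolves the location of the jar that a class was loaded from via its ProtectionDomain. A minimal stdlib-only sketch of the same pattern follows; note that `JarLocator` is a stand-in here for `BatchDataflowWorker`, and the null-handling for bootstrap classes is an addition not present in the original snippet:

```java
import java.security.CodeSource;
import java.security.ProtectionDomain;

public class JarLocator {
    // Resolve the jar (or class directory) a class was loaded from,
    // the same reflection chain used in the snippet above.
    static String locationOf(Class<?> cls) {
        ProtectionDomain pd = cls.getProtectionDomain();
        CodeSource cs = pd.getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String)
        // may have no CodeSource, so guard against null.
        return cs == null ? "(no code source)" : cs.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println(locationOf(JarLocator.class));
    }
}
```

In the Dataflow case the resolved string is then passed to `setDataflowWorkerJar(...)`, so the worker built from head is shipped instead of the released one.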

Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Ryan Skraba
Congratulations Alex!

On Wed, Feb 19, 2020 at 9:52 AM Katarzyna Kucharczyk <
ka.kucharc...@gmail.com> wrote:

> Great news! Congratulations, Alex! 
>
> On Wed, Feb 19, 2020 at 9:14 AM Reza Rokni  wrote:
>
>> Fantastic news! Congratulations :-)
>>
>> On Wed, 19 Feb 2020 at 07:54, jincheng sun 
>> wrote:
>>
>>> Congratulations!
>>> Best,
>>> Jincheng
>>>
>>>
>>> Robin Qiu 于2020年2月19日 周三05:52写道:
>>>
 Congratulations, Alex!

 On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congratulations!
>
> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
> wrote:
>
>> Thank you everyone!
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Tue, Feb 18, 2020 at 7:05 PM  wrote:
>>
>>> Congrats Alex!
>>> Jan
>>>
>>>
>>> Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :
>>>
>>> Congratulations!
>>>
>>>
>>> On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
>>> wrote:
>>>
>>> Congrats Alex! Well done!
>>>
>>> On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
>>> wrote:
>>>
>>> Congratulations!
>>>
>>> On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
>>> wrote:
>>>
>>> Congratulations Alex! Well deserved!
>>>
>>> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming
>>> a new committer: Alex Van Boxel
>>>
>>> Alex has contributed to Beam in many ways - as an organizer for Beam
>>> Summit, and meetups - and also with the Protobuf extensions for schemas.
>>>
>>> In consideration of his contributions, the Beam PMC trusts him with
>>> the responsibilities of a Beam committer[1].
>>>
>>> Thanks for your contributions Alex!
>>>
>>> Pablo, on behalf of the Apache Beam PMC.
>>>
>>> [1]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>
>>> --
>>>
>>> Best,
>>> Jincheng
>>> -
>>> Twitter: https://twitter.com/sunjincheng121
>>> -
>>>
>>


Re: [DISCUSS] Drop support for Flink 1.7

2020-02-19 Thread David Morávek
+1 for dropping 1.7, once we have 1.10 support ready

D.

On Tue, Feb 18, 2020 at 7:01 PM  wrote:

> Hi Ismael,
> yes, sure. The proposal would be to have a snapshot dependency in the
> feature branch. The snapshot must be changed to a release before merging
> to master.
> Jan
>
> Dne 18. 2. 2020 17:55 napsal uživatel Ismaël Mejía :
>
> Just to be sure, you mean Flink 1.11.0-SNAPSHOT ONLY in the next branch
> dependency?
> We should not have any SNAPSHOT dependency from other projects in Beam.
>
> On Tue, Feb 18, 2020 at 5:34 PM  wrote:
>
> Hi Jincheng,
> I think there should be a "process" for this. I would propose to:
>  a) create a new branch with support for the new (snapshot) Flink - currently
> that would mean Flink 1.11
>  b) as part of this branch, drop support for all versions up to N - 3
> I think that dropping old versions and adding the new version should be
> atomic, otherwise we risk releasing a Beam version with fewer than three
> supported Flink versions.
> I'd suggest starting with the 1.10 branch support, including the drop of 1.7
> in that branch. Once 1.10 gets merged, we should create 1.11 with a snapshot
> dependency to be able to keep up with the release cadence of Flink.
> WDYT?
>  Jan
>
> Dne 18. 2. 2020 15:34 napsal uživatel jincheng sun <
> sunjincheng...@gmail.com>:
>
> Hi folks,
>
> Apache Flink 1.10 has completed its release announcement [1]. Then we would
> like to add a Flink 1.10 build target and make the Flink runner compatible
> with Flink 1.10 [2]. So, I would suggest maintaining at most three versions
> of the Flink runner in the Apache Beam community, according to the update
> policy of Apache Flink releases [3]; i.e., I think it's better to maintain
> the three versions 1.8/1.9/1.10 after adding the Flink 1.10 build target to
> the Flink runner.
>
> The current existence of Flink runner 1.7 will affect the upgrade of Flink
> runners 1.8.x and 1.9.x because the Flink 1.7 code is too old; more detail
> can be found in [4]. So, we need to drop support for Flink runner 1.7 as
> soon as possible.
>
> This discussion is also CC'd to @User, as the change will affect our users.
> And I would appreciate it if you could review the PR [5].
>
> Welcome any feedback!
>
> Best,
> Jincheng
>
> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ANNOUNCE-Apache-Flink-1-10-0-released-td37564.html
> 
> [2] https://issues.apache.org/jira/browse/BEAM-9295
> [3] https://flink.apache.org/downloads.html#update-policy-for-old-releases
> [4] https://issues.apache.org/jira/browse/BEAM-9299
> [5] https://github.com/apache/beam/pull/10884
>
>
>
>
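Jan's proposed support window (keep the newest minor plus the two before it, drop everything up to N - 3) can be sketched as follows; the helper name and the assumption that minor versions are consecutive integers are illustrative, not part of the proposal:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FlinkSupportWindow {
    // Given the newest supported 1.x minor version N, keep the last
    // three minors (N-2, N-1, N) and drop everything older.
    static List<String> supported(int newestMinor) {
        return IntStream.rangeClosed(newestMinor - 2, newestMinor)
                .mapToObj(m -> "1." + m)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Adding the 1.10 target keeps 1.8/1.9/1.10 and drops 1.7.
        System.out.println(supported(10)); // [1.8, 1.9, 1.10]
    }
}
```

Doing the drop and the add atomically, as proposed above, guarantees a release never ships with fewer than three supported Flink versions.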


DynamicMessage (proto) schema support PR review please

2020-02-19 Thread Alex Van Boxel
Hi all,

can someone give me an LGTM for the DynamicMessage protobuf schema support?
We've been testing this internally on the Dataflow and Direct runners and it
works.

https://github.com/apache/beam/pull/10502

It uses the same ProtoDomain to get descriptors serialized in the graph (as
they're needed for going back from Row to Proto).

I'd like to make it into 2.20. Don't mind the failed test; it's again a
flaky test. Thanks.

 _/
_/ Alex Van Boxel


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Katarzyna Kucharczyk
Great news! Congratulations, Alex! 

On Wed, Feb 19, 2020 at 9:14 AM Reza Rokni  wrote:

> Fantastic news! Congratulations :-)
>
> On Wed, 19 Feb 2020 at 07:54, jincheng sun 
> wrote:
>
>> Congratulations!
>> Best,
>> Jincheng
>>
>>
>> Robin Qiu 于2020年2月19日 周三05:52写道:
>>
>>> Congratulations, Alex!
>>>
>>> On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congratulations!

 On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
 wrote:

> Thank you everyone!
>
>  _/
> _/ Alex Van Boxel
>
>
> On Tue, Feb 18, 2020 at 7:05 PM  wrote:
>
>> Congrats Alex!
>> Jan
>>
>>
>> Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :
>>
>> Congratulations!
>>
>>
>> On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
>> wrote:
>>
>> Congrats Alex! Well done!
>>
>> On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
>> wrote:
>>
>> Congratulations!
>>
>> On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
>> wrote:
>>
>> Congratulations Alex! Well deserved!
>>
>> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
>> wrote:
>>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming
>> a new committer: Alex Van Boxel
>>
>> Alex has contributed to Beam in many ways - as an organizer for Beam
>> Summit, and meetups - and also with the Protobuf extensions for schemas.
>>
>> In consideration of his contributions, the Beam PMC trusts him with
>> the responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Alex!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>
>> --
>>
>> Best,
>> Jincheng
>> -
>> Twitter: https://twitter.com/sunjincheng121
>> -
>>
>


Re: [ANNOUNCE] New committer: Alex Van Boxel

2020-02-19 Thread Reza Rokni
Fantastic news! Congratulations :-)

On Wed, 19 Feb 2020 at 07:54, jincheng sun  wrote:

> Congratulations!
> Best,
> Jincheng
>
>
> Robin Qiu 于2020年2月19日 周三05:52写道:
>
>> Congratulations, Alex!
>>
>> On Tue, Feb 18, 2020 at 1:48 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, Feb 18, 2020 at 10:38 AM Alex Van Boxel 
>>> wrote:
>>>
 Thank you everyone!

  _/
 _/ Alex Van Boxel


 On Tue, Feb 18, 2020 at 7:05 PM  wrote:

> Congrats Alex!
> Jan
>
>
> Dne 18. 2. 2020 18:46 napsal uživatel Thomas Weise :
>
> Congratulations!
>
>
> On Tue, Feb 18, 2020 at 8:33 AM Ismaël Mejía 
> wrote:
>
> Congrats Alex! Well done!
>
> On Tue, Feb 18, 2020 at 5:10 PM Gleb Kanterov 
> wrote:
>
> Congratulations!
>
> On Tue, Feb 18, 2020 at 5:02 PM Brian Hulette 
> wrote:
>
> Congratulations Alex! Well deserved!
>
> On Tue, Feb 18, 2020 at 7:49 AM Pablo Estrada 
> wrote:
>
> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming
> a new committer: Alex Van Boxel
>
> Alex has contributed to Beam in many ways - as an organizer for Beam
> Summit, and meetups - and also with the Protobuf extensions for schemas.
>
> In consideration of his contributions, the Beam PMC trusts him with
> the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Alex!
>
> Pablo, on behalf of the Apache Beam PMC.
>
> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>
> --
>
> Best,
> Jincheng
> -
> Twitter: https://twitter.com/sunjincheng121
> -
>