Re: [VOTE][BIP-1] Beam Schema Options

2020-02-28 Thread Alex Van Boxel
Thank you, everyone, for voting. Accepted by majority vote: +1 (7 votes, 3
binding), -1 (0 votes)

I added the information to the document:
https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options

Now that 2.20 is frozen, it's the ideal moment to get the PR for the
interface in; then I can focus on the different implementations:
https://github.com/apache/beam/pull/10413

 _/
_/ Alex Van Boxel


On Wed, Feb 26, 2020 at 9:25 PM Maximilian Michels  wrote:

> +1 (binding)
>
> Looks like a useful feature to have in schemas and fields. Thank you for
> the good write-up.
>
> -Max
>
> On 26.02.20 19:35, Alex Van Boxel wrote:
> > Nice, thanks for the addition of SQL/MED. I could use another PMC vote
> > to conclude the vote.
> >
> >   _/
> > _/ Alex Van Boxel
> >
> >
> > On Mon, Feb 24, 2020 at 11:25 PM Kenneth Knowles wrote:
> >
> > +1 (binding)
> >
> > Added a link to the wiki proposal for SQL/MED (Management of
> > External Data) which treats some of the same ideas.
> >
> > Kenn
> >
> > On Fri, Feb 21, 2020 at 3:02 PM Brian Hulette wrote:
> >
> > +1 (non-binding) thanks for all your work on this Alex :)
> >
> > On Fri, Feb 21, 2020 at 6:50 AM Alex Van Boxel wrote:
> >
> > +1 (non-binding)
> >
> > I assume I can vote on my own proposal :-)
> >
> >   _/
> > _/ Alex Van Boxel
> >
> >
> > On Fri, Feb 21, 2020 at 6:36 AM Jean-Baptiste Onofre <j...@nanthrax.net> wrote:
> >
> > +1 (binding)
> >
> > Very interesting. It reminds me of when we started the
> > discussion about schema support ;)
> >
> > Regards
> > JB
> >
> >> On 20 Feb 2020 at 08:36, Alex Van Boxel <a...@vanboxel.be> wrote:
> >>
> >> Hi all,
> >>
> >> Let's do a vote on the very first Beam Improvement
> >> Proposal. If you have a -1 or -1 (binding), please add
> >> your concern to the open issues section of the wiki.
> >> Thanks.
> >>
> >> This is the proposal:
> >>
> https://cwiki.apache.org/confluence/display/BEAM/%5BBIP-1%5D+Beam+Schema+Options
> >>
> >> Can I have your votes?
> >>
> >>  _/
> >> _/ Alex Van Boxel
> >
>


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Udi Meiri
Welcome Kamil!

On Fri, Feb 28, 2020 at 12:53 PM Mark Liu  wrote:

> Congrats, Kamil!
>
> On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía  wrote:
>
>> Congratulations Kamil!
>>
>> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>>
>>> Congrats, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congratulations, Kamil!

 On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
 wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Kamil Wasilewski
>
> Kamil has contributed to Beam in many ways, including the performance
> testing infrastructure, and a custom BQ source, along with other
> contributions.
>
> In consideration of his contributions, the Beam PMC trusts him with
> the responsibilities of a Beam committer[1].
>
> Thanks for your contributions Kamil!
>
> Pablo, on behalf of the Apache Beam PMC.
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>




Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Mark Liu
Congrats, Kamil!

On Fri, Feb 28, 2020 at 12:23 PM Ismaël Mejía  wrote:

> Congratulations Kamil!
>
> On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:
>
>> Congrats, Kamil!
>>
>> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Congratulations, Kamil!
>>>
>>> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada 
>>> wrote:
>>>
 Hi everyone,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Kamil Wasilewski

 Kamil has contributed to Beam in many ways, including the performance
 testing infrastructure, and a custom BQ source, along with other
 contributions.

 In consideration of his contributions, the Beam PMC trusts him with the
 responsibilities of a Beam committer[1].

 Thanks for your contributions Kamil!

 Pablo, on behalf of the Apache Beam PMC.

 [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer




Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Ismaël Mejía
Congratulations Kamil!

On Fri, Feb 28, 2020 at 7:09 PM Yichi Zhang  wrote:

> Congrats, Kamil!
>
> On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev 
> wrote:
>
>> Congratulations, Kamil!
>>
>> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada  wrote:
>>
>>> Hi everyone,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Kamil Wasilewski
>>>
>>> Kamil has contributed to Beam in many ways, including the performance
>>> testing infrastructure, and a custom BQ source, along with other
>>> contributions.
>>>
>>> In consideration of his contributions, the Beam PMC trusts him with the
>>> responsibilities of a Beam committer[1].
>>>
>>> Thanks for your contributions Kamil!
>>>
>>> Pablo, on behalf of the Apache Beam PMC.
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>>


Trying to use local snapshot of beam-sdks-java-core in samples but failed

2020-02-28 Thread Yiru Tang
Hi,

I am playing around with adding some code to BigQueryIO. I built my project locally
and installed beam-sdks-java-core into my local Maven repository:
mvn install:install-file 
-Dfile=./sdks/java/core/build/libs/beam-sdks-java-core-2.19.0-SNAPSHOT.jar  
-DgroupId=org.apache.beam  -DartifactId=beam-sdks-java-core 
-Dversion=2.19.0-SNAPSHOT -Dpackaging=jar  -DgeneratePom=true

And then try to reference it from word-count-beam by changing the version:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.19.0-SNAPSHOT</version>
</dependency>

When I ran the sample using the Dataflow runner:
mvn -Pdataflow-runner compile exec:java   
-Dexec.mainClass=org.apache.beam.examples.NewWriter   
-Dexec.args="--project=bigquerytestdefault --runner=DataflowRunner"

It failed with:
[WARNING]
java.util.ServiceConfigurationError: 
org.apache.beam.runners.core.construction.CoderTranslatorRegistrar: Provider 
org.apache.beam.runners.core.construction.ModelCoderRegistrar could not be 
instantiated
at java.util.ServiceLoader.fail (ServiceLoader.java:232)
at java.util.ServiceLoader.access$100 (ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService (ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next (ServiceLoader.java:404)
at java.util.ServiceLoader$1.next (ServiceLoader.java:480)
at org.apache.beam.runners.core.construction.CoderTranslation.loadCoderURNs 
(CoderTranslation.java:52)
at org.apache.beam.runners.core.construction.CoderTranslation.<clinit> 
(CoderTranslation.java:44)
at org.apache.beam.runners.core.construction.SdkComponents.registerCoder 
(SdkComponents.java:254)
at org.apache.beam.runners.core.construction.PCollectionTranslation.toProto 
(PCollectionTranslation.java:35)
at 
org.apache.beam.runners.core.construction.SdkComponents.registerPCollection 
(SdkComponents.java:209)
at 
org.apache.beam.runners.core.construction.PTransformTranslation.translateAppliedPTransform
 (PTransformTranslation.java:471)
at 
org.apache.beam.runners.core.construction.PTransformTranslation$KnownTransformPayloadTranslator.translate
 (PTransformTranslation.java:412)
at org.apache.beam.runners.core.construction.PTransformTranslation.toProto 
(PTransformTranslation.java:225)
at 
org.apache.beam.runners.core.construction.SdkComponents.registerPTransform 
(SdkComponents.java:157)
at 
org.apache.beam.runners.core.construction.PipelineTranslation$1.visitPrimitiveTransform
 (PipelineTranslation.java:87)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit 
(TransformHierarchy.java:665)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit 
(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit 
(TransformHierarchy.java:657)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600 
(TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit 
(TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically (Pipeline.java:460)
at org.apache.beam.runners.core.construction.PipelineTranslation.toProto 
(PipelineTranslation.java:59)
at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translate 
(DataflowPipelineTranslator.java:187)
at org.apache.beam.runners.dataflow.DataflowRunner.run 
(DataflowRunner.java:795)
at org.apache.beam.runners.dataflow.DataflowRunner.run 
(DataflowRunner.java:186)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:315)
at org.apache.beam.sdk.Pipeline.run (Pipeline.java:301)
at org.apache.beam.examples.OldWriter.runWrite (OldWriter.java:79)
at org.apache.beam.examples.OldWriter.main (OldWriter.java:86)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke 
(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke 
(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: 
org/apache/beam/sdk/util/WindowedValue$ParamWindowedValueCoder
at org.apache.beam.runners.core.construction.ModelCoderRegistrar.<init> 
(ModelCoderRegistrar.java:63)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0 (Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance 
(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance 
(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance (Constructor.java:423)
at java.lang.Class.newInstance (Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService (ServiceLoader.java:380)
at java.util.ServiceLoader$LazyIterator.next (ServiceLoader.java:404)
at java.util.ServiceLoader$1.next (ServiceLoader.java:480)
at org.apache.beam.runners.c
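
As a generic JVM check (not Beam-specific) for this kind of NoClassDefFoundError,
printing which jar provides the class next to the missing one can reveal a mix of
the locally installed snapshot core and the released 2.19.0 runner artifacts:

// Illustrative diagnostic only: prints the jar that provides WindowedValue
// on the runtime classpath.
System.out.println(
    org.apache.beam.sdk.util.WindowedValue.class
        .getProtectionDomain()
        .getCodeSource()
        .getLocation());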

Re: [PROPOSAL] Preparing for Beam 2.20.0 release

2020-02-28 Thread Rui Wang
Release branch 2.20.0 is already cut.

Currently there should be only one blocking Jira:
https://issues.apache.org/jira/browse/BEAM-9288

But there is a newly added Jira:
https://issues.apache.org/jira/browse/BEAM-9322


I will coordinate on those two Jira.



-Rui


On Thu, Feb 27, 2020 at 3:18 PM Rui Wang  wrote:

> Hi community,
>
> Just fyi:
>
> The 2.20.0 release branch was scheduled to be cut yesterday (02/26).
> However, our Python precommit was broken, so I didn't cut the branch.
>
> I am closely working with PR [1] owner to fix the python precommit. Once
> the fix is in, I will cut the release branch immediately.
>
>
> [1]: https://github.com/apache/beam/pull/10982
>
>
> -Rui
>
> On Thu, Feb 20, 2020 at 7:06 AM Ismaël Mejía  wrote:
>
>> Not yet; as of the last check nobody is tackling it, it is still unassigned.
>> Let's not forget that the fix for this one requires an extra release of the
>> grpc vendored dependency (the source of the issue).
>>
>> And yes, this is a release blocker for the open source runners, because people
>> tend to package their projects with the respective runners in a jar, and this
>> is breaking at the moment.
>>
>> Kenn changed the priority of BEAM-9252 from Blocker to Critical to
>> follow the
>> conventions in [1], and from those definitions  'most critical bugs
>> should
>> block release'.
>>
>> [1] https://beam.apache.org/contribute/jira-priorities/
>>
>> On Thu, Feb 20, 2020 at 3:42 AM Ahmet Altay  wrote:
>>
>>> Curious, was there a resolution on BEAM-9252? Would it be a release
>>> blocker?
>>>
>>> On Fri, Feb 14, 2020 at 12:42 AM Ismaël Mejía  wrote:
>>>
 Thanks Rui for volunteering and for keeping the release pace!

 Since we are discussing the next release, I would like to highlight
 that nobody
 apparently is working on this blocker issue:

 BEAM-9252 Problem shading Beam pipeline with Beam 2.20.0-SNAPSHOT
 https://issues.apache.org/jira/browse/BEAM-9252

 This is a regression introduced by the move to vendored gRPC 1.26.0 and
 it
 probably will require an extra vendored gRPC release so better to give
 it
 some priority.


 On Wed, Feb 12, 2020 at 6:48 PM Ahmet Altay  wrote:

> +1. Thank you.
>
> On Tue, Feb 11, 2020 at 11:01 PM Rui Wang  wrote:
>
>> Hi all,
>>
>> The next (2.20.0) release branch cut is scheduled for 02/26,
>> according to the calendar.
>> I would like to volunteer myself to do this release.
>> The plan is to cut the branch on that date, and cherry-pick release-blocking
>> fixes afterwards if any.
>>
>> Any unresolved release blocking JIRA issues for 2.20.0 should have
>> their "Fix Version/s" marked as "2.20.0".
>>
>> Any comments or objections?
>>
>>
>> -Rui
>>
>


Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-02-28 Thread Kenneth Knowles
What are the timestamps on the elements?

On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta  wrote:

> Edit: Issue is on Direct Runner (not Direction Runner - mistyped)
> Issue Details:
> Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1, e-4, e-5
> Batch Size: 5
> Expected output: a-1,4, b-3, c-5, d-1, e-4,5
> Getting packets with irregular sizes like a-1, b-5, e-4,5 OR a-1,4, c-5, etc.
> But I always got the correct number of packets with BATCH_SIZE = 1
>
> On 2020/02/27 20:40:16, Kenneth Knowles  wrote:
> > Can you share some more details? What is the expected output and what
> > output are you seeing?
> >
> > On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta 
> wrote:
> >
> > > Hey folks, I am using the Apache Beam framework in Java with Direction
> > > Runner for local testing purposes. When using GroupIntoBatches with a
> > > batch size of 1, it works perfectly fine, i.e. the output of the
> > > transform is consistent and as expected. But when using a batch size > 1,
> > > the output PCollection has less data than it should.
> > >
> > > Pipeline flow:
> > > 1. A Transform for reading from pubsub
> > > 2. Transform for making a KV out of the data
> > > 3. A Fixed Window transform of 1 second
> > > 4. Applying GroupIntoBatches transform
> > > 5. And last, Logging the resulting Iterables.
> > >
> > > Weird thing is that batch_size > 1 works great when running on
> > > DataflowRunner but not with DirectRunner. I think the issue might be
> > > with timer expiry, since GroupIntoBatches uses BagState internally.
> > >
> > > Any help will be much appreciated.
> > >
> >
>


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Yichi Zhang
Congrats, Kamil!

On Fri, Feb 28, 2020 at 9:53 AM Valentyn Tymofieiev 
wrote:

> Congratulations, Kamil!
>
> On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada  wrote:
>
>> Hi everyone,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Kamil Wasilewski
>>
>> Kamil has contributed to Beam in many ways, including the performance
>> testing infrastructure, and a custom BQ source, along with other
>> contributions.
>>
>> In consideration of his contributions, the Beam PMC trusts him with the
>> responsibilities of a Beam committer[1].
>>
>> Thanks for your contributions Kamil!
>>
>> Pablo, on behalf of the Apache Beam PMC.
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>>


Re: [ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Valentyn Tymofieiev
Congratulations, Kamil!

On Fri, Feb 28, 2020 at 9:34 AM Pablo Estrada  wrote:

> Hi everyone,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Kamil Wasilewski
>
> Kamil has contributed to Beam in many ways, including the performance
> testing infrastructure, and a custom BQ source, along with other
> contributions.
>
> In consideration of his contributions, the Beam PMC trusts him with the
> responsibilities of a Beam committer[1].
>
> Thanks for your contributions Kamil!
>
> Pablo, on behalf of the Apache Beam PMC.
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>
>


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.10.8 RC2

2020-02-28 Thread Kenneth Knowles
+1 (binding)

On Fri, Feb 28, 2020 at 8:19 AM Ismaël Mejía  wrote:

> +1 (binding)
>
> On Wed, Feb 26, 2020 at 10:28 PM Robert Bradshaw 
> wrote:
>
>> +1 (binding)
>>
>> On Wed, Feb 26, 2020 at 1:11 PM Pablo Estrada  wrote:
>> >
>> > +1 (binding)
>> > Verified hashes.
>> > Thank you Ismael!
>> >
>> > On Wed, Feb 26, 2020 at 11:30 AM Luke Cwik  wrote:
>> >>
>> >> +1 (binding)
>> >>
>> >> Verified signatures and contents of jar to not contain
>> module-info.class
>> >>
>> >> On Wed, Feb 26, 2020 at 10:45 AM Kai Jiang  wrote:
>> >>>
>> >>> +1 (non-binding)
>> >>>
>> >>> On Wed, Feb 26, 2020 at 01:23 Ismaël Mejía  wrote:
>> 
>>  Please review the release of the following artifacts that we vendor:
>>   * beam-vendor-bytebuddy-1_10_8
>> 
>>  Hi everyone,
>>  Please review and vote on the release candidate #2 for the version
>> 0.1, as follows:
>>  [ ] +1, Approve the release
>>  [ ] -1, Do not approve the release (please provide specific comments)
>> 
>>  The complete staging area is available for your review, which
>> includes:
>>  * the official Apache source release to be deployed to
>> dist.apache.org [1], which is signed with the key with fingerprint
>> 3415631729E15B33051ADB670A9DAF6713B86349 [2],
>>  * all artifacts to be deployed to the Maven Central Repository [3],
>>  * commit hash "63492776f154464f67533a6059f162e6b8cf7315" [4],
>> 
>>  The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>> 
>>  Thanks,
>>  Release Manager
>> 
>>  [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>  [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>  [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1094/
>>  [4]
>> https://github.com/apache/beam/commit/63492776f154464f67533a6059f162e6b8cf7315
>> 
>>
>


[ANNOUNCE] New Committer: Kamil Wasilewski

2020-02-28 Thread Pablo Estrada
Hi everyone,

Please join me and the rest of the Beam PMC in welcoming a new committer:
Kamil Wasilewski

Kamil has contributed to Beam in many ways, including the performance
testing infrastructure, and a custom BQ source, along with other
contributions.

In consideration of his contributions, the Beam PMC trusts him with the
responsibilities of a Beam committer[1].

Thanks for your contributions Kamil!

Pablo, on behalf of the Apache Beam PMC.

[1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: Docker images are migrated to Apache org.

2020-02-28 Thread Hannah Jiang
>
> Great news! Thanks Hannah for all your attention to this issue; let's not
> forget to thank the INFRA guys for their help.

Great point; I left an additional comment on the ticket.



On Thu, Feb 27, 2020 at 9:06 AM Kyle Weaver  wrote:

> Hi Ismael, thanks for bringing that up. Can you please start a new thread
> for the image cleanup issue, including details on which images you mean? I
> want to make sure it has sufficient visibility.
>
> On Thu, Feb 27, 2020 at 7:23 AM Ismaël Mejía  wrote:
>
>> Independent of this thread, but slightly related.
>>
>> I ended up looking at the gcr.io images we use for testing and noticed
>> that there are literally hundreds of them. Do we have cleanup jobs for
>> those?
>> Can anyone take a look so we don't end up wasting resources there too.
>>
>>
>>
>> On Thu, Feb 27, 2020 at 2:30 PM Ismaël Mejía  wrote:
>>
>>> Great news! Thanks Hannah for all your attention to this issue; let's
>>> not forget to thank the INFRA guys for their help.
>>>
>>> On Tue, Feb 25, 2020 at 2:03 AM Hannah Jiang 
>>> wrote:
>>>
 Thanks everyone for reporting issues and potential problems etc.
 Please feel free to let me know if you have any questions or see
 additional issues!


 On Mon, Feb 24, 2020 at 4:17 PM Pablo Estrada 
 wrote:

> Thanks Hannah! This is great : )
> -P.
>
> On Mon, Feb 24, 2020 at 12:47 PM Robert Burke 
> wrote:
>
>> It was only merged in an hour ago. That explains why I still saw it
>> as broken before lunch. Thanks again!
>>
>> On Mon, Feb 24, 2020, 12:42 PM Robert Burke 
>> wrote:
>>
>>> NVM. I had stale pages for some reason. Hannah fixed them already. :D
>>>
>>> On Mon, Feb 24, 2020, 12:39 PM Robert Burke 
>>> wrote:
>>>
 Looks like the change broke the Go postcommits. I've filed
 BEAM-9374 for it. I think I know how to fix it though.

 On Mon, Feb 24, 2020, 12:18 PM Kyle Weaver 
 wrote:

> I wonder if searching for "apache beam" (instead of "apache/beam")
> will ever work on Docker hub once the new images start getting more
> downloads. Currently no relevant results are included. Unfortunately 
> this
> might not be something we can control, as it seems Docker hub's 
> search is
> not as sophisticated as I would like.
>
> Anyway, this is still a great improvement. Thanks Hannah for
> making this happen.
>
> On Fri, Feb 21, 2020 at 11:58 AM Ahmet Altay 
> wrote:
>
>> Thank you, Hannah! This is great.
>>
>> On Fri, Feb 21, 2020 at 11:24 AM Hannah Jiang <
>> hannahji...@google.com> wrote:
>>
>>> Hello team
>>>
> >>> Docker SDK images (Python, Java, Go) and Flink job server images
> >>> have been migrated to the Apache org[1].
> >>> I confirmed the digests of all the images in the two repos are
> >>> exactly the same.
> >>> In addition, I updated the readme pages of the new repos. The readme
> >>> pages match the ones on GitHub.
>>>
>>> New images will be deployed to Apache org from v2.20.0. I added
>>> notices to the original repos[2] about the changes.
> >>> Spark job server images will be added from v2.20 as well, thanks
> >>> to @Kyle Weaver  for making it happen[3].
>>>
>>> Thanks,
>>> Hannah
>>>
>>> 1. https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>> 2. https://hub.docker.com/search?q=apachebeam/&type=image
>>> 3. https://github.com/apache/beam/pull/10921
>>>
>>


Re: GroupIntoBatches not Working properly for Direct Runner Java

2020-02-28 Thread Vasu Gupta
Edit: Issue is on Direct Runner (not Direction Runner - mistyped)
Issue Details: 
Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1, e-4, e-5
Batch Size: 5
Expected output: a-1,4, b-3, c-5, d-1, e-4,5
Getting packets with irregular sizes like a-1, b-5, e-4,5 OR a-1,4, c-5, etc.
But I always got the correct number of packets with BATCH_SIZE = 1
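
For what it's worth, a minimal DirectRunner repro sketch with fully controlled
timestamps and watermark (the element values, timestamps, and use of TestStream
are illustrative, not the original pipeline):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.testing.TestStream;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

Pipeline p = Pipeline.create(); // defaults to DirectRunner

// All elements land in the same 1-second window; the final watermark advance
// past the window end should force GroupIntoBatches to flush batches smaller
// than BATCH_SIZE.
TestStream<KV<String, Integer>> stream =
    TestStream.create(KvCoder.of(StringUtf8Coder.of(), VarIntCoder.of()))
        .addElements(
            TimestampedValue.of(KV.of("a", 1), new Instant(0)),
            TimestampedValue.of(KV.of("a", 4), new Instant(100)),
            TimestampedValue.of(KV.of("b", 3), new Instant(200)))
        .advanceWatermarkToInfinity();

p.apply(stream)
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardSeconds(1))))
    .apply(GroupIntoBatches.<String, Integer>ofSize(5));

p.run().waitUntilFinish();

If the batches come out complete here but not against the real source, the
problem likely lies in the source's timestamps/watermark rather than in
GroupIntoBatches itself.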

On 2020/02/27 20:40:16, Kenneth Knowles  wrote: 
> Can you share some more details? What is the expected output and what
> output are you seeing?
> 
> On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta  wrote:
> 
> > Hey folks, I am using the Apache Beam framework in Java with Direction Runner
> > for local testing purposes. When using GroupIntoBatches with a batch size of 1,
> > it works perfectly fine, i.e. the output of the transform is consistent and
> > as expected. But when using a batch size > 1, the output PCollection has
> > less data than it should.
> >
> > Pipeline flow:
> > 1. A Transform for reading from pubsub
> > 2. Transform for making a KV out of the data
> > 3. A Fixed Window transform of 1 second
> > 4. Applying GroupIntoBatches transform
> > 5. And last, Logging the resulting Iterables.
> >
> > Weird thing is that batch_size > 1 works great when running on
> > DataflowRunner but not with DirectRunner. I think the issue might be with
> > timer expiry, since GroupIntoBatches uses BagState internally.
> >
> > Any help will be much appreciated.
> >
> 


Re: [VOTE] Vendored Dependencies Release Byte Buddy 1.10.8 RC2

2020-02-28 Thread Ismaël Mejía
+1 (binding)

On Wed, Feb 26, 2020 at 10:28 PM Robert Bradshaw 
wrote:

> +1 (binding)
>
> On Wed, Feb 26, 2020 at 1:11 PM Pablo Estrada  wrote:
> >
> > +1 (binding)
> > Verified hashes.
> > Thank you Ismael!
> >
> > On Wed, Feb 26, 2020 at 11:30 AM Luke Cwik  wrote:
> >>
> >> +1 (binding)
> >>
> >> Verified signatures and contents of jar to not contain module-info.class
> >>
> >> On Wed, Feb 26, 2020 at 10:45 AM Kai Jiang  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> On Wed, Feb 26, 2020 at 01:23 Ismaël Mejía  wrote:
> 
>  Please review the release of the following artifacts that we vendor:
>   * beam-vendor-bytebuddy-1_10_8
> 
>  Hi everyone,
>  Please review and vote on the release candidate #2 for the version
> 0.1, as follows:
>  [ ] +1, Approve the release
>  [ ] -1, Do not approve the release (please provide specific comments)
> 
>  The complete staging area is available for your review, which
> includes:
>  * the official Apache source release to be deployed to
> dist.apache.org [1], which is signed with the key with fingerprint
> 3415631729E15B33051ADB670A9DAF6713B86349 [2],
>  * all artifacts to be deployed to the Maven Central Repository [3],
>  * commit hash "63492776f154464f67533a6059f162e6b8cf7315" [4],
> 
>  The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> 
>  Thanks,
>  Release Manager
> 
>  [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>  [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>  [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1094/
>  [4]
> https://github.com/apache/beam/commit/63492776f154464f67533a6059f162e6b8cf7315
> 
>


Re: Issue with KafkaIO for list of topics

2020-02-28 Thread rahul patwari
Hi Maulik,

Currently, I don't think it is possible to filter topics based on whether
data is being produced to the topic or not.
But the watermark logic can be changed to make the pipeline work.

Since the timestamps of the records are the times when the events were pushed
to Kafka, every record will have monotonically increasing timestamps, except
for out-of-order events.
Instead of assigning the watermark as BoundedWindow.TIMESTAMP_MIN_VALUE by
default, we can assign [current_timestamp - some_delay] as the default, and the
same can be done in the getWatermark() method, in which case the watermark
will advance even if the partition is idle.

Make sure that the timestamp of the watermark is monotonically increasing,
and choose the delay carefully in order to avoid discarding out-of-order
events.

Refer
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/CustomTimestampPolicyWithLimitedDelay.java
for an example.
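
For illustration, a minimal sketch of such a policy (the two-minute delay and
the use of the broker-assigned record timestamp are assumptions to adapt, not
part of the pipeline discussed here):

import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.joda.time.Duration;
import org.joda.time.Instant;

class IdleAwareTimestampPolicy extends TimestampPolicy<byte[], byte[]> {
  // Assumed bound on out-of-orderness; events older than this may be dropped.
  private static final Duration MAX_DELAY = Duration.standardMinutes(2);
  private Instant maxEventTimestamp;

  IdleAwareTimestampPolicy(Optional<Instant> previousWatermark) {
    // Resume from the previous watermark if present; otherwise start at
    // now - MAX_DELAY instead of TIMESTAMP_MIN_VALUE.
    this.maxEventTimestamp = previousWatermark.orElse(Instant.now().minus(MAX_DELAY));
  }

  @Override
  public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<byte[], byte[]> record) {
    Instant ts = new Instant(record.getTimestamp());
    if (ts.isAfter(maxEventTimestamp)) {
      maxEventTimestamp = ts;
    }
    return ts;
  }

  @Override
  public Instant getWatermark(PartitionContext ctx) {
    // Never lag more than MAX_DELAY behind wall clock, so an idle partition
    // cannot pin the watermark at TIMESTAMP_MIN_VALUE.
    Instant floor = Instant.now().minus(MAX_DELAY);
    return maxEventTimestamp.isBefore(floor) ? floor : maxEventTimestamp;
  }
}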

Regards,
Rahul


On Fri, Feb 28, 2020 at 6:54 PM Maulik Soneji 
wrote:

> Hi Rahul,
>
> Thank you very much for the detailed explanation.
>
> Since we don't know which topics have zero throughput, is there
> a way in which we can filter out such topics in KafkaIO?
>
> Since KafkaIO doesn't support passing a regex to consume data from, I am
> getting a list of topics from kafka and passing it.
>
> Is there a way to filter out such topics? Also, it can happen that when
> the job has started, a topic might have no data for a few windows and
> after that it can get some data. This filter should be dynamic as well.
>
> Please share some ideas on how we can make this work.
>
> Community members, please share your thoughts as well on how we can
> achieve this.
>
> Thanks and regards,
> Maulik
>
> On Fri, Feb 28, 2020 at 3:03 PM rahul patwari 
> wrote:
>
>> Hi Maulik,
>>
>> This seems like an issue with Watermark.
>> According to
>> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L240
>> ,
>>
>> If there are multiple partitions (or multiple topics), the watermark will be
>> calculated for each partition and the minimum watermark is
>> considered the current watermark.
>> Assuming that no message is pushed to the topic with 0 throughput,
>> according to your logic for the watermark calculation, the watermark of
>> each partition for this topic will be BoundedWindow.TIMESTAMP_MIN_VALUE
>> (the smallest representable timestamp of an element -
>> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/model/pipeline/src/main/proto/beam_runner_api.proto#L44
>> ).
>>
>> As the result will be emitted from GroupByKey only when the watermark crosses
>> the end of the window, and the watermark is stuck at BoundedWindow.TIMESTAMP_MIN_VALUE,
>> you are not seeing the results from GroupByKey.
>>
>> Regards,
>> Rahul
>>
>> On Fri, Feb 28, 2020 at 12:39 PM Maulik Soneji 
>> wrote:
>>
>>> *Observations:*
>>> If we read using KafkaIO for a list of topics where one of the topics
>>> has zero throughput,
>>> and KafkaIO is followed by a GroupByKey stage, then:
>>> a. No data is output from the GroupByKey stage for any of the topics, not
>>> just the zero-throughput topic.
>>>
>>> If all topics have some throughput coming in, then it works fine and we
>>> get some output from GroupByKey stage.
>>>
>>> Is this an issue?
>>>
>>> *Points:*
>>> a. The output from GroupByKey comes only when all topics have some
>>> throughput
>>> b. This is a problem with KafkaIO + GroupByKey; in the case where I have
>>> FileIO + GroupByKey, this issue doesn't arise. GroupByKey outputs some data
>>> even if there is no data for one of the files.
>>> c. Not a runner issue, since I ran it with FlinkRunner and DataflowRunner
>>> d. Even if lag is different for each topic on the list, we still get
>>> some output from GroupByKey.
>>>
>>>
>>> *Debugging:*
>>> While debugging this issue I found that in the split function
>>> of KafkaUnboundedSource we create a KafkaUnboundedSource whose partition
>>> list is one partition for each topic.
>>>
>>> I am not sure if this is some issue with the watermark, since the watermark
>>> for the topic with no throughput will not advance. But this looks like the most
>>> likely cause to me.
>>>
>>> *Please help me in figuring out whether this is an issue or if there is
>>> something wrong with my pipeline.*
>>>
>>> Attaching detailed pipeline information for more details:
>>>
>>> *Context:*
>>> I am currently using KafkaIO to read data from kafka for a list of
>>> topics with a custom timestamp policy.
>>>
>>> Below is how I am constructing KafkaIO reader:
>>>
>>> return KafkaIO.read()
>>> .withBootstrapServers(brokers)
>>> .withTopics(topics)
>>> .withKeyDeserializer(ByteArrayDeserializer.class)
>>> .withValueDeserializer(ByteArrayDeserializer.class)
>>> .withTimestampPolicyFactory((partition, previousWatermark) -> new 
>>> EventT

Re: Java SplittableDoFn Watermark API

2020-02-28 Thread Ismaël Mejía
I just realized that the HBaseIO example is not a good one, because we can
already have Watch-like behavior, as we do for partition discovery in
HCatalogIO.
Still, I am interested in your views on bounded/unbounded unification.

A second interesting question: how will these annotations connect with the
Watch transform's polling patterns?
https://github.com/apache/beam/blob/650e6cd9c707472c34055382a1356cf22d14ee5e/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L178
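
For context, the Watch pattern looks roughly like the sketch below (hedged:
listTables is an assumed helper and 'my-hbase-instance' a placeholder; HBaseIO
has no such discovery API today):

import java.util.List;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

// Poll once a minute; Watch de-duplicates outputs per input, so downstream
// only sees newly discovered tables.
PCollection<KV<String, String>> tables =
    p.apply(Create.of("my-hbase-instance"))
     .apply(Watch.growthOf(
             new Watch.Growth.PollFn<String, String>() {
               @Override
               public Watch.Growth.PollResult<String> apply(String instance, Context c) {
                 List<String> current = listTables(instance); // assumed helper
                 return Watch.Growth.PollResult.incomplete(Instant.now(), current);
               }
             })
         .withPollInterval(Duration.standardMinutes(1)));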


On Fri, Feb 28, 2020 at 10:47 AM Ismaël Mejía  wrote:

> Really interesting! Implementing the watermark correctly has been a common
> struggle for IO authors, to the point that some IOs still have issues around
> that. So +1 for this, in particular if we can get to reuse common patterns.
> I was not aware of Boyuan's work around this, really nice.
>
> One aspect I have always been confused about since I read the SDF proposal
> documents is whether we could have a single API for both bounded and unbounded
> IO by somehow treating a bounded SDF as a special case of an unbounded SDF.
> Could WatermarkEstimator help in this direction?
>
> One quick case I can think of is making the current HBaseIO SDF work in an
> unbounded manner, for example to 'watch and read new tables'.
>
>
> On Thu, Feb 27, 2020 at 11:43 PM Luke Cwik  wrote:
>
>> See this doc[1] and blog[2] for some context about SplittableDoFns.
>>
>> To support watermark reporting within the Java SDK for SplittableDoFns,
>> we need a way for SDF authors to report watermark estimates over the
>> element and restriction pair that they are processing.
>>
>> For UnboundedSources, it was found to be a pain point to ask each SDF
>> author to write their own watermark estimation which typically prevented
>> re-use. Therefore we would like to have a "library" of watermark estimators
>> that help SDF authors perform this estimation similar to how there is a
>> "library" of restrictions and restriction trackers that SDF authors can
>> use. For SDF authors where the existing library doesn't work, they can add
>> additional ones that observe timestamps of elements or choose to directly
>> report the watermark through a "ManualWatermarkEstimator" parameter that
>> can be supplied to @ProcessElement methods.
>>
>> The public facing portion of the DoFn changes adds three new annotations
>> for new DoFn style methods:
>> GetInitialWatermarkEstimatorState: Returns the initial watermark state,
>> similar to GetInitialRestriction
>> GetWatermarkEstimatorStateCoder: Returns a coder compatible with
>> watermark state type, similar to GetRestrictionCoder for restrictions
>> returned by GetInitialRestriction.
>> NewWatermarkEstimator: Returns a watermark estimator that either the
>> framework invokes allowing it to observe the timestamps of output records
>> or a manual watermark estimator that can be explicitly invoked to update
>> the watermark.
>>
>> See [3] for an initial PR with the public facing additions to the core
>> Java API related to SplittableDoFn.
>>
>> This mirrors a bunch of work that was done by Boyuan within the Python SDK
>> [4, 5] but in the style of new DoFn parameter/method invocation we have in
>> the Java SDK.
>>
>> 1: https://s.apache.org/splittable-do-fn
>> 2: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
>> 3: https://github.com/apache/beam/pull/10992
>> 4: https://github.com/apache/beam/pull/9794
>> 5: https://github.com/apache/beam/pull/10375
>>
>


Re: Are there extra Beam Python test matchers available?

2020-02-28 Thread Kamil Wasilewski
Hi,

You can use matchers from the hamcrest module. For example, assuming your
PCollection consists of a single list, you can use something like this to
test if it contains a subset:

import hamcrest as hc
assert_that(pcoll, matches_all([hc.has_items(1, 3, 6, 10)]))

Thanks,
Kamil

On Tue, Feb 25, 2020 at 7:24 PM Liu Wang  wrote:

> Hi,
>
> As far as I know, the matchers I can use with assert_that are
> equal_to and matches_all in this file:
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/util.py
>
> Is there any matcher with which I can test that a PCollection contains a
> subset? Is there any matcher with which I can test that a PCollection
> contains any number of a specific element?
>
> For example, I want to test the PCollection [1, 3, 6, 10, 15, 15, 15]
> contains [1, 3, 6, 10], and it contains any number of 15s.
>
> Thanks,
> Liu
>


Re: Issue with KafkaIO for list of topics

2020-02-28 Thread Maulik Soneji
Hi Rahul,

Thank you very much for the detailed explanation.

Since we don't know which topics have zero throughput, is there
a way in which we can filter out such topics in KafkaIO?

Since KafkaIO doesn't support passing a regex to consume data from, I am
getting a list of topics from kafka and passing it.

Is there a way to filter out such topics? Also, it can happen that when the
job has started, a topic might have no data for a few windows and after
that it can get some data. This filter should be dynamic as well.

Please share some ideas on how we can make this work.

Community members, please share your thoughts as well on how we can achieve
this.

Thanks and regards,
Maulik

On Fri, Feb 28, 2020 at 3:03 PM rahul patwari 
wrote:

> Hi Maulik,
>
> This seems like an issue with Watermark.
> According to
> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L240
> ,
>
> If there are multiple partitions (or multiple topics), the watermark will be
> calculated for each partition and the minimum watermark is
> considered the current watermark.
> Assuming that no message is pushed to the topic with 0 throughput,
> according to your logic for the watermark calculation, the watermark of
> each partition for this topic will be BoundedWindow.TIMESTAMP_MIN_VALUE
> (the smallest representable timestamp of an element -
> https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/model/pipeline/src/main/proto/beam_runner_api.proto#L44
> ).
>
> As the result will be emitted from GroupByKey only when the watermark crosses
> the end of the window, and the watermark is stuck at BoundedWindow.TIMESTAMP_MIN_VALUE,
> you are not seeing the results from GroupByKey.
>
> Regards,
> Rahul
>
> On Fri, Feb 28, 2020 at 12:39 PM Maulik Soneji 
> wrote:
>
>> *Observations:*
>> If we read using KafkaIO for a list of topics where one of the topics has
>> zero throughput,
>> and KafkaIO is followed by a GroupByKey stage, then:
>> a. No data is output from the GroupByKey stage for any of the topics, not
>> just the zero-throughput topic.
>>
>> If all topics have some throughput coming in, then it works fine and we
>> get some output from GroupByKey stage.
>>
>> Is this an issue?
>>
>> *Points:*
>> a. The output from GroupByKey comes only when all topics have some throughput
>> b. This is a problem with KafkaIO + GroupByKey; in the case where I have
>> FileIO + GroupByKey, this issue doesn't arise. GroupByKey outputs some data
>> even if there is no data for one of the files.
>> c. Not a runner issue, since I ran it with FlinkRunner and DataflowRunner
>> d. Even if lag is different for each topic on the list, we still get some
>> output from GroupByKey.
>>
>>
>> *Debugging:*
>> While debugging this issue I found that in the split function of
>> KafkaUnboundedSource we create a KafkaUnboundedSource whose partition list is
>> one partition for each topic.
>>
>> I am not sure if this is some issue with the watermark, since the watermark
>> for the topic with no throughput will not advance. But this looks like the most
>> likely cause to me.
>>
>> *Please help me in figuring out whether this is an issue or if there is
>> something wrong with my pipeline.*
>>
>> Attaching detailed pipeline information for more details:
>>
>> *Context:*
>> I am currently using KafkaIO to read data from kafka for a list of topics
>> with a custom timestamp policy.
>>
>> Below is how I am constructing KafkaIO reader:
>>
>> return KafkaIO.read()
>> .withBootstrapServers(brokers)
>> .withTopics(topics)
>> .withKeyDeserializer(ByteArrayDeserializer.class)
>> .withValueDeserializer(ByteArrayDeserializer.class)
>> .withTimestampPolicyFactory((partition, previousWatermark) -> new 
>> EventTimestampPolicy(godataService, previousWatermark))
>> .commitOffsetsInFinalize();
>>
>> *Pipeline Information:*
>> Pipeline consists of the following steps:
>> a. Read from Kafka with custom timestamp policy
>> b. Convert KafkaRecord to Message object
>> c. Window based on FixedWindow of 10 minutes triggering AfterWatermark
>> d. PCollection<Message> to PCollection<KV<String, Message>> where topic is key
>> e. GroupByKey.create() to get PCollection<KV<String, Iterable<Message>>>
>> f. PCollection<KV<String, Iterable<Message>>> to an output PCollection for each topic
>> g. Write output to kafka
>>
>> *Detailed Pipeline Information*
>> a. Read data from kafka to get KafkaRecord
>> Here I am using my own timestamp policy which looks like below:
>>
>> public EventTimestampPolicy(MyService myService, Optional<Instant>
>> previousWatermark) {
>> this.myService = myService;
>> this.currentWatermark = 
>> previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
>> }
>>
>> @Override
>> public Instant getTimestampForRecord(PartitionContext context,
>> KafkaRecord<byte[], byte[]> record) {
>> Instant eventTimestamp;
>> try {
>> eventTimestamp = Deserializer.getEventTimestamp(record, myService);
>> } catch (InvalidProtocolBufferException e) {
>>

Re: Java SplittableDoFn Watermark API

2020-02-28 Thread Ismaël Mejía
Really interesting! Implementing the watermark correctly has been a common
struggle for IO authors, to the point that some IOs still have issues around
that. So +1 for this, in particular if we can get to reuse common patterns.
I was not aware of Boyuan's work around this, really nice.

One aspect I have always been confused about since I read the SDF proposal
documents is whether we could have a single API for both bounded and unbounded
IO by somehow treating a bounded SDF as a special case of an unbounded SDF.
Could WatermarkEstimator help in this direction?

One quick case I can think of is making the current HBaseIO SDF work in an
unbounded manner, for example to 'watch and read new tables'.


On Thu, Feb 27, 2020 at 11:43 PM Luke Cwik  wrote:

> See this doc[1] and blog[2] for some context about SplittableDoFns.
>
> To support watermark reporting within the Java SDK for SplittableDoFns, we
> need a way for SDF authors to report watermark estimates over the
> element and restriction pair that they are processing.
>
> For UnboundedSources, it was found to be a pain point to ask each SDF
> author to write their own watermark estimation which typically prevented
> re-use. Therefore we would like to have a "library" of watermark estimators
> that help SDF authors perform this estimation similar to how there is a
> "library" of restrictions and restriction trackers that SDF authors can
> use. For SDF authors where the existing library doesn't work, they can add
> additional ones that observe timestamps of elements or choose to directly
> report the watermark through a "ManualWatermarkEstimator" parameter that
> can be supplied to @ProcessElement methods.
>
> The public facing portion of the DoFn changes adds three new annotations
> for new DoFn style methods:
> GetInitialWatermarkEstimatorState: Returns the initial watermark state,
> similar to GetInitialRestriction
> GetWatermarkEstimatorStateCoder: Returns a coder compatible with watermark
> state type, similar to GetRestrictionCoder for restrictions returned by
> GetInitialRestriction.
> NewWatermarkEstimator: Returns a watermark estimator that either the
> framework invokes allowing it to observe the timestamps of output records
> or a manual watermark estimator that can be explicitly invoked to update
> the watermark.
>
> See [3] for an initial PR with the public facing additions to the core
> Java API related to SplittableDoFn.
>
> This mirrors a bunch of work that was done by Boyuan within the Python SDK
> [4, 5] but in the style of new DoFn parameter/method invocation we have in
> the Java SDK.
>
> 1: https://s.apache.org/splittable-do-fn
> 2: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
> 3: https://github.com/apache/beam/pull/10992
> 4: https://github.com/apache/beam/pull/9794
> 5: https://github.com/apache/beam/pull/10375
>
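
To make the three annotations concrete, a rough sketch of how they might sit on
an SDF (hedged: the @WatermarkEstimatorState parameter annotation and the
WatermarkEstimators.Manual class are assumptions based on the PR, not settled
API; the restriction methods such as @GetInitialRestriction are omitted):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.InstantCoder;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.ManualWatermarkEstimator;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
import org.apache.beam.sdk.transforms.splittabledofn.WatermarkEstimators;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

class WatermarkedReadFn extends DoFn<String, String> {
  @GetInitialWatermarkEstimatorState
  public Instant initialWatermarkState() {
    // Start pessimistically; the estimator advances it during processing.
    return BoundedWindow.TIMESTAMP_MIN_VALUE;
  }

  @GetWatermarkEstimatorStateCoder
  public Coder<Instant> watermarkStateCoder() {
    return InstantCoder.of();
  }

  @NewWatermarkEstimator
  public ManualWatermarkEstimator<Instant> newWatermarkEstimator(
      @WatermarkEstimatorState Instant state) {
    // Manual variant: the @ProcessElement method reports the watermark itself.
    return new WatermarkEstimators.Manual(state);
  }

  @ProcessElement
  public void process(
      ProcessContext c,
      RestrictionTracker<OffsetRange, Long> tracker,
      ManualWatermarkEstimator<Instant> estimator) {
    // ... claim positions and emit elements, then, e.g.:
    // estimator.setWatermark(timestampOfLastEmittedElement);
  }
}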


Re: Issue with KafkaIO for list of topics

2020-02-28 Thread rahul patwari
Hi Maulik,

This seems like an issue with Watermark.
According to
https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L240
,

If there are multiple partitions (or multiple topics), the watermark will be
calculated for each partition and the minimum watermark is
considered the current watermark.
Assuming that no message is pushed to the topic with 0 throughput,
according to your logic for the watermark calculation, the watermark of
each partition for this topic will be BoundedWindow.TIMESTAMP_MIN_VALUE
(the smallest representable timestamp of an element -
https://github.com/apache/beam/blob/f0930f958d47042948c06041e074ef9f1b0872d9/model/pipeline/src/main/proto/beam_runner_api.proto#L44
).

As the result will be emitted from GroupByKey only when the watermark crosses
the end of the window, and the watermark is stuck at BoundedWindow.TIMESTAMP_MIN_VALUE,
you are not seeing the results from GroupByKey.

Regards,
Rahul

On Fri, Feb 28, 2020 at 12:39 PM Maulik Soneji 
wrote:

> *Observations:*
> If we read using KafkaIO for a list of topics where one of the topics has
> zero throughput,
> and KafkaIO is followed by a GroupByKey stage, then:
> a. No data is output from the GroupByKey stage for any of the topics, not just
> the zero-throughput topic.
>
> If all topics have some throughput coming in, then it works fine and we
> get some output from GroupByKey stage.
>
> Is this an issue?
>
> *Points:*
> a. The output from GroupByKey comes only when all topics have some throughput
> b. This is a problem with KafkaIO + GroupByKey; in the case where I have
> FileIO + GroupByKey, this issue doesn't arise. GroupByKey outputs some data
> even if there is no data for one of the files.
> c. Not a runner issue, since I ran it with FlinkRunner and DataflowRunner
> d. Even if lag is different for each topic on the list, we still get some
> output from GroupByKey.
>
>
> *Debugging:*
> While debugging this issue I found that in the split function of
> KafkaUnboundedSource we create a KafkaUnboundedSource whose partition list is
> one partition for each topic.
>
> I am not sure if this is some issue with the watermark, since the watermark
> for the topic with no throughput will not advance. But this looks like the most
> likely cause to me.
>
> *Please help me in figuring out whether this is an issue or if there is
> something wrong with my pipeline.*
>
> Attaching detailed pipeline information for more details:
>
> *Context:*
> I am currently using KafkaIO to read data from kafka for a list of topics
> with a custom timestamp policy.
>
> Below is how I am constructing KafkaIO reader:
>
> return KafkaIO.read()
> .withBootstrapServers(brokers)
> .withTopics(topics)
> .withKeyDeserializer(ByteArrayDeserializer.class)
> .withValueDeserializer(ByteArrayDeserializer.class)
> .withTimestampPolicyFactory((partition, previousWatermark) -> new 
> EventTimestampPolicy(godataService, previousWatermark))
> .commitOffsetsInFinalize();
>
> *Pipeline Information:*
> Pipeline consists of the following steps:
> a. Read from Kafka with custom timestamp policy
> b. Convert KafkaRecord to Message object
> c. Window based on FixedWindow of 10 minutes triggering AfterWatermark
> d. PCollection<Message> to PCollection<KV<String, Message>> where topic is key
> e. GroupByKey.create() to get PCollection<KV<String, Iterable<Message>>>
> f. PCollection<KV<String, Iterable<Message>>> to an output PCollection for each topic
> g. Write output to kafka
>
> *Detailed Pipeline Information*
> a. Read data from kafka to get KafkaRecord
> Here I am using my own timestamp policy which looks like below:
>
> public EventTimestampPolicy(MyService myService, Optional<Instant>
> previousWatermark) {
> this.myService = myService;
> this.currentWatermark = 
> previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
> }
>
> @Override
> public Instant getTimestampForRecord(PartitionContext context,
> KafkaRecord<byte[], byte[]> record) {
> Instant eventTimestamp;
> try {
> eventTimestamp = Deserializer.getEventTimestamp(record, myService);
> } catch (InvalidProtocolBufferException e) {
> statsClient.increment("io.proto.buffer.exception");
> throw new RuntimeException(e);
> }
> this.currentWatermark = eventTimestamp;
> return this.currentWatermark;
> }
>
> @Override
> public Instant getWatermark(PartitionContext ctx) {
> return this.currentWatermark;
> }
>
> Event timestamp is one of the fields in the kafka message. It is the time
> when the event was pushed to kafka.
>
> b. DoFn to transform KafkaRecord to the Message class. The Message
> class contains properties like topic, partition, offset, and timestamp
>
> c. Windowing on 10 minute fixed window triggering at 
> AfterWatermark.pastEndOfWindow()
>
> d. PCollection<Message> to PCollection<KV<String, Message>>
> Here the key is the kafka topic.
>
> e. GroupByKey to get PCollection<KV<String, Iterable<Message>>>
>
> f. PCollection<KV<String, Iterable<Message>>> to an output PCollection
> for each topic
>
> g. Write output to kafka
>
>


Re: [DISCUSS] How many Python 3.x minor versions should Beam Python SDK aim to support concurrently?

2020-02-28 Thread Ismaël Mejía
One interesting variable that has not been mentioned is what versions of
Python 3 are available to users via their distribution channels (the Linux
distributions they use to develop/run the pipelines).

- RHEL 8 users have python 3.6 available
- RHEL 7 users have python 3.6 available
- Debian 10/Ubuntu 18.04 users have python 3.7/3.6 available
- Debian 9/Ubuntu 16.04 users have python 3.5 available

We should consider this when we evaluate future support removals.

Given that the distros that support Python 3.5 are ~4 years old, and since
Python 3.5 is also losing LTS support soon, it is probably OK to not support
it in Beam anymore, as Robert suggests.


On Thu, Feb 27, 2020 at 3:57 AM Valentyn Tymofieiev 
wrote:

> Thanks everyone for sharing your perspectives so far. It sounds like we
> can mitigate the cost of test infrastructure by having:
> - a selection of (fast) tests that we will want to run against all Python
> versions we support.
> - high priority Python versions, which we will test extensively.
> - infrequent postcommit test that exercise low-priority versions.
> We will need test infrastructure improvements to have the flexibility of
> designating versions of high-pri/low-pri and minimizing efforts requiring
> adopting a new version.
>
> There is still a question of how long we want to support old Py3.x
> versions. As mentioned above, I think we should not support them beyond EOL
> (5 years after a release). I wonder if that is still too long. The cost of
> supporting a version may include:
>  - Developing against older Python version
>  - Release overhead (building & storing containers, wheels, doing release
> validation)
>  - Complexity / development cost to support the quirks of the minor
> versions.
>
> We can decide to drop support after, say, 4 years, or after usage drops
> below a threshold, or decide on a case-by-case basis. Thoughts? Also asked
> for feedback on user@ [1]
>
> [1]
> https://lists.apache.org/thread.html/r630a3b55aa8e75c68c8252ea6f824c3ab231ad56e18d916dfb84d9e8%40%3Cuser.beam.apache.org%3E
>
> On Wed, Feb 26, 2020 at 5:27 PM Robert Bradshaw 
> wrote:
>
>> On Wed, Feb 26, 2020 at 5:21 PM Valentyn Tymofieiev 
>> wrote:
>> >
>> > > +1 to consulting users.
>> > I will message user@ as well and point to this thread.
>> >
>> > > I would propose getting in warnings about 3.5 EoL well ahead of time.
>> > I think we should document on our website, and  in the code (warnings)
>> that users should not expect SDKs to be supported in Beam beyond the EOL.
>> If we want to have flexibility to drop support earlier than EOL, we need to
>> be more careful with messaging because users might otherwise expect that
>> support will last until EOL, if we mention EOL date.
>>
>> +1
>>
>> > I am hoping that we can establish a consensus for when we will be
>> dropping support for a version, so that we don't have to discuss it on a
>> case by case basis in the future.
>> >
>> > > I think it would makes sense to add support for 3.8 right away (or at
>> least get a good sense of what work needs to be done and what our
>> dependency situation is like)
>> > https://issues.apache.org/jira/browse/BEAM-8494 is a starting point. I
>> tried 3.8 a while ago some dependencies were not able to install, checked
>> again just now. SDK is "installable" after minor changes. Some tests don't
>> pass. BEAM-8494 does not have an owner atm, and if anyone is interested I'm
>> happy to give further pointers and help get started.
>> >
>> > > For the 3.x series, I think we will get the most signal out of the
>> lowest and highest version, and can get by with smoke tests +
>> > infrequent post-commits for the ones between.
>> >
>> > > I agree with having low-frequency tests for low-priority versions.
>> Low-priority versions could be determined according to least usage.
>> >
> These are good ideas. Do you think we will want to have the ability to
> run some (inexpensive) tests for all versions frequently (on presubmits),
> or is this extra complexity that can be avoided? I am thinking about type
> inference, for example. Afaik the inference logic is very sensitive to the
> version. Would it be acceptable to catch errors there in infrequent
> postcommits, or would an early signal be preferred?
>>
>> This is a good example--the type inference tests are sensitive to
>> version (due to using internal details and relying on the
>> still-evolving typing module) but also run in ~15 seconds. I think
>> these should be in precommits. We just don't need to run every test
>> for every version.
>>
>> > On Wed, Feb 26, 2020 at 5:17 PM Kyle Weaver 
>> wrote:
>> >>
>> >> Oh, I didn't see Robert's earlier email:
>> >>
>> >> > Currently 3.5 downloads sit at 3.7%, or about
>> >> > 20% of all Python 3 downloads.
>> >>
>> >> Where did these numbers come from?
>> >>
>> >> On Wed, Feb 26, 2020 at 5:15 PM Kyle Weaver 
>> wrote:
>> >>>
>> >>> > I agree with having low-frequency tests for low-priority versions.
>> >>> > Low-priority versions could be 

Re: Java SplittableDoFn Watermark API

2020-02-28 Thread Jan Lukavský
This is cool. Could the watermark estimators be packaged in a module
without additional dependencies? I think that it would be useful even
for projects outside of Beam, so it would be nice if these could use
this library without depending on the Beam SDK itself.


Jan

On 2/28/20 12:50 AM, Luke Cwik wrote:
The Python SDK also has a RestrictionProvider[1] that covers initial
splitting, sizing the restriction, and providing the restriction coder.
I believe that keeping one as a provider while fully integrating the 
other set as "new" DoFn style methods and parameters would be odd.


Kenn, are you also suggesting swapping all the restriction tracker
related methods (e.g. NewTracker, GetRestrictionCoder, ...) to the
Provider style as well?


I like how the new DoFn style methods and parameters are done, because
they are easily extensible. This could be extended for
RestrictionProvider/WatermarkEstimatorProvider via new invokers in the
style of the DoFnInvoker.


1: 
https://github.com/apache/beam/blob/7a4cdece44304acb77a1812390bf35f66f3df0a2/sdks/python/apache_beam/transforms/core.py#L208


On Thu, Feb 27, 2020 at 3:39 PM Kenneth Knowles wrote:


Great idea.

Are any of the methods optional or useful on their own? It seems
like maybe not? So then a single annotation to return an object
that returns all the methods might be more clear. Per Boyuan's
work - WatermarkEstimatorProvider?

Kenn

On Thu, Feb 27, 2020 at 2:43 PM Luke Cwik <lc...@google.com> wrote:

See this doc[1] and blog[2] for some context about
SplittableDoFns.

To support watermark reporting within the Java SDK for
SplittableDoFns, we need a way for SDF authors to report
watermark estimates over the element and restriction pair that
they are processing.

For UnboundedSources, it was found to be a pain point to ask
each SDF author to write their own watermark estimation which
typically prevented re-use. Therefore we would like to have a
"library" of watermark estimators that help SDF authors
perform this estimation similar to how there is a "library" of
restrictions and restriction trackers that SDF authors can
use. For SDF authors where the existing library doesn't work,
they can add additional ones that observe timestamps of
elements or choose to directly report the watermark through a
"ManualWatermarkEstimator" parameter that can be supplied to
@ProcessElement methods.

The public facing portion of the DoFn changes adds three new
annotations for new DoFn style methods:
GetInitialWatermarkEstimatorState: Returns the initial
watermark state, similar to GetInitialRestriction
GetWatermarkEstimatorStateCoder: Returns a coder compatible
with watermark state type, similar to GetRestrictionCoder for
restrictions returned by GetInitialRestriction.
NewWatermarkEstimator: Returns a watermark estimator that
either the framework invokes allowing it to observe the
timestamps of output records or a manual watermark estimator
that can be explicitly invoked to update the watermark.

See [3] for an initial PR with the public facing additions to
the core Java API related to SplittableDoFn.

This mirrors a bunch of work that was done by Boyuan within
the Python SDK [4, 5] but in the style of new DoFn
parameter/method invocation we have in the Java SDK.

1: https://s.apache.org/splittable-do-fn
2: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
3: https://github.com/apache/beam/pull/10992
4: https://github.com/apache/beam/pull/9794
5: https://github.com/apache/beam/pull/10375