Re: Builtin IOs - Link to Java/Pydoc instead of code?

2020-05-05 Thread Chamikara Jayalath
+1 for linking to the more readable version.

- Cham

On Tue, May 5, 2020, 9:26 AM Brian Hulette  wrote:

> +1 for doc links instead of code. I think from a user perspective the code
> link is effectively the same as javadoc/pydoc since they'll just peruse the
> docstrings, except it's harder to read and reflects the behavior at HEAD,
> not at any release.
>
> Brian
>
> On Mon, May 4, 2020 at 6:21 PM Pablo Estrada  wrote:
>
>> Hi all,
>> I just noted that in our Built-in IOs page[1], we tend to link to the
>> code for the IOs that we mention.
>>
>> I think it would be better to link to the Javadoc or the Pydoc for those
>> IOs instead. Thoughts?
>> Best
>> -P.
>>
>> [1] https://beam.apache.org/documentation/io/built-in/
>>
>


Re: [DISCUSS] How many Python 3.x minor versions should Beam Python SDK aim to support concurrently?

2020-05-05 Thread Yoshiki Obata
> Not sure how run_pylint.sh is related here - we should run linter on the 
> entire codebase.
ah, I mistyped... I meant run_pytest.sh

> I am familiar with beam_PostCommit_PythonXX suites. Is there something 
> specific about these suites that you wanted to know?
Test suite runtime will depend on the number of  tests in the suite,
how many tests we run in parallel, how long they take to run. To
understand the load on test infrastructure we can monitor Beam test
health metrics [1]. In particular, if time in queue[2] is high, it is
a sign that there are not enough Jenkins slots available to start the
test suite earlier.
Sorry for ambiguous question. I wanted to know how to see the load on
test infrastructure.
The Grafana links you showed serves my purpose. Thank you.

2020年5月6日(水) 2:35 Valentyn Tymofieiev :
>
> On Mon, May 4, 2020 at 7:06 PM Yoshiki Obata  wrote:
>>
>> Thank you for comment, Valentyn.
>>
>> > 1) We can seed the smoke test suite with typehints tests, and add more 
>> > tests later if there is a need. We can identify them by the file path or 
>> > by special attributes in test files. Identifying them using filepath seems 
>> > simpler and independent of test runner.
>>
>> Yes, making run_pylint.sh allow target test file paths as arguments is
>> good way if could.
>
>
> Not sure how run_pylint.sh is related here - we should run linter on the 
> entire codebase.
>
>>
>> > 3)  We should reduce the code duplication across  
>> > beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite 
>> > definition into a common file like 
>> > beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate 
>> > individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle) 
>> > including the common file and/or logic from PythonNature [1].
>>
>> Exactly. I'll check it out.
>>
>> > 4) We have some tests that we run only under specific Python 3 versions, 
>> > for example: FlinkValidatesRunner test runs using Python 3.5: [2]
>> > HDFS Python 3 tests are running only with Python 3.7 [3]. Cross-language 
>> > Py3 tests for Spark are running under Python 3.5[4]: , there may be more 
>> > test suites that selectively use particular versions.
>> > We need to correct such suites, so that we do not tie them  to a specific 
>> > Python version. I see several options here: such tests should run either 
>> > for all high-priority versions, or run only under the lowest version among 
>> > the high-priority versions.  We don't have to fix them all at the same 
>> > time. In general, we should try to make it as easy as possible to 
>> > configure, whether a suite runs across all  versions, all high-priority 
>> > versions, or just one version.
>>
>> The way of high-priority/low-priority configuration would be useful for this.
>> And which versions to be tested may be related to 5).
>>
>> > 5) If postcommit suites (that need to run against all versions) still 
>> > constitute too much load on the infrastructure, we may need to investigate 
>> > how to run these suites less frequently.
>>
>> That's certainly true, beam_PostCommit_PythonXX and
>> beam_PostCommit_Python_Chicago_Taxi_(Dataflow|Flink) take about 1
>> hour.
>> Does anyone have knowledge about this?
>
>
> I am familiar with beam_PostCommit_PythonXX suites. Is there something 
> specific about these suites that you wanted to know?
> Test suite runtime will depend on the number of  tests in the suite, how many 
> tests we run in parallel, how long they take to run. To understand the load 
> on test infrastructure we can monitor Beam test health metrics [1]. In 
> particular, if time in queue[2] is high, it is a sign that there are not 
> enough Jenkins slots available to start the test suite earlier.
>
> [1] http://104.154.241.245/d/D81lW0pmk/post-commit-test-reliability
> [2] 
> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1588094891600&to=1588699691600&panelId=6&fullscreen
>
>
>>
>> 2020年5月2日(土) 5:18 Valentyn Tymofieiev :
>> >
>> > Hi Yoshiki,
>> >
>> > Thanks a lot for your help with Python 3 support so far and most recently, 
>> > with your work on Python 3.8.
>> >
>> > Overall the proposal sounds good to me. I see several aspects here that we 
>> > need to address:
>> >
>> > 1) We can seed the smoke test suite with typehints tests, and add more 
>> > tests later if there is a need. We can identify them by the file path or 
>> > by special attributes in test files. Identifying them using filepath seems 
>> > simpler and independent of test runner.
>> >
>> > 2) Defining high priority/low priority versions in gradle.properties 
>> > sounds good to me.
>> >
>> > 3)  We should reduce the code duplication across  
>> > beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite 
>> > definition into a common file like 
>> > beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate 
>> > individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle) 
>> > includin

[Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-05 Thread Alex Amato
Hello,

I created another design document. This time for GCP IO Debuggability
Metrics. Which defines some new metrics to collect in the GCP IO libraries.
This is for monitoring request counts and request latencies.

Please take a look and let me know what you think:
https://s.apache.org/beam-gcp-debuggability

I also sent out a separate design yesterday (
https://s.apache.org/beam-histogram-metrics) which is related as this
document uses a Histogram style metric :).

I would love some feedback to make this feature the best possible :D,
Alex


Re: Apache Beam application to Season of Docs 2020

2020-05-05 Thread Pablo Estrada
I think I am willing to help with Project 2.

On Tue, May 5, 2020 at 2:01 PM Aizhamal Nurmamat kyzy 
wrote:

> Thank you, Kyle! really appreciate it! I will add you as a mentor into the
> cwiki page, and let you know if Beam gets accepted to the program on May
> 11th.
>
> On Tue, May 5, 2020 at 12:55 PM Kyle Weaver  wrote:
>
>> Thanks Aizhamal! I would be willing to help with project 1.
>>
>> On Mon, May 4, 2020 at 5:04 PM Aizhamal Nurmamat kyzy <
>> aizha...@apache.org> wrote:
>>
>>> Hi all,
>>>
>>> I have submitted the application to the Season of Docs program with the
>>> project ideas we have developed last year [1]. I learnt about its deadline
>>> a few hours ago and didn't want to miss it.
>>>
>>> Feel free to add more project ideas (or edit the current ones) until May
>>> 7th.
>>>
>>> If Beam gets approved, we will get 1 or 2 experienced technical writers
>>> to help us document community processes or some Beam features. Is anyone
>>> else willing to mentor for these projects?
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs
>>>
>>


Re: Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Rui Wang
Welcome!


-Rui

On Tue, May 5, 2020 at 1:58 PM Chamikara Jayalath 
wrote:

> Congrats!
>
> On Tue, May 5, 2020 at 1:34 PM Robert Bradshaw 
> wrote:
>
>> Congratulations and welcome!
>>
>> On Tue, May 5, 2020 at 12:46 PM Ismaël Mejía  wrote:
>> >
>> > Hello Aldair,
>> > You were added in JIRA as a contributor.
>> > Welcome to the project!
>> >
>> > On Tue, May 5, 2020 at 8:59 PM Aldair Coronel Ruíz
>> >  wrote:
>> > >
>> > > Hi everyone!
>> > >
>> > > I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my
>> first time working on an open-source project and I'm excited.  My goal is
>> to add support for Azure Blobstore file system in Apache Beam Python SDK.
>> You can find my proposal here [1]
>> > >
>> > > Pablo Estrada and Ismaël Mejía are my mentors and I just want to say
>> thank you in advance. :)
>> > >
>> > > If you have any comments or suggestions feel free to let me know.
>> > >
>> > > One more thing:  Can someone add me as a contributor for Beam's Jira
>> issue tracker? My username is Aldair
>> > >
>> > > Thank you
>> > >
>> > > [1]
>> https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
>> > > --
>> > > Aldair Coronel Ruiz
>> > >
>>
>


Re: [DISCUSS] finishBundle once per window

2020-05-05 Thread Robert Bradshaw
On Tue, May 5, 2020 at 3:08 PM Reuven Lax  wrote:
>
> On Tue, May 5, 2020 at 2:58 PM Robert Bradshaw  wrote:
>>
>> On Mon, May 4, 2020 at 11:08 AM Reuven Lax  wrote:
>> >
>> > This should not affect the ability of the user to specify the output 
>> > timestamp.
>>
>> My question was whether we would require it.
>
>
> My current PR does not - it defaults to end-of-window as the timestamp. 
> However we could also decide to require it.

I'd be more comfortable requiring it for the time being.

>> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
>> >
>> > There was a mention in some other thread, that in order to make user 
>> > experience as predictable as possible, we should try to make windows 
>> > idempotent, and once window is assigned, it should be never changed (and 
>> > timestamp move outside of the scope of window, unless a different windowfn 
>> > is applied). Because all Beam window functions are actually time based, 
>> > and output timestamp is known, what is the issue of applying windowfn to 
>> > elements output from @FinishBundle and assign the windows automatically?
>>
>> We used to do exactly this. (I don't recall why it was removed.) If
>> the input element and/or window was queried by the WindowFn
>> (admittedly rare), it would fail at runtime.
>
> When did we used to do this? We've had users writing WindowFns that queried 
> the input element since long before Beam existed.  e.g a window fn that 
> inspected a userId field, and created different sized windows based on the 
> userId.

This is how it started. In particular WindowFn.AssignContext would be
created that through an exception on accessing the unavailable fields
(which would make finalize bundle unsuitable for such WindowFns).

>> On Tue, May 5, 2020 at 2:51 PM Reuven Lax  wrote:
>> >
>> > It's a good question about startBundle - it's something I thought about. 
>> > The problem is that a runner doesn't always know at startBundle what 
>> > windows are in the bundle, and even if it does know it might require the 
>> > runner to run two passes over the bundle to figure this out. Alternatively 
>> > the runner could keep calling startBundle the first time it's seen a new 
>> > window in the bundle, but I think that makes things even weirder. It's 
>> > also worth noting that startBundle is already more limited today - we do 
>> > not support calling output from startBundle, but we do allow calling 
>> > output from finishBundle.
>> >
>> > Reuven
>> >
>> > On Mon, May 4, 2020 at 11:59 PM Jan Lukavský  wrote:
>> >>
>> >> Ah, interesting. That makes windowFn non-idempotent by definition, 
>> >> because its first application (e.g. global window -> interval window) 
>> >> _might_ yield different result than second application with interval 
>> >> window already assigned. On the other hand, let's suppose for a moment we 
>> >> can make windowFn idempotent, would that solve the issue of window 
>> >> assignment for elements output from finishBundle? I understand that 
>> >> window assignment is not only motivation for adding optional window 
>> >> parameter to @FinishBundle, but users might be confused why 
>> >> OutputReceiver is working only when there is Window parameter. It would 
>> >> be nice to have this somewhat more "consistent". And last note - adding 
>> >> the parameter to @FinishBundle seems a little imbalanced - could this be 
>> >> made possible for @StartBundle as well? Should we enforce that both 
>> >> @StartBundle and @FinishBundle have the same signature, or should we 
>> >> accept all combinations?
>> >>
>> >> Jan
>> >>
>> >> On 5/4/20 11:02 PM, Reuven Lax wrote:
>> >>
>> >> I assume you are referring to elements output from finishBundle.
>> >>
>> >> The problem is that the current window is an input to 
>> >> WindowFn.assignWindows. The new window can depend on the timestamp, the 
>> >> element itself, and the original window. I'm not sure how many users rely 
>> >> on this, however it has been part of our public windowing API for a long 
>> >> time, so I would guess that some users do use this in their WindowFns.
>> >>
>> >> Reuven
>> >>
>> >> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
>> >>>
>> >>> There was a mention in some other thread, that in order to make user 
>> >>> experience as predictable as possible, we should try to make windows 
>> >>> idempotent, and once window is assigned, it should be never changed (and 
>> >>> timestamp move outside of the scope of window, unless a different 
>> >>> windowfn is applied). Because all Beam window functions are actually 
>> >>> time based, and output timestamp is known, what is the issue of applying 
>> >>> windowfn to elements output from @FinishBundle and assign the windows 
>> >>> automatically?
>> >>>
>> >>> On 5/4/20 8:07 PM, Reuven Lax wrote:
>> >>>
>> >>> This should not affect the ability of the user to specify the output 
>> >>> timestamp. Today FinishBundleContext.output forces you to specify the 
>> >>> window as well as the timestamp, whi

Re: [DISCUSS] finishBundle once per window

2020-05-05 Thread Reuven Lax
On Tue, May 5, 2020 at 2:58 PM Robert Bradshaw  wrote:

> On Mon, May 4, 2020 at 11:08 AM Reuven Lax  wrote:
> >
> > This should not affect the ability of the user to specify the output
> timestamp.
>
> My question was whether we would require it.
>

My current PR does not - it defaults to end-of-window as the timestamp.
However we could also decide to require it.


>
>
> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
> >
> > There was a mention in some other thread, that in order to make user
> experience as predictable as possible, we should try to make windows
> idempotent, and once window is assigned, it should be never changed (and
> timestamp move outside of the scope of window, unless a different windowfn
> is applied). Because all Beam window functions are actually time based, and
> output timestamp is known, what is the issue of applying windowfn to
> elements output from @FinishBundle and assign the windows automatically?
>
> We used to do exactly this. (I don't recall why it was removed.) If
> the input element and/or window was queried by the WindowFn
> (admittedly rare), it would fail at runtime.
>

When did we used to do this? We've had users writing WindowFns that queried
the input element since long before Beam existed.  e.g a window fn that
inspected a userId field, and created different sized windows based on the
userId.


> On Tue, May 5, 2020 at 2:51 PM Reuven Lax  wrote:
> >
> > It's a good question about startBundle - it's something I thought about.
> The problem is that a runner doesn't always know at startBundle what
> windows are in the bundle, and even if it does know it might require the
> runner to run two passes over the bundle to figure this out. Alternatively
> the runner could keep calling startBundle the first time it's seen a new
> window in the bundle, but I think that makes things even weirder. It's also
> worth noting that startBundle is already more limited today - we do not
> support calling output from startBundle, but we do allow calling output
> from finishBundle.
> >
> > Reuven
> >
> > On Mon, May 4, 2020 at 11:59 PM Jan Lukavský  wrote:
> >>
> >> Ah, interesting. That makes windowFn non-idempotent by definition,
> because its first application (e.g. global window -> interval window)
> _might_ yield different result than second application with interval window
> already assigned. On the other hand, let's suppose for a moment we can make
> windowFn idempotent, would that solve the issue of window assignment for
> elements output from finishBundle? I understand that window assignment is
> not only motivation for adding optional window parameter to @FinishBundle,
> but users might be confused why OutputReceiver is working only when there
> is Window parameter. It would be nice to have this somewhat more
> "consistent". And last note - adding the parameter to @FinishBundle seems a
> little imbalanced - could this be made possible for @StartBundle as well?
> Should we enforce that both @StartBundle and @FinishBundle have the same
> signature, or should we accept all combinations?
> >>
> >> Jan
> >>
> >> On 5/4/20 11:02 PM, Reuven Lax wrote:
> >>
> >> I assume you are referring to elements output from finishBundle.
> >>
> >> The problem is that the current window is an input to
> WindowFn.assignWindows. The new window can depend on the timestamp, the
> element itself, and the original window. I'm not sure how many users rely
> on this, however it has been part of our public windowing API for a long
> time, so I would guess that some users do use this in their WindowFns.
> >>
> >> Reuven
> >>
> >> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
> >>>
> >>> There was a mention in some other thread, that in order to make user
> experience as predictable as possible, we should try to make windows
> idempotent, and once window is assigned, it should be never changed (and
> timestamp move outside of the scope of window, unless a different windowfn
> is applied). Because all Beam window functions are actually time based, and
> output timestamp is known, what is the issue of applying windowfn to
> elements output from @FinishBundle and assign the windows automatically?
> >>>
> >>> On 5/4/20 8:07 PM, Reuven Lax wrote:
> >>>
> >>> This should not affect the ability of the user to specify the output
> timestamp. Today FinishBundleContext.output forces you to specify the
> window as well as the timestamp, which is a bit awkward. (I believe that it
> also lets you create brand new windows in finishBundle, which is
> interesting, but I'm not quite sure of the use case).
> >>>
> >>> On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw 
> wrote:
> 
>  This is a really nice idea. Would the user still need to specify the
>  timestamp of the output? I'm a bit ambivalent about calling it
>  multiple times if OuptutReceiver alone is in the parameter list; this
>  might not be obvious and could be surprising behavior.
> 
>  On Mon, May 4, 2020 at 10:13 AM Re

Re: [DISCUSS] finishBundle once per window

2020-05-05 Thread Robert Bradshaw
On Mon, May 4, 2020 at 11:08 AM Reuven Lax  wrote:
>
> This should not affect the ability of the user to specify the output 
> timestamp.

My question was whether we would require it.


On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
>
> There was a mention in some other thread, that in order to make user 
> experience as predictable as possible, we should try to make windows 
> idempotent, and once window is assigned, it should be never changed (and 
> timestamp move outside of the scope of window, unless a different windowfn is 
> applied). Because all Beam window functions are actually time based, and 
> output timestamp is known, what is the issue of applying windowfn to elements 
> output from @FinishBundle and assign the windows automatically?

We used to do exactly this. (I don't recall why it was removed.) If
the input element and/or window was queried by the WindowFn
(admittedly rare), it would fail at runtime.

On Tue, May 5, 2020 at 2:51 PM Reuven Lax  wrote:
>
> It's a good question about startBundle - it's something I thought about. The 
> problem is that a runner doesn't always know at startBundle what windows are 
> in the bundle, and even if it does know it might require the runner to run 
> two passes over the bundle to figure this out. Alternatively the runner could 
> keep calling startBundle the first time it's seen a new window in the bundle, 
> but I think that makes things even weirder. It's also worth noting that 
> startBundle is already more limited today - we do not support calling output 
> from startBundle, but we do allow calling output from finishBundle.
>
> Reuven
>
> On Mon, May 4, 2020 at 11:59 PM Jan Lukavský  wrote:
>>
>> Ah, interesting. That makes windowFn non-idempotent by definition, because 
>> its first application (e.g. global window -> interval window) _might_ yield 
>> different result than second application with interval window already 
>> assigned. On the other hand, let's suppose for a moment we can make windowFn 
>> idempotent, would that solve the issue of window assignment for elements 
>> output from finishBundle? I understand that window assignment is not only 
>> motivation for adding optional window parameter to @FinishBundle, but users 
>> might be confused why OutputReceiver is working only when there is Window 
>> parameter. It would be nice to have this somewhat more "consistent". And 
>> last note - adding the parameter to @FinishBundle seems a little imbalanced 
>> - could this be made possible for @StartBundle as well? Should we enforce 
>> that both @StartBundle and @FinishBundle have the same signature, or should 
>> we accept all combinations?
>>
>> Jan
>>
>> On 5/4/20 11:02 PM, Reuven Lax wrote:
>>
>> I assume you are referring to elements output from finishBundle.
>>
>> The problem is that the current window is an input to 
>> WindowFn.assignWindows. The new window can depend on the timestamp, the 
>> element itself, and the original window. I'm not sure how many users rely on 
>> this, however it has been part of our public windowing API for a long time, 
>> so I would guess that some users do use this in their WindowFns.
>>
>> Reuven
>>
>> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
>>>
>>> There was a mention in some other thread, that in order to make user 
>>> experience as predictable as possible, we should try to make windows 
>>> idempotent, and once window is assigned, it should be never changed (and 
>>> timestamp move outside of the scope of window, unless a different windowfn 
>>> is applied). Because all Beam window functions are actually time based, and 
>>> output timestamp is known, what is the issue of applying windowfn to 
>>> elements output from @FinishBundle and assign the windows automatically?
>>>
>>> On 5/4/20 8:07 PM, Reuven Lax wrote:
>>>
>>> This should not affect the ability of the user to specify the output 
>>> timestamp. Today FinishBundleContext.output forces you to specify the 
>>> window as well as the timestamp, which is a bit awkward. (I believe that it 
>>> also lets you create brand new windows in finishBundle, which is 
>>> interesting, but I'm not quite sure of the use case).
>>>
>>> On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw  wrote:

 This is a really nice idea. Would the user still need to specify the
 timestamp of the output? I'm a bit ambivalent about calling it
 multiple times if OuptutReceiver alone is in the parameter list; this
 might not be obvious and could be surprising behavior.

 On Mon, May 4, 2020 at 10:13 AM Reuven Lax  wrote:
 >
 > I would like to discuss a minor extension to the Beam model.
 >
 > Beam bundles have very few restrictions on what can be in a bundle, in 
 > particular s bundle might contain records for many different windows. 
 > This was an explicit decision as bundling primarily exists for 
 > performance reasons and we found that limiting bundling based on windows 
 > or timestamps often l

Re: [DISCUSS] finishBundle once per window

2020-05-05 Thread Reuven Lax
It's a good question about startBundle - it's something I thought about.
The problem is that a runner doesn't always know at startBundle what
windows are in the bundle, and even if it does know it might require the
runner to run two passes over the bundle to figure this out. Alternatively
the runner could keep calling startBundle the first time it's seen a new
window in the bundle, but I think that makes things even weirder. It's also
worth noting that startBundle is already more limited today - we do not
support calling output from startBundle, but we do allow calling output
from finishBundle.

Reuven

On Mon, May 4, 2020 at 11:59 PM Jan Lukavský  wrote:

> Ah, interesting. That makes windowFn non-idempotent by definition, because
> its first application (e.g. global window -> interval window) _might_ yield
> different result than second application with interval window already
> assigned. On the other hand, let's suppose for a moment we can make
> windowFn idempotent, would that solve the issue of window assignment for
> elements output from finishBundle? I understand that window assignment is
> not only motivation for adding optional window parameter to @FinishBundle,
> but users might be confused why OutputReceiver is working only when there
> is Window parameter. It would be nice to have this somewhat more
> "consistent". And last note - adding the parameter to @FinishBundle seems a
> little imbalanced - could this be made possible for @StartBundle as well?
> Should we enforce that both @StartBundle and @FinishBundle have the same
> signature, or should we accept all combinations?
>
> Jan
> On 5/4/20 11:02 PM, Reuven Lax wrote:
>
> I assume you are referring to elements output from finishBundle.
>
> The problem is that the current window is an input to
> WindowFn.assignWindows. The new window can depend on the timestamp, the
> element itself, and the original window. I'm not sure how many users rely
> on this, however it has been part of our public windowing API for a long
> time, so I would guess that some users do use this in their WindowFns.
>
> Reuven
>
> On Mon, May 4, 2020 at 11:48 AM Jan Lukavský  wrote:
>
>> There was a mention in some other thread, that in order to make user
>> experience as predictable as possible, we should try to make windows
>> idempotent, and once window is assigned, it should be never changed (and
>> timestamp move outside of the scope of window, unless a different windowfn
>> is applied). Because all Beam window functions are actually time based, and
>> output timestamp is known, what is the issue of applying windowfn to
>> elements output from @FinishBundle and assign the windows automatically?
>> On 5/4/20 8:07 PM, Reuven Lax wrote:
>>
>> This should not affect the ability of the user to specify the output
>> timestamp. Today FinishBundleContext.output forces you to specify the
>> window as well as the timestamp, which is a bit awkward. (I believe that it
>> also lets you create brand new windows in finishBundle, which is
>> interesting, but I'm not quite sure of the use case).
>>
>> On Mon, May 4, 2020 at 10:29 AM Robert Bradshaw 
>> wrote:
>>
>>> This is a really nice idea. Would the user still need to specify the
>>> timestamp of the output? I'm a bit ambivalent about calling it
>>> multiple times if OuptutReceiver alone is in the parameter list; this
>>> might not be obvious and could be surprising behavior.
>>>
>>> On Mon, May 4, 2020 at 10:13 AM Reuven Lax  wrote:
>>> >
>>> > I would like to discuss a minor extension to the Beam model.
>>> >
>>> > Beam bundles have very few restrictions on what can be in a bundle, in
>>> particular s bundle might contain records for many different windows. This
>>> was an explicit decision as bundling primarily exists for performance
>>> reasons and we found that limiting bundling based on windows or timestamps
>>> often led to severe performance problems. However it sometimes makes
>>> finishBundle hard to use.
>>> >
>>> > I've seen multiple cases where users maintain some state in their DoFn
>>> that needs finalizing (e.g. writing to an external service) in
>>> finishBundle. Often users end up keeping lists of all windows seen in the
>>> bundle so they can be processed separately (or sometimes not realizing that
>>> their can be multiple windows and writing incorrect code).
>>> >
>>> > The lack of a window also means that we don't currently support
>>> injecting an OuptutReceiver into finishBundle, as there's no good way of
>>> knowing which window output should be put into.
>>> >
>>> > I would like to propose adding a way for finishBundle to inspect the
>>> window, either directly (via a BoundedWindow parameter) or indirectly (via
>>> an OutputReceiver parameter). In this case, we will execute finishBundle
>>> once per window in the bundle. Otherwise, we will execute finishBundle once
>>> at the end of the bundle as before. This behavior is backwards compatible,
>>> as previously these parameters were disallowed in finishBun

[DISCUSS] Dealing with @Ignored tests

2020-05-05 Thread Jan Lukavský

Hi,

it seems we are accumulating test cases (see discussion in [1]) that are 
marked as @Ignored (mostly due to flakiness), which is generally 
undesirable. Associated JIRAs seem to be open for a long time, and this 
might generally cause that we loose code coverage. Would anyone have 
idea on how to visualize these Ignored tests better? My first idea would 
be something similar to "Beam dependency check report", but that seems 
to be not the best example (which is completely different issue :)).


Jan

[1] https://github.com/apache/beam/pull/11614



Re: Apache Beam application to Season of Docs 2020

2020-05-05 Thread Aizhamal Nurmamat kyzy
Thank you, Kyle! really appreciate it! I will add you as a mentor into the
cwiki page, and let you know if Beam gets accepted to the program on May
11th.

On Tue, May 5, 2020 at 12:55 PM Kyle Weaver  wrote:

> Thanks Aizhamal! I would be willing to help with project 1.
>
> On Mon, May 4, 2020 at 5:04 PM Aizhamal Nurmamat kyzy 
> wrote:
>
>> Hi all,
>>
>> I have submitted the application to the Season of Docs program with the
>> project ideas we have developed last year [1]. I learnt about its deadline
>> a few hours ago and didn't want to miss it.
>>
>> Feel free to add more project ideas (or edit the current ones) until May
>> 7th.
>>
>> If Beam gets approved, we will get 1 or 2 experienced technical writers
>> to help us document community processes or some Beam features. Is anyone
>> else willing to mentor for these projects?
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs
>>
>


Re: Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Chamikara Jayalath
Congrats!

On Tue, May 5, 2020 at 1:34 PM Robert Bradshaw  wrote:

> Congratulations and welcome!
>
> On Tue, May 5, 2020 at 12:46 PM Ismaël Mejía  wrote:
> >
> > Hello Aldair,
> > You were added in JIRA as a contributor.
> > Welcome to the project!
> >
> > On Tue, May 5, 2020 at 8:59 PM Aldair Coronel Ruíz
> >  wrote:
> > >
> > > Hi everyone!
> > >
> > > I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my
> first time working on an open-source project and I'm excited.  My goal is
> to add support for Azure Blobstore file system in Apache Beam Python SDK.
> You can find my proposal here [1]
> > >
> > > Pablo Estrada and Ismaël Mejía are my mentors and I just want to say
> thank you in advance. :)
> > >
> > > If you have any comments or suggestions feel free to let me know.
> > >
> > > One more thing:  Can someone add me as a contributor for Beam's Jira
> issue tracker? My username is Aldair
> > >
> > > Thank you
> > >
> > > [1]
> https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
> > > --
> > > Aldair Coronel Ruiz
> > >
>


Re: Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Robert Bradshaw
Congratulations and welcome!

On Tue, May 5, 2020 at 12:46 PM Ismaël Mejía  wrote:
>
> Hello Aldair,
> You were added in JIRA as a contributor.
> Welcome to the project!
>
> On Tue, May 5, 2020 at 8:59 PM Aldair Coronel Ruíz
>  wrote:
> >
> > Hi everyone!
> >
> > I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my first 
> > time working on an open-source project and I'm excited.  My goal is to add 
> > support for Azure Blobstore file system in Apache Beam Python SDK.  You can 
> > find my proposal here [1]
> >
> > Pablo Estrada and Ismaël Mejía are my mentors and I just want to say thank 
> > you in advance. :)
> >
> > If you have any comments or suggestions feel free to let me know.
> >
> > One more thing:  Can someone add me as a contributor for Beam's Jira issue 
> > tracker? My username is Aldair
> >
> > Thank you
> >
> > [1] 
> > https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
> > --
> > Aldair Coronel Ruiz
> >


Re: Apache Beam application to Season of Docs 2020

2020-05-05 Thread Kyle Weaver
Thanks Aizhamal! I would be willing to help with project 1.

On Mon, May 4, 2020 at 5:04 PM Aizhamal Nurmamat kyzy 
wrote:

> Hi all,
>
> I have submitted the application to the Season of Docs program with the
> project ideas we have developed last year [1]. I learnt about its deadline
> a few hours ago and didn't want to miss it.
>
> Feel free to add more project ideas (or edit the current ones) until May
> 7th.
>
> If Beam gets approved, we will get 1 or 2 experienced technical writers to
> help us document community processes or some Beam features. Is anyone else
> willing to mentor for these projects?
>
> [1] https://cwiki.apache.org/confluence/display/BEAM/Google+Season+of+Docs
>


Re: Java pre-commit stability

2020-05-05 Thread Ismaël Mejía
Both are well known issues that I have been impacted by multiple times.
Both have associated JIRAs but it seems nobody is really working on them.

https://issues.apache.org/jira/browse/BEAM-9164
https://issues.apache.org/jira/browse/BEAM-8035

On Tue, May 5, 2020 at 9:43 PM Jan Lukavský  wrote:
>
> Hi,
>
> I'm experiencing failures in java precommit checks. Last three failures
> (unrelated to PR) I've seen were:
>
>   -
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapperTest$ParameterizedUnboundedSourceWrapperTest.testWatermarkEmission[numTasks
> = 2; numSplits=2]
>
>(org.junit.runners.model.TestTimedOutException: test timed out after
> 3 milliseconds)
>
>   -
> org.apache.beam.sdk.transforms.WatchTest.testMultiplePollsWithManyResults
>
>(java.lang.AssertionError: Drop timestamped
> input/Values/Map/ParMultiDo(Anonymous).output: Outputs must be in
> timestamp order)
>
>   -
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapperTest$ParameterizedUnboundedSourceWrapperTest.testWatermarkEmission[numTasks
> = 1; numSplits=1]
>
>(org.junit.runners.model.TestTimedOutException: test timed out after
> 3 milliseconds)
>
> This might be somewhat related to infrastructure (although probably
> signals that these tests are flaky), is anyone else experiencing this?
>
>   Jan
>


Re: Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Ismaël Mejía
Hello Aldair,
You were added in JIRA as a contributor.
Welcome to the project!

On Tue, May 5, 2020 at 8:59 PM Aldair Coronel Ruíz
 wrote:
>
> Hi everyone!
>
> I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my first 
> time working on an open-source project and I'm excited.  My goal is to add 
> support for Azure Blobstore file system in Apache Beam Python SDK.  You can 
> find my proposal here [1]
>
> Pablo Estrada and Ismaël Mejía are my mentors and I just want to say thank 
> you in advance. :)
>
> If you have any comments or suggestions feel free to let me know.
>
> One more thing:  Can someone add me as a contributor for Beam's Jira issue 
> tracker? My username is Aldair
>
> Thank you
>
> [1] 
> https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
> --
> Aldair Coronel Ruiz
>


Java pre-commit stability

2020-05-05 Thread Jan Lukavský

Hi,

I'm experiencing failures in java precommit checks. Last three failures 
(unrelated to PR) I've seen were:


 - 
org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapperTest$ParameterizedUnboundedSourceWrapperTest.testWatermarkEmission[numTasks 
= 2; numSplits=2]


  (org.junit.runners.model.TestTimedOutException: test timed out after 
3 milliseconds)


 - 
org.apache.beam.sdk.transforms.WatchTest.testMultiplePollsWithManyResults


  (java.lang.AssertionError: Drop timestamped 
input/Values/Map/ParMultiDo(Anonymous).output: Outputs must be in 
timestamp order)


 - 
org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapperTest$ParameterizedUnboundedSourceWrapperTest.testWatermarkEmission[numTasks 
= 1; numSplits=1]


  (org.junit.runners.model.TestTimedOutException: test timed out after 
3 milliseconds)


This might be somewhat related to infrastructure (although probably 
signals that these tests are flaky), is anyone else experiencing this?


 Jan



Re: Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Kyle Weaver
Congrats and welcome to the project!

On Tue, May 5, 2020 at 2:59 PM Aldair Coronel Ruíz <
aldai...@ciencias.unam.mx> wrote:

> Hi everyone!
>
> I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my
> first time working on an open-source project and I'm excited.  My goal is
> to add support for Azure Blobstore file system in Apache Beam Python SDK.
> You can find my proposal here [1]
>
> Pablo Estrada and Ismaël Mejía are my mentors and I just want to say thank
> you in advance. :)
>
> If you have any comments or suggestions feel free to let me know.
>
> *One more thing:*  Can someone add me as a contributor for Beam's Jira
> issue tracker? My username is *Aldair *
>
> Thank you
>
> [1]
> https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
> --
> *Aldair Coronel Ruiz*
>
>


Re: Google Summer of Coding Proposal

2020-05-05 Thread Ismaël Mejía
I just added your id to JIRA. Welcome Qihang!

On Tue, May 5, 2020 at 6:25 PM Qihang Zeng  wrote:
>
> Just a quick follow-up. I wonder if I could be added to the Jira dev list. 
> Here is my ID: QZMark.
>
> Much appreciated,
> Qihang
>
> On Tue, May 5, 2020 at 2:43 PM Qihang Zeng  wrote:
>>
>> Hi Beam Dev Team,
>>
>> I'm Qihang. I just found an acceptance letter from GSoC. I am very grateful 
>> and excited to be selected as a participant of Apache's summer projects!
>>
>> I look forward to dedicating to some projects in the coming months. I will 
>> work closely with my mentor to make sure I am ready for the proposed project 
>> before the summer period starts.
>>
>> Sincerely,
>> Qihang
>>
>>
>>
>> On Mon, Mar 30, 2020 at 12:46 PM Qihang Zeng  wrote:
>>>
>>> Dear Beam Community,
>>>
>>> Hello! My name is Qihang and I am a candidate for this year's google summer 
>>> of coding. I am a senior in CS major at New York University Shanghai. 
>>> Previously, I had some experience with Apache Parquet, Apache Carbondata 
>>> and Apache Spark. I have long been interested in participating in some ASF 
>>> projects, particularly databases projects.
>>>
>>> One of the Beam projects caught my attention and I started to contact 
>>> mentor Mr. Rui Wang about 2 weeks ago. Mr. Wang gave me some advice and I 
>>> finished a proposal which is attached below. Though it is now very close to 
>>> the deadline, do not hesitate to give any comments. : ) I have also shared 
>>> the draft through the google summer of coding portal and anyone could edit 
>>> it or review it.
>>>
>>> Sincerely,
>>> Qihang


Google Summer of Code 2020 [Accepted Proposal]

2020-05-05 Thread Aldair Coronel Ruíz
Hi everyone!

I'm Aldair. My GSoC 2020 project proposal was accepted! It will be my first
time working on an open-source project and I'm excited.  My goal is to add
support for Azure Blobstore file system in Apache Beam Python SDK.  You can
find my proposal here [1]

Pablo Estrada and Ismaël Mejía are my mentors and I just want to say thank
you in advance. :)

If you have any comments or suggestions feel free to let me know.

*One more thing:*  Can someone add me as a contributor for Beam's Jira
issue tracker? My username is *Aldair *

Thank you

[1]
https://docs.google.com/document/d/1iax7Rkxoe3dxsAdxLwc8bkFGM6axom0foOePyjKI-6Y/edit
-- 
*Aldair Coronel Ruiz*


Re: [DISCUSS] How many Python 3.x minor versions should Beam Python SDK aim to support concurrently?

2020-05-05 Thread Valentyn Tymofieiev
On Mon, May 4, 2020 at 7:06 PM Yoshiki Obata 
wrote:

> Thank you for comment, Valentyn.
>
> > 1) We can seed the smoke test suite with typehints tests, and add more
> tests later if there is a need. We can identify them by the file path or by
> special attributes in test files. Identifying them using filepath seems
> simpler and independent of test runner.
>
> Yes, making run_pylint.sh allow target test file paths as arguments is
> good way if could.
>

Not sure how run_pylint.sh is related here - we should run linter on the
entire codebase.


> > 3)  We should reduce the code duplication across
> beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite
> definition into a common file like
> beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate
> individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle)
> including the common file and/or logic from PythonNature [1].
>
> Exactly. I'll check it out.
>
> > 4) We have some tests that we run only under specific Python 3 versions,
> for example: FlinkValidatesRunner test runs using Python 3.5: [2]
> > HDFS Python 3 tests are running only with Python 3.7 [3]. Cross-language
> Py3 tests for Spark are running under Python 3.5[4]: , there may be more
> test suites that selectively use particular versions.
> > We need to correct such suites, so that we do not tie them  to a
> specific Python version. I see several options here: such tests should run
> either for all high-priority versions, or run only under the lowest version
> among the high-priority versions.  We don't have to fix them all at the
> same time. In general, we should try to make it as easy as possible to
> configure, whether a suite runs across all  versions, all high-priority
> versions, or just one version.
>
> The way of high-priority/low-priority configuration would be useful for
> this.
> And which versions to be tested may be related to 5).
>
> > 5) If postcommit suites (that need to run against all versions) still
> constitute too much load on the infrastructure, we may need to investigate
> how to run these suites less frequently.
>
> That's certainly true, beam_PostCommit_PythonXX and
> beam_PostCommit_Python_Chicago_Taxi_(Dataflow|Flink) take about 1
> hour.
> Does anyone have knowledge about this?


I am familiar with beam_PostCommit_PythonXX suites. Is there something
specific about these suites that you wanted to know?
Test suite runtime will depend on the number of  tests in the suite, how
many tests we run in parallel, how long they take to run. To understand the
load on test infrastructure we can monitor Beam test health metrics [1]. In
particular, if time in queue[2] is high, it is a sign that there are not
enough Jenkins slots available to start the test suite earlier.

[1] http://104.154.241.245/d/D81lW0pmk/post-commit-test-reliability
[2]
http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1588094891600&to=1588699691600&panelId=6&fullscreen



> 2020年5月2日(土) 5:18 Valentyn Tymofieiev :
> >
> > Hi Yoshiki,
> >
> > Thanks a lot for your help with Python 3 support so far and most
> recently, with your work on Python 3.8.
> >
> > Overall the proposal sounds good to me. I see several aspects here that
> we need to address:
> >
> > 1) We can seed the smoke test suite with typehints tests, and add more
> tests later if there is a need. We can identify them by the file path or by
> special attributes in test files. Identifying them using filepath seems
> simpler and independent of test runner.
> >
> > 2) Defining high priority/low priority versions in gradle.properties
> sounds good to me.
> >
> > 3)  We should reduce the code duplication across
> beam/sdks/python/test-suites/$runner/py3*. I think we could move the suite
> definition into a common file like
> beam/sdks/python/test-suites/$runner/build.gradle perhaps, and populate
> individual suites (beam/sdks/python/test-suites/$runner/py38/build.gradle)
> including the common file and/or logic from PythonNature [1].
> >
> > 4) We have some tests that we run only under specific Python 3 versions,
> for example: FlinkValidatesRunner test runs using Python 3.5: [2]
> > HDFS Python 3 tests are running only with Python 3.7 [3]. Cross-language
> Py3 tests for Spark are running under Python 3.5[4]: , there may be more
> test suites that selectively use particular versions.
> >
> > We need to correct such suites, so that we do not tie them  to a
> specific Python version. I see several options here: such tests should run
> either for all high-priority versions, or run only under the lowest version
> among the high-priority versions.  We don't have to fix them all at the
> same time. In general, we should try to make it as easy as possible to
> configure, whether a suite runs across all  versions, all high-priority
> versions, or just one version.
> >
> > 5) If postcommit suites (that need to run against all versions) still
> constitute too much load on the infrastructure, w

Re: GSoC 2020: Congratulations, your proposal with The Apache Software Foundation has been accepted!

2020-05-05 Thread Rui Wang
Hi Deepak,

I am afraid this year's GSoC projects selection has finished [1]. You have
to wait for next time (probably year 2021).



[1]: https://summerofcode.withgoogle.com/how-it-works/#

-Rui

On Tue, May 5, 2020 at 1:51 AM deepak kumar  wrote:

> I would like to contribute as well to the GSoC.
> Can i work on any?
>
> Thanks
> Deepak
>
> On Tue, May 5, 2020 at 8:37 AM John Mora  wrote:
>
>> Hi all.
>>
>> My proposal for GSoC was accepted, so this summer I will be working with
>> you guys in the aggregation analytics functionality of Beam. Thanks so much
>> for your support during the application period, specially to my mentor Rui
>> Wang.
>>
>> Please let me know if you have suggestions or ideas for my project.
>>
>> Cheers,
>> John
>>
>> -- Forwarded message -
>> De: Google Summer of Code 
>> Date: lun., 4 may. 2020 a las 12:53
>> Subject: GSoC 2020: Congratulations, your proposal with The Apache
>> Software Foundation has been accepted!
>> To: 
>>
>>
>> [image: Google Summer of Code]
>>
>> Hi John Mora,
>>
>> Your proposal BeamSQL aggregation analytics functionality
>> 
>> has been accepted!
>>
>> Welcome to GSoC 2020!
>>
>> We look forward to seeing the great things you will accomplish this
>> summer with The Apache Software Foundation.
>>
>> The next thing you need to do is read the Information for Accepted
>> Students
>> .
>> It contains important information you need to know about your participation
>> in GSoC 2020.
>>
>> You will receive another email in the next few days with information
>> about your stipend.
>>
>> If you have any questions, please email the Google Summer of Code support
>> team at gsoc-supp...@google.com.
>>
>> Have a great summer!
>>
>> -*Google Summer of Code team*
>>
>> This email was sent to jhnmora...@gmail.com.
>>
>> You are receiving this email because of your participation in Google
>> Summer of Code 2020.
>> https://summerofcode.withgoogle.com
>>
>> To leave the program and stop receiving all emails, you can go to your
>> profile  and
>> request deletion of your program profile.
>>
>> For any questions, please contact gsoc-supp...@google.com. Replies to
>> this message go to an unmonitored mailbox.
>>
>> © 2020 Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043,
>> USA
>>
>


Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-05 Thread Eleanore Jin
Hi Max,

Thanks for the info!

Eleanore

On Tue, May 5, 2020 at 4:01 AM Maximilian Michels  wrote:

> Hey Eleanore,
>
> The change will be part of the 2.21.0 release.
>
> -Max
>
> On 04.05.20 19:14, Eleanore Jin wrote:
> > Hi Max,
> >
> > Thanks for the information and I saw this PR is already merged, just
> > wonder is it backported to the affected versions already
> > (i.e. 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have
> > to wait for the 2.20.1 release?
> >
> > Thanks a lot!
> > Eleanore
> >
> > On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels  > > wrote:
> >
> > Hi Eleanore,
> >
> > Exactly-once is not affected but the pipeline can fail to checkpoint
> > after the maximum number of state cells have been reached. We are
> > working on a fix [1].
> >
> > Cheers,
> > Max
> >
> > [1] https://github.com/apache/beam/pull/11478
> >
> > On 22.04.20 07:19, Eleanore Jin wrote:
> > > Hi Maxi,
> > >
> > > I assume this will impact the Exactly Once Semantics that beam
> > provided
> > > as in the KafkaExactlyOnceSink, the processElement method is also
> > > annotated with @RequiresStableInput?
> > >
> > > Thanks a lot!
> > > Eleanore
> > >
> > > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels
> > mailto:m...@apache.org>
> > > >> wrote:
> > >
> > > Hi Stephen,
> > >
> > > Thanks for reporting the issue! David, good catch!
> > >
> > > I think we have to resort to only using a single state cell for
> > > buffering on checkpoints, instead of using a new one for every
> > > checkpoint. I was under the assumption that, if the state cell
> was
> > > cleared, it would not be checkpointed but that does not seem
> to be
> > > the case.
> > >
> > > Thanks,
> > > Max
> > >
> > > On 21.04.20 09:29, David Morávek wrote:
> > > > Hi Stephen,
> > > >
> > > > nice catch and awesome report! ;) This definitely needs a
> > proper fix.
> > > > I've created a new JIRA to track the issue and will try to
> > resolve it
> > > > soon as this seems critical to me.
> > > >
> > > > https://issues.apache.org/jira/browse/BEAM-9794
> > > >
> > > > Thanks,
> > > > D.
> > > >
> > > > On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel
> > > mailto:stephenpate...@gmail.com>
> > >
> > > >  > 
> > >  >  > > >
> > > > I was able to reproduce this in a unit test:
> > > >
> > > > @Test
> > > >
> > > >   *public* *void* test() *throws*
> InterruptedException,
> > > > ExecutionException {
> > > >
> > > > FlinkPipelineOptions options =
> > > >
> >  PipelineOptionsFactory./as/(FlinkPipelineOptions.*class*);
> > > >
> > > > options.setCheckpointingInterval(10L);
> > > >
> > > > options.setParallelism(1);
> > > >
> > > > options.setStreaming(*true*);
> > > >
> > > > options.setRunner(FlinkRunner.*class*);
> > > >
> > > > options.setFlinkMaster("[local]");
> > > >
> > > > options.setStateBackend(*new*
> > > > MemoryStateBackend(Integer.*/MAX_VALUE/*));
> > > >
> > > > Pipeline pipeline = Pipeline./create/(options);
> > > >
> > > > pipeline
> > > >
> > > > .apply(Create./of/((Void) *null*))
> > > >
> > > > .apply(
> > > >
> > > > ParDo./of/(
> > > >
> > > > *new* DoFn() {
> > > >
> > > >
> > > >   *private* *static* *final* *long*
> > > > */serialVersionUID/* = 1L;
> > > >
> > > >
> > > >   @RequiresStableInput
> > > >
> > > >   @ProcessElement
> > > >
> > > >   *public* *void* processElement() {}
> > > >
> > > > }));
> > > >
> > > > pipeline.run();
> > > >
> > > >   }
> > > >
> > > >
> > > > It took a while to get to checkpoint 32,767, but
> > eventually it
> > > did,
> > > > and it failed with the same error I listed ab

Re: Builtin IOs - Link to Java/Pydoc instead of code?

2020-05-05 Thread Brian Hulette
+1 for doc links instead of code. I think from a user perspective the code
link is effectively the same as javadoc/pydoc since they'll just peruse the
docstrings, except it's harder to read and reflects the behavior at HEAD,
not at any release.

Brian

On Mon, May 4, 2020 at 6:21 PM Pablo Estrada  wrote:

> Hi all,
> I just noted that in our Built-in IOs page[1], we tend to link to the code
> for the IOs that we mention.
>
> I think it would be better to link to the Javadoc or the Pydoc for those
> IOs instead. Thoughts?
> Best
> -P.
>
> [1] https://beam.apache.org/documentation/io/built-in/
>


Re: Google Summer of Coding Proposal

2020-05-05 Thread Qihang Zeng
Just a quick follow-up. I wonder if I could be added to the Jira dev list.
Here is my ID: QZMark.

Much appreciated,
Qihang

On Tue, May 5, 2020 at 2:43 PM Qihang Zeng  wrote:

> Hi Beam Dev Team,
>
> I'm Qihang. I just found an acceptance letter from GSoC. I am very
> grateful and excited to be selected as a participant of Apache's summer
> projects!
>
> I look forward to dedicating to some projects in the coming months. I will
> work closely with my mentor to make sure I am ready for the proposed
> project before the summer period starts.
>
> Sincerely,
> Qihang
>
>
>
> On Mon, Mar 30, 2020 at 12:46 PM Qihang Zeng  wrote:
>
>> Dear Beam Community,
>>
>> Hello! My name is Qihang and I am a candidate for this year's google
>> summer of coding. I am a senior in CS major at New York University
>> Shanghai. Previously, I had some experience with Apache Parquet, Apache
>> Carbondata and Apache Spark. I have long been interested in participating
>> in some ASF projects, particularly databases projects.
>>
>> One of the Beam projects caught my attention and I started to contact
>> mentor Mr. Rui Wang about 2 weeks ago. Mr. Wang gave me some advice and I
>> finished a proposal which is attached below. Though it is now very close to
>> the deadline, do not hesitate to give any comments. : ) I have also shared
>> the draft through the google summer of coding portal and anyone could edit
>> it or review it.
>>
>> Sincerely,
>> Qihang
>>
>


Re: Google Summer of Coding Proposal

2020-05-05 Thread Qihang Zeng
Hi Beam Dev Team,

I'm Qihang. I just found an acceptance letter from GSoC. I am very grateful
and excited to be selected as a participant of Apache's summer projects!

I look forward to dedicating to some projects in the coming months. I will
work closely with my mentor to make sure I am ready for the proposed
project before the summer period starts.

Sincerely,
Qihang



On Mon, Mar 30, 2020 at 12:46 PM Qihang Zeng  wrote:

> Dear Beam Community,
>
> Hello! My name is Qihang and I am a candidate for this year's google
> summer of coding. I am a senior in CS major at New York University
> Shanghai. Previously, I had some experience with Apache Parquet, Apache
> Carbondata and Apache Spark. I have long been interested in participating
> in some ASF projects, particularly databases projects.
>
> One of the Beam projects caught my attention and I started to contact
> mentor Mr. Rui Wang about 2 weeks ago. Mr. Wang gave me some advice and I
> finished a proposal which is attached below. Though it is now very close to
> the deadline, do not hesitate to give any comments. : ) I have also shared
> the draft through the google summer of coding portal and anyone could edit
> it or review it.
>
> Sincerely,
> Qihang
>


Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-05 Thread Maximilian Michels
Hey Eleanore,

The change will be part of the 2.21.0 release.

-Max

On 04.05.20 19:14, Eleanore Jin wrote:
> Hi Max, 
> 
> Thanks for the information and I saw this PR is already merged, just
> wonder is it backported to the affected versions already
> (i.e. 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have
> to wait for the 2.20.1 release? 
> 
> Thanks a lot!
> Eleanore
> 
> On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels  > wrote:
> 
> Hi Eleanore,
> 
> Exactly-once is not affected but the pipeline can fail to checkpoint
> after the maximum number of state cells have been reached. We are
> working on a fix [1].
> 
> Cheers,
> Max
> 
> [1] https://github.com/apache/beam/pull/11478
> 
> On 22.04.20 07:19, Eleanore Jin wrote:
> > Hi Maxi, 
> >
> > I assume this will impact the Exactly Once Semantics that beam
> provided
> > as in the KafkaExactlyOnceSink, the processElement method is also
> > annotated with @RequiresStableInput?
> >
> > Thanks a lot!
> > Eleanore
> >
> > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels
> mailto:m...@apache.org>
> > >> wrote:
> >
> >     Hi Stephen,
> >
> >     Thanks for reporting the issue! David, good catch!
> >
> >     I think we have to resort to only using a single state cell for
> >     buffering on checkpoints, instead of using a new one for every
> >     checkpoint. I was under the assumption that, if the state cell was
> >     cleared, it would not be checkpointed but that does not seem to be
> >     the case.
> >
> >     Thanks,
> >     Max
> >
> >     On 21.04.20 09:29, David Morávek wrote:
> >     > Hi Stephen,
> >     >
> >     > nice catch and awesome report! ;) This definitely needs a
> proper fix.
> >     > I've created a new JIRA to track the issue and will try to
> resolve it
> >     > soon as this seems critical to me.
> >     >
> >     > https://issues.apache.org/jira/browse/BEAM-9794
> >     >
> >     > Thanks,
> >     > D.
> >     >
> >     > On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel
> >     mailto:stephenpate...@gmail.com>
> >
> >     >  
> >       >     >
> >     >     I was able to reproduce this in a unit test:
> >     >
> >     >         @Test
> >     >
> >     >           *public* *void* test() *throws* InterruptedException,
> >     >         ExecutionException {
> >     >
> >     >             FlinkPipelineOptions options =
> >     >       
>  PipelineOptionsFactory./as/(FlinkPipelineOptions.*class*);
> >     >
> >     >             options.setCheckpointingInterval(10L);
> >     >
> >     >             options.setParallelism(1);
> >     >
> >     >             options.setStreaming(*true*);
> >     >
> >     >             options.setRunner(FlinkRunner.*class*);
> >     >
> >     >             options.setFlinkMaster("[local]");
> >     >
> >     >             options.setStateBackend(*new*
> >     >         MemoryStateBackend(Integer.*/MAX_VALUE/*));
> >     >
> >     >             Pipeline pipeline = Pipeline./create/(options);
> >     >
> >     >             pipeline
> >     >
> >     >                 .apply(Create./of/((Void) *null*))
> >     >
> >     >                 .apply(
> >     >
> >     >                     ParDo./of/(
> >     >
> >     >                         *new* DoFn() {
> >     >
> >     >
> >     >                           *private* *static* *final* *long*
> >     >         */serialVersionUID/* = 1L;
> >     >
> >     >
> >     >                           @RequiresStableInput
> >     >
> >     >                           @ProcessElement
> >     >
> >     >                           *public* *void* processElement() {}
> >     >
> >     >                         }));
> >     >
> >     >             pipeline.run();
> >     >
> >     >           }
> >     >
> >     >
> >     >     It took a while to get to checkpoint 32,767, but
> eventually it
> >     did,
> >     >     and it failed with the same error I listed above.
> >     >
> >     >     On Thu, Apr 16, 2020 at 11:26 AM Stephen Patel
> >     >        >
> >        

Re: GSoC 2020: Congratulations, your proposal with The Apache Software Foundation has been accepted!

2020-05-05 Thread deepak kumar
I would like to contribute as well to the GSoC.
Can i work on any?

Thanks
Deepak

On Tue, May 5, 2020 at 8:37 AM John Mora  wrote:

> Hi all.
>
> My proposal for GSoC was accepted, so this summer I will be working with
> you guys in the aggregation analytics functionality of Beam. Thanks so much
> for your support during the application period, specially to my mentor Rui
> Wang.
>
> Please let me know if you have suggestions or ideas for my project.
>
> Cheers,
> John
>
> -- Forwarded message -
> De: Google Summer of Code 
> Date: lun., 4 may. 2020 a las 12:53
> Subject: GSoC 2020: Congratulations, your proposal with The Apache
> Software Foundation has been accepted!
> To: 
>
>
> [image: Google Summer of Code]
>
> Hi John Mora,
>
> Your proposal BeamSQL aggregation analytics functionality
> 
> has been accepted!
>
> Welcome to GSoC 2020!
>
> We look forward to seeing the great things you will accomplish this summer
> with The Apache Software Foundation.
>
> The next thing you need to do is read the Information for Accepted
> Students
> .
> It contains important information you need to know about your participation
> in GSoC 2020.
>
> You will receive another email in the next few days with information about
> your stipend.
>
> If you have any questions, please email the Google Summer of Code support
> team at gsoc-supp...@google.com.
>
> Have a great summer!
>
> -*Google Summer of Code team*
>
> This email was sent to jhnmora...@gmail.com.
>
> You are receiving this email because of your participation in Google
> Summer of Code 2020.
> https://summerofcode.withgoogle.com
>
> To leave the program and stop receiving all emails, you can go to your
> profile  and
> request deletion of your program profile.
>
> For any questions, please contact gsoc-supp...@google.com. Replies to
> this message go to an unmonitored mailbox.
>
> © 2020 Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
>


Re: Flink: Lost pane timing at some steps of pipeline

2020-05-05 Thread Jozef Vilcek
Then maybe the better question is, what is the behaviour / guarantee of
propagating PaneInfo between steps of the pipeline.

If I do write files like this:

PCollection> destFileNames =
windowedData.apply(FileIO.write() ...).getPerDestinationOutputFilenames

Then even if data written to files are windowed and materialised files are
associated to certain triggered panes, the `destFileNames` pcollection does
not necessarily carry such information. It is runner depended behaviour. In
older versions of Beam pane info was propagated. The reason is that
internally, WriteFiles does use Reshuffle (and many other parts of Beam
does too).
Now is this expected with respect to model and API? How does actually
paneInfo "get lost" in case of doing flink rebalance?


On Tue, May 5, 2020 at 7:39 AM David Morávek 
wrote:

> Hi Jozef, I think this is expected beahior as Flink does not use default
> expansion for Reshuffle (uses round-robin rebalance ship strategy instead).
> There is no aggregation that needs buffering (and triggering). All of the
> elements are immediately emmited to downstream operations after the
> Reshuffle.
>
> In case of direct runner, this is just a side-effect of Reshuffle
> expansion. See
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L69
> for more details.
>
> I don't think we should expect Reshuffle to have the same semantics as
> GBK, because it's only an performance optimization steps, that should not
> have any effect to pipeline's overall result. Some runners may also
> completely ignore this step as part of execution plan optimization process
> (eg. two reshuffles in a row are idempotent). (
> https://issues.apache.org/jira/browse/BEAM-9824)
>
> D.
>
> On Mon, May 4, 2020 at 2:48 PM Jozef Vilcek  wrote:
>
>> I have a pipeline which
>>
>> 1. Read from KafkaIO
>> 2. Does stuff with events and writes windowed file via FileIO
>> 3. Apply statefull DoFn on written files info
>>
>> The statefull DoFn does some logic which depends on PaneInfo.Timing, if
>> it is EARLY or something else. When testing in DirectRunner, all is good.
>> But with FlinkRunner, panes are always NO_FIRING.
>>
>> To demonstrate this, here is a dummy test pipeline:
>>
>> val testStream = sc.testStream(testStreamOf[String]
>>   .advanceWatermarkTo(new Instant(1))
>>   .addElements(goodMessage, goodMessage)
>>   .advanceWatermarkTo(new Instant(2))
>>   .addElements(goodMessage, goodMessage)
>>   .advanceWatermarkTo(new Instant(200))
>>   .addElements(goodMessage, goodMessage)
>>   .advanceWatermarkToInfinity())
>>
>> testStream
>>   .withFixedWindows(
>> duration = Duration.standardSeconds(1),
>> options = WindowOptions(
>>   trigger = AfterWatermark.pastEndOfWindow()
>> .withEarlyFirings(AfterPane.elementCountAtLeast(1))
>> .withLateFirings(AfterPane.elementCountAtLeast(1)),
>>   accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
>>   allowedLateness = Duration.standardDays(1)
>> ))
>>   .keyBy(_ => "static_key")
>>   .withPaneInfo
>>   .map { case (element, paneInfo) =>
>> println(s"#1 - $paneInfo")
>> element
>>   }
>>   //.groupByKey // <- need to uncomment this for Flink to work
>>   .applyTransform(Reshuffle.viaRandomKey())
>>   .withPaneInfo
>>   .map { case (element, paneInfo) =>
>> println(s"#2 - $paneInfo")
>> element
>>   }
>>
>> When executed with DirectRunner, #1 prints pane with UNKNOWN timing and
>> #2 with EARLY, which is what I expect. When run with Flink runner, both #1
>> and #2 writes UNKNOWN timing from PaneInfo.NO_FIRING. Only if I add extra
>> GBK, then #2 writes panes with EARLY timing.
>>
>> This is run on Beam 2.19. I was trying to analyze where could be a
>> problem but got lost. I will be happy for any suggestions or pointers. Does
>> it sounds like bug or am I doing something wrong?
>>
>


Re: GSoC 2020: Congratulations, your proposal with The Apache Software Foundation has been accepted!

2020-05-05 Thread Rui Wang
Congratulations! Welcome to Beam community!

-Rui

On Mon, May 4, 2020, 8:28 PM Kai Jiang  wrote:

> Congratulations!
>
> On Mon, May 4, 2020 at 8:07 PM John Mora  wrote:
>
>> Hi all.
>>
>> My proposal for GSoC was accepted, so this summer I will be working with
>> you guys in the aggregation analytics functionality of Beam. Thanks so much
>> for your support during the application period, specially to my mentor Rui
>> Wang.
>>
>> Please let me know if you have suggestions or ideas for my project.
>>
>> Cheers,
>> John
>>
>> -- Forwarded message -
>> De: Google Summer of Code 
>> Date: lun., 4 may. 2020 a las 12:53
>> Subject: GSoC 2020: Congratulations, your proposal with The Apache
>> Software Foundation has been accepted!
>> To: 
>>
>>
>> [image: Google Summer of Code]
>>
>> Hi John Mora,
>>
>> Your proposal BeamSQL aggregation analytics functionality
>> 
>> has been accepted!
>>
>> Welcome to GSoC 2020!
>>
>> We look forward to seeing the great things you will accomplish this
>> summer with The Apache Software Foundation.
>>
>> The next thing you need to do is read the Information for Accepted
>> Students
>> .
>> It contains important information you need to know about your participation
>> in GSoC 2020.
>>
>> You will receive another email in the next few days with information
>> about your stipend.
>>
>> If you have any questions, please email the Google Summer of Code support
>> team at gsoc-supp...@google.com.
>>
>> Have a great summer!
>>
>> -*Google Summer of Code team*
>>
>> This email was sent to jhnmora...@gmail.com.
>>
>> You are receiving this email because of your participation in Google
>> Summer of Code 2020.
>> https://summerofcode.withgoogle.com
>>
>> To leave the program and stop receiving all emails, you can go to your
>> profile  and
>> request deletion of your program profile.
>>
>> For any questions, please contact gsoc-supp...@google.com. Replies to
>> this message go to an unmonitored mailbox.
>>
>> © 2020 Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043,
>> USA
>>
>