NOTICE: New Python PreCommit jobs

2019-09-24 Thread Chad Dombrova
Hi all,
I'm working to make the CI experience with python a bit better, and my
current initiative is splitting up the giant Python PreCommit job into five
separate jobs: Lint, Py2, Py3.5, Py3.6, and Py3.7.

Around 11am Pacific time tomorrow I'm going to initiate the seed jobs, at
which point all PRs will start to run the new precommit jobs.  It's a bit
of a chicken-and-egg scenario with testing this, so there could be issues
that pop up after the seed jobs are created, but I'll be working to resolve
those issues as quickly as possible.

If you run into problems because of this change, please let me know on the
github PR.

Here's the PR: https://github.com/apache/beam/pull/9642
Here's the Jira: https://issues.apache.org/jira/browse/BEAM-8213#

The upshot is that after this is done you'll get better feedback on python
test failures!

Let me know if you have any concerns.

thanks,
chad


Re: Collecting feedback for Beam usage

2019-09-24 Thread Kenneth Knowles
Agreeing with many things here, and adding my own flavor to the points:
1. Users' privacy is more important than anything else
2. The goal should be to make things better for users
3. Trading users' opt-in for functionality (like Gradle scans) is not
acceptable
4. It should be effectively invisible to users who are not interested
5. Ideally, we could find some people with expertise in (a) data gathering,
(b) usability, (c) privacy, and (d) whatever we did not think of because it
is not our expertise, so that we have confidence that our results are
meaningful and we have done no harm.

Some obvious data biases have been mentioned. Here's some more: a lot of
Beam usage is probably through automation (such as k8s, cron, Airflow, etc)
where a user is not present when a pipeline is launched. Logging would do
nothing in these cases, except in case of a failure being debugged. I would
guess this is the common case. The more a user is actually using Beam in
production, the less likely they are watching job startup logs. Probably
many companies use Beam to build a platform for their own users, so
analytics may not capture the number of actual users in any meaningful way.
Etc.

Certainly, having a sense of the impact of changes like "deprecate Python
2" or "make breaking change to pipeline options for old FlinkRunner" would
be extremely useful, both to us and to our users. We just need to be
careful. And we must be ready to accept if this is not possible to learn in
an OK way.

I agree with Brian. Download statistics could be a good start for some
broad questions. We could consider tailoring our convenience binaries to
facilitate better data gathering, such as a separate py3 pypi coordinate.
Download stats on released container images could be a way to do this
without inconveniencing users.

Kenn

On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov 
wrote:

> Creating a central place for collecting Beam usage sounds compelling, but
> we'd have to be careful about several aspects:
> - It goes without saying that this can never be on-by-default, even for a
> tiny fraction of pipelines.
> - For further privacy protection, including the user's PipelineOptions is
> probably out of the question too: people might be including very sensitive
> data in their PipelineOptions (such as database passwords) and we wouldn't
> want to end up storing that data even due to a user's mistake. The only
> data that can be stored is data that Beam developers can guarantee is never
> sensitive, or data intentionally authored by a human for the purpose of
> reporting it to us (e.g. a hand-typed feedback message).
> - If it requires the user manually clicking the link, then it will not
> collect data about automated invocations of any pipelines, whereas likely
> almost all practical invocations are automated - the difference between
> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> - Moreover, many practical invocations likely go through an intermediate
> library / product, such as scio or talend. There'd need to be a story for
> library developers to offer this capability to their users.
> - The condition "was feedback reported for this pipeline", regardless of
> whether it is reported manually (by clicking the link) or automatically (by
> explicitly enabling some flag), heavily biases the sample - people are
> unlikely to click the link if the pipeline works fine (and almost all
> production pipelines work fine, otherwise they wouldn't be in production),
> and I don't know what considerations would prompt somebody to enable the
> flag for a periodic production pipeline. Meaning, the collected data likely
> can not be reliably used for any aggregation/counting, except for picking
> out interesting individual examples for case studies.
> - Measures should be taken to ensure that people don't accidentally enable
> it in their quick-running direct runner unit tests, causing lots of traffic.
> - I would not dismiss the possibility of spam and attacks.
>
> I'd recommend to start by listing the questions we're hoping to answer
> using the collected feedback, and then judging whether the proposed method
> indeed allows answering them while respecting the users' privacy.
>
> On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik  wrote:
>
>> One of the options could be to just display the URL and not to phone
>> home. I would like it so that users can integrate this into their
>> deployment solution so we get regular stats instead of only when a user
>> decides to run a pipeline manually.
>>
>> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
>> wrote:
>>
>>> I think the goal is to lower the barrier of entry. Displaying a URL to
>>> click on while waiting for your pipeline to start up, that contains
>>> all the data explicitly visible, is about as easy as it gets.
>>> Remembering to run a new (probably not as authentic) pipeline with
>>> that flag is less so.
>>>
>>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
>>> wrote:
>>> >
>>> > I'm with Luke on this. We 

Re: How to use a locally built worker image?

2019-09-24 Thread Thomas Weise
Hi Hannah,

I believe this is unexpected from the developer perspective. When building
something locally, we do expect that to be used. We may need to change to
not pull when the image is available locally, at least when it is a
snapshot/master branch. Release images should be immutable anyways.

Thomas


On Tue, Sep 24, 2019 at 4:13 PM Hannah Jiang  wrote:

> A minor update, with custom container, the pipeline would not fail, it
> throws out warning and moves on to `docker run` command.
>
> On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang 
> wrote:
>
>> Hi Brian
>>
>> If we pull docker images, it always downloads from remote repository,
>> which is expected behavior.
>> In case we want to run a local image and pull it only when the image is
>> not available at local, we can use `docker run` command directly, without
>> pulling it in advance. [1]
>> In case we want to pull images only when they are not available at local,
>> we can use `docker images -q` to check if images are existing at local
>> before pulling it.
>> Another option is re-tag your local image, pass your image to pipeline
>> and overwrite default one, but the code is still trying to pull, so if your
>> image is not pushed to the remote repository, it would fail.
>>
>> 1. https://github.com/docker/cli/pull/1498
>>
>> Hannah
>>
>> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette 
>> wrote:
>>
>>> I'm working on a demo cross-language pipeline on a local flink cluster
>>> that relies on my python row coder PR [1]. The PR includes some changes to
>>> the Java worker code, so I need to build a Java SDK container locally and
>>> use that in the pipeline.
>>>
>>> Unfortunately, whenever I run the pipeline,
>>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>>> to a newly downloaded image with a creation date 2 weeks ago, and that
>>> image is used instead. It looks like the reason is we run `docker pull`
>>> before running the container [2]. As the comment says this should be a
>>> no-op if the image already exists, but that doesn't seem to be the case. If
>>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>>> downloads the 2 week old image and happily informs me:
>>>
>>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>>
>>> Does anyone know how I can prevent `docker pull` from doing this? I can
>>> unblock myself for now just by commenting out the docker pull command, but
>>> I'd like to understand what is going on here.
>>>
>>> Thanks,
>>> Brian
>>>
>>> [1] https://github.com/apache/beam/pull/9188
>>> [2]
>>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>>
>>


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
Yeah, that's useful. I was asking about getting things at the jenkins
job level. E.g. are our PostCommits taking up all the time, or our
Precommits?

On Tue, Sep 24, 2019 at 1:23 PM Lukasz Cwik  wrote:
>
> We can get the per gradle task profile with the --profile flag: 
> https://jakewharton.com/static/files/trace/profile.html
> This information also appears within the build scans that are sent to Gradle.
>
> Integrating with either of these sources of information would allow us to 
> figure out whether its new tasks or old tasks taking longer.
>
> On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw  wrote:
>>
>> Does anyone know how to gather stats on where the time is being spent?
>> Several times the idea of consolidating many of the (expensive)
>> validates runner integration tests into a single pipeline, and then
>> running things individually only if that fails, has come up. I think
>> that'd be a big win if indeed this is where our time is being spent.
>>
>> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira  
>> wrote:
>> >
>> > Those ideas all sound good. I especially agree with trying to reduce tests 
>> > first and then if we've done all we can there and latency is still too 
>> > high, it means we need more workers. Also in addition to reducing the 
>> > amount of tests, there's also running less important tests less 
>> > frequently, particularly when it comes to postcommits since many of those 
>> > are resource intensive. That would require people with good context around 
>> > what our many postcommits are used for.
>> >
>> > Another idea I thought of is trying to avoid running automated tests 
>> > outside of peak coding times. Ideally, during the times when we get the 
>> > greatest amounts of PRs (and therefore precommits) we shouldn't have any 
>> > postcommits running. If we have both pre and postcommits going at the same 
>> > time during peak hours, our queue times will shoot up even if the total 
>> > amount of work doesn't change much.
>> >
>> > Btw, you mentioned that this was a problem last year. Do you have any 
>> > links to discussions about that? It seems like it could be useful.
>> >
>> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  
>> > wrote:
>> >>
>> >> Hi Daniel,
>> >>
>> >> Generally this looks feasible since jobs wait for new worker to be 
>> >> available to start.
>> >>
>> >> Over time we added more tests and did not deprecate enough, this 
>> >> increases load on workers. I wonder if we can add something like total 
>> >> runtime of all running jobs? This will be a safeguard metric that will 
>> >> show amount of time we actually run jobs. If it increases with same 
>> >> amount of workers, that will prove that we are overloading them (inverse 
>> >> is not necessarily correct).
>> >>
>> >> On addressing this, we can review approaches we took last year and see if 
>> >> any of them apply. If I do some brainstorming, following ideas come to 
>> >> mind: add more work force, reduce amount of tests, do better work on 
>> >> filtering out irrelevant tests, cancel irrelevant jobs (ie: cancel tests 
>> >> if linter fails) and/or add option for cancelling irrelevant jobs. One 
>> >> more big point can be effort on deflaking, but we seem to be decent in 
>> >> this area.
>> >>
>> >> Regards,
>> >> Mikhail.
>> >>
>> >>
>> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira  
>> >> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> A little while ago I was taking a look at the Precommit Latency metrics 
>> >>> on Grafana (link) and saw that the monthly 90th percentile metric has 
>> >>> been really increasing the past few months, from around 10 minutes to 
>> >>> currently around 30 minutes.
>> >>>
>> >>> After doing some light digging I was shown this page (beam load 
>> >>> statistics) which seems to imply that queue times are shooting up when 
>> >>> all the test executors are occupied, and it seems this is happening 
>> >>> longer and more often recently. I also took a look at the commit history 
>> >>> for our Jenkins tests and I see that new tests have steadily been added.
>> >>>
>> >>> I wanted to bring this up with the dev@ to ask:
>> >>>
>> >>> 1. Is this accurate? Can anyone provide insight into the metrics? Does 
>> >>> anyone know how to double check my assumptions with more concrete 
>> >>> metrics?
>> >>>
>> >>> 2. Does anyone have ideas on how to address this?
>> >>>
>> >>> Thanks,
>> >>> Daniel Oliveira


Re: Collecting feedback for Beam usage

2019-09-24 Thread Eugene Kirpichov
Creating a central place for collecting Beam usage sounds compelling, but
we'd have to be careful about several aspects:
- It goes without saying that this can never be on-by-default, even for a
tiny fraction of pipelines.
- For further privacy protection, including the user's PipelineOptions is
probably out of the question too: people might be including very sensitive
data in their PipelineOptions (such as database passwords) and we wouldn't
want to end up storing that data even due to a user's mistake. The only
data that can be stored is data that Beam developers can guarantee is never
sensitive, or data intentionally authored by a human for the purpose of
reporting it to us (e.g. a hand-typed feedback message).
- If it requires the user manually clicking the link, then it will not
collect data about automated invocations of any pipelines, whereas likely
almost all practical invocations are automated - the difference between
COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
- Moreover, many practical invocations likely go through an intermediate
library / product, such as scio or talend. There'd need to be a story for
library developers to offer this capability to their users.
- The condition "was feedback reported for this pipeline", regardless of
whether it is reported manually (by clicking the link) or automatically (by
explicitly enabling some flag), heavily biases the sample - people are
unlikely to click the link if the pipeline works fine (and almost all
production pipelines work fine, otherwise they wouldn't be in production),
and I don't know what considerations would prompt somebody to enable the
flag for a periodic production pipeline. Meaning, the collected data likely
can not be reliably used for any aggregation/counting, except for picking
out interesting individual examples for case studies.
- Measures should be taken to ensure that people don't accidentally enable
it in their quick-running direct runner unit tests, causing lots of traffic.
- I would not dismiss the possibility of spam and attacks.

I'd recommend starting by listing the questions we're hoping to answer
using the collected feedback, and then judging whether the proposed method
indeed allows answering them while respecting the users' privacy.

On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik  wrote:

> One of the options could be to just display the URL and not to phone home.
> I would like it so that users can integrate this into their deployment
> solution so we get regular stats instead of only when a user decides to run
> a pipeline manually.
>
> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
> wrote:
>
>> I think the goal is to lower the barrier of entry. Displaying a URL to
>> click on while waiting for your pipeline to start up, that contains
>> all the data explicitly visible, is about as easy as it gets.
>> Remembering to run a new (probably not as authentic) pipeline with
>> that flag is less so.
>>
>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
>> wrote:
>> >
>> > I'm with Luke on this. We can add a set of flags to send home stats and
>> crash dumps if user agrees. If we keep code isolated, it will be easy
>> enough for user to check what is being sent.
>> >
>> > One more heavy-weight option is to also allow user configure and
>> persist what information he is ok with sharing.
>> >
>> > --Mikhail
>> >
>> >
>> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
>> >>
>> >> Why not add a flag to the SDK that would do the phone home when
>> specified?
>> >>
>> >> From a support perspective it would be useful to know:
>> >> * SDK version
>> >> * Runner
>> >> * SDK provided PTransforms that are used
>> >> * Features like user state/timers/side inputs/splittable dofns/...
>> >> * Graph complexity (# nodes, # branches, ...)
>> >> * Pipeline failed or succeeded
>> >>
>> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
>> wrote:
>> >>> >
>> >>> > Would people actually click on that link though? I think Kyle has a
>> point that in practice users would only find and click on that link when
>> they're having some kind of issue, especially if the link has "feedback" in
>> it.
>> >>>
>> >>> I think the idea is that we would make the link very light-weight,
>> >>> kind of like a survey (but even easier as it's pre-populated).
>> >>> Basically an opt-in phone-home. If we don't collect any personal data
>> >>> (not even IP/geo, just (say) version + runner, all visible in the
>> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
>> >>> think we have to worry about spammers and ballot stuffers given the
>> >>> target audience). If we can catch people while they wait for their
>> >>> pipeline to start up (and/or complete), this is a great time to get
>> >>> some feedback.
>> >>>
>> >>> > I agree usage data would be really valuable, but I'm not sure that
>> this approach would get us good data. Is there a way to get download

Re: How to use a locally built worker image?

2019-09-24 Thread Hannah Jiang
A minor update: with a custom container, the pipeline would not fail; it
throws a warning and moves on to the `docker run` command.

On Tue, Sep 24, 2019 at 4:05 PM Hannah Jiang  wrote:

> Hi Brian
>
> If we pull docker images, it always downloads from remote repository,
> which is expected behavior.
> In case we want to run a local image and pull it only when the image is
> not available at local, we can use `docker run` command directly, without
> pulling it in advance. [1]
> In case we want to pull images only when they are not available at local,
> we can use `docker images -q` to check if images are existing at local
> before pulling it.
> Another option is re-tag your local image, pass your image to pipeline and
> overwrite default one, but the code is still trying to pull, so if your
> image is not pushed to the remote repository, it would fail.
>
> 1. https://github.com/docker/cli/pull/1498
>
> Hannah
>
> On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette 
> wrote:
>
>> I'm working on a demo cross-language pipeline on a local flink cluster
>> that relies on my python row coder PR [1]. The PR includes some changes to
>> the Java worker code, so I need to build a Java SDK container locally and
>> use that in the pipeline.
>>
>> Unfortunately, whenever I run the pipeline,
>> the apachebeam/java_sdk:latest tag is moved off of my locally built image
>> to a newly downloaded image with a creation date 2 weeks ago, and that
>> image is used instead. It looks like the reason is we run `docker pull`
>> before running the container [2]. As the comment says this should be a
>> no-op if the image already exists, but that doesn't seem to be the case. If
>> I just run `docker pull apachebeam/java_sdk:latest` on my local machine it
>> downloads the 2 week old image and happily informs me:
>>
>> Status: Downloaded newer image for apachebeam/java_sdk:latest
>>
>> Does anyone know how I can prevent `docker pull` from doing this? I can
>> unblock myself for now just by commenting out the docker pull command, but
>> I'd like to understand what is going on here.
>>
>> Thanks,
>> Brian
>>
>> [1] https://github.com/apache/beam/pull/9188
>> [2]
>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>>
>


Re: How to use a locally built worker image?

2019-09-24 Thread Hannah Jiang
Hi Brian

If we pull docker images, the pull always downloads from the remote
repository, which is the expected behavior.
If we want to run a local image and pull it only when the image is not
available locally, we can use the `docker run` command directly, without
pulling in advance. [1]
If we want to pull images only when they are not available locally, we can
use `docker images -q` to check whether the image already exists locally
before pulling it (a quick sketch of that check follows the reference below).
Another option is to re-tag your local image, pass your image to the
pipeline, and override the default one; but the code still tries to pull,
so if your image is not pushed to the remote repository, it would fail.

1. https://github.com/docker/cli/pull/1498
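
For illustration, a rough Python sketch of the check-before-pull approach
(the helper name is made up and it simply shells out to the docker CLI; it
is not existing Beam code):

    import subprocess

    def ensure_image(image: str) -> None:
        # `docker images -q IMAGE` prints the local image ID if the image
        # already exists locally, and nothing otherwise.
        local_id = subprocess.run(
            ["docker", "images", "-q", image],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if not local_id:
            # Only fall back to pulling when the image is absent locally.
            subprocess.run(["docker", "pull", image], check=True)

    ensure_image("apachebeam/java_sdk:latest")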

Hannah

On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette  wrote:

> I'm working on a demo cross-language pipeline on a local flink cluster
> that relies on my python row coder PR [1]. The PR includes some changes to
> the Java worker code, so I need to build a Java SDK container locally and
> use that in the pipeline.
>
> Unfortunately, whenever I run the pipeline, the apachebeam/java_sdk:latest
> tag is moved off of my locally built image to a newly downloaded image with
> a creation date 2 weeks ago, and that image is used instead. It looks like
> the reason is we run `docker pull` before running the container [2]. As the
> comment says this should be a no-op if the image already exists, but that
> doesn't seem to be the case. If I just run `docker pull
> apachebeam/java_sdk:latest` on my local machine it downloads the 2 week old
> image and happily informs me:
>
> Status: Downloaded newer image for apachebeam/java_sdk:latest
>
> Does anyone know how I can prevent `docker pull` from doing this? I can
> unblock myself for now just by commenting out the docker pull command, but
> I'd like to understand what is going on here.
>
> Thanks,
> Brian
>
> [1] https://github.com/apache/beam/pull/9188
> [2]
> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>


Re: How to use a locally built worker image?

2019-09-24 Thread Kyle Weaver
Looks like this is Docker's expected behavior [1]. For Python, this isn't a
problem because the default tag is `2.17.0.dev`, not `latest`. Not sure why
that's different for Java? Anyway, according to this comment:

'Note that docker pull will not *overwrite* your local image, it will
"untag" it (i.e. remove the name and :latest tag from the old image and put
it on the new one)'

I imagine you should be able to get around this without changing source
code by setting the pipeline option `--environment_config=$IMAGE_ID`, where
$IMAGE_ID is the hash of your local image.

[1] https://github.com/moby/moby/issues/10291
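
As a rough sketch of that workaround (illustrative only; how the flags reach
the runner depends on how the pipeline is launched, and
`--environment_type=DOCKER` is my assumption alongside the
`--environment_config` option mentioned above):

    import subprocess

    # Resolve the locally built image's ID so the tag-based `docker pull`
    # can't swap the image out from under us.
    image_id = subprocess.run(
        ["docker", "images", "-q", "apachebeam/java_sdk:latest"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    pipeline_args = [
        "--environment_type=DOCKER",
        f"--environment_config={image_id}",
    ]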

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Tue, Sep 24, 2019 at 11:56 AM Brian Hulette  wrote:

> I'm working on a demo cross-language pipeline on a local flink cluster
> that relies on my python row coder PR [1]. The PR includes some changes to
> the Java worker code, so I need to build a Java SDK container locally and
> use that in the pipeline.
>
> Unfortunately, whenever I run the pipeline, the apachebeam/java_sdk:latest
> tag is moved off of my locally built image to a newly downloaded image with
> a creation date 2 weeks ago, and that image is used instead. It looks like
> the reason is we run `docker pull` before running the container [2]. As the
> comment says this should be a no-op if the image already exists, but that
> doesn't seem to be the case. If I just run `docker pull
> apachebeam/java_sdk:latest` on my local machine it downloads the 2 week old
> image and happily informs me:
>
> Status: Downloaded newer image for apachebeam/java_sdk:latest
>
> Does anyone know how I can prevent `docker pull` from doing this? I can
> unblock myself for now just by commenting out the docker pull command, but
> I'd like to understand what is going on here.
>
> Thanks,
> Brian
>
> [1] https://github.com/apache/beam/pull/9188
> [2]
> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80
>


Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
One of the options could be to just display the URL and not to phone home.
I would like it so that users can integrate this into their deployment
solution so we get regular stats instead of only when a user decides to run
a pipeline manually.
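
For illustration, a rough sketch of the "display the URL, don't phone home"
idea (the endpoint and parameters are made up, borrowing Ankur's example
from earlier in the thread; nothing leaves the machine unless the user
chooses to open the link):

    from urllib.parse import urlencode

    def feedback_url(runner: str, sdk_version: str) -> str:
        # Only non-sensitive fields, all of them visible in the URL itself.
        return "http://feedback.beam.org/?" + urlencode(
            {"runner": runner, "sdk_version": sdk_version})

    # Printed at pipeline launch, never fetched by the SDK.
    print("Tell us about your setup (optional):",
          feedback_url("DirectRunner", "2.16.0"))  # example values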

On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
wrote:

> I think the goal is to lower the barrier of entry. Displaying a URL to
> click on while waiting for your pipeline to start up, that contains
> all the data explicitly visible, is about as easy as it gets.
> Remembering to run a new (probably not as authentic) pipeline with
> that flag is less so.
>
> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
> wrote:
> >
> > I'm with Luke on this. We can add a set of flags to send home stats and
> crash dumps if user agrees. If we keep code isolated, it will be easy
> enough for user to check what is being sent.
> >
> > One more heavy-weight option is to also allow user configure and persist
> what information he is ok with sharing.
> >
> > --Mikhail
> >
> >
> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
> >>
> >> Why not add a flag to the SDK that would do the phone home when
> specified?
> >>
> >> From a support perspective it would be useful to know:
> >> * SDK version
> >> * Runner
> >> * SDK provided PTransforms that are used
> >> * Features like user state/timers/side inputs/splittable dofns/...
> >> * Graph complexity (# nodes, # branches, ...)
> >> * Pipeline failed or succeeded
> >>
> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
> wrote:
> >>>
> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
> wrote:
> >>> >
> >>> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
> >>>
> >>> I think the idea is that we would make the link very light-weight,
> >>> kind of like a survey (but even easier as it's pre-populated).
> >>> Basically an opt-in phone-home. If we don't collect any personal data
> >>> (not even IP/geo, just (say) version + runner, all visible in the
> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
> >>> think we have to worry about spammers and ballot stuffers given the
> >>> target audience). If we can catch people while they wait for their
> >>> pipeline to start up (and/or complete), this is a great time to get
> >>> some feedback.
> >>>
> >>> > I agree usage data would be really valuable, but I'm not sure that
> this approach would get us good data. Is there a way to get download
> statistics for the different runner artifacts? Maybe that could be a better
> metric to compare usage.
> >>>
> >>> This'd be useful too, but hard to get and very noisy.
> >>>
> >>> >
> >>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka 
> wrote:
> >>> >>
> >>> >> I agree, these are the questions that need to be answered.
> >>> >> The data can be anonymize and stored as public data in BigQuery or
> some other place.
> >>> >>
> >>> >> The intent is to get the usage statistics so that we can get to
> know what people are using Flink or Spark etc and not intended for
> discussion or a help channel.
> >>> >> I also think that we don't need to monitor this actively as it's
> more like a survey rather than active channel to get issues resolved.
> >>> >>
> >>> >> If we think its useful for the community then we come up with the
> solution as to how can we do this (similar to how we released the container
> images).
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>> >>>
> >>> >>> There are some logistics that would need worked out. For example,
> Where would the data go? Who would own it?
> >>> >>>
> >>> >>> Also, I'm not convinced we need yet another place to discuss Beam
> when we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> are as useful for guiding the future of Beam as discussions on the
> dev/users lists, etc. as the latter likely result in more depth/specific
> feedback.
> >>> >>>
> >>> >>> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
> >>> >>>
> >>> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>> >>>
> >>> >>>
> >>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
> wrote:
> >>> 
> >>>  Hi,
> >>> 
> >>>  At the moment we don't really have a good way to collect any
> usage statistics for Apache Beam. Like runner used etc. As many of the
> users don't really have a way to report their usecase.
> >>>  How about if we 

Re: Do we know why gradle scans are not working?

2019-09-24 Thread Lukasz Cwik
Not to my knowledge. Maybe something is down.

Have you tried running a gradle build locally with --scan?

On Tue, Sep 24, 2019 at 1:03 PM Valentyn Tymofieiev 
wrote:

> For example, https://gradle.com/s/mpfu3wpz2xfwe  says: Your build scan
> could not be displayed.
>


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Lukasz Cwik
We can get the per gradle task profile with the --profile flag:
https://jakewharton.com/static/files/trace/profile.html
This information also appears within the build scans that are sent to
Gradle.

Integrating with either of these sources of information would allow us to
figure out whether it's new tasks or old tasks that are taking longer.

On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw 
wrote:

> Does anyone know how to gather stats on where the time is being spent?
> Several times the idea of consolidating many of the (expensive)
> validates runner integration tests into a single pipeline, and then
> running things individually only if that fails, has come up. I think
> that'd be a big win if indeed this is where our time is being spent.
>
> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira 
> wrote:
> >
> > Those ideas all sound good. I especially agree with trying to reduce
> tests first and then if we've done all we can there and latency is still
> too high, it means we need more workers. Also in addition to reducing the
> amount of tests, there's also running less important tests less frequently,
> particularly when it comes to postcommits since many of those are resource
> intensive. That would require people with good context around what our many
> postcommits are used for.
> >
> > Another idea I thought of is trying to avoid running automated tests
> outside of peak coding times. Ideally, during the times when we get the
> greatest amounts of PRs (and therefore precommits) we shouldn't have any
> postcommits running. If we have both pre and postcommits going at the same
> time during peak hours, our queue times will shoot up even if the total
> amount of work doesn't change much.
> >
> > Btw, you mentioned that this was a problem last year. Do you have any
> links to discussions about that? It seems like it could be useful.
> >
> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin 
> wrote:
> >>
> >> Hi Daniel,
> >>
> >> Generally this looks feasible since jobs wait for new worker to be
> available to start.
> >>
> >> Over time we added more tests and did not deprecate enough, this
> increases load on workers. I wonder if we can add something like total
> runtime of all running jobs? This will be a safeguard metric that will show
> amount of time we actually run jobs. If it increases with same amount of
> workers, that will prove that we are overloading them (inverse is not
> necessarily correct).
> >>
> >> On addressing this, we can review approaches we took last year and see
> if any of them apply. If I do some brainstorming, following ideas come to
> mind: add more work force, reduce amount of tests, do better work on
> filtering out irrelevant tests, cancel irrelevant jobs (ie: cancel tests if
> linter fails) and/or add option for cancelling irrelevant jobs. One more
> big point can be effort on deflaking, but we seem to be decent in this area.
> >>
> >> Regards,
> >> Mikhail.
> >>
> >>
> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <
> danolive...@google.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> A little while ago I was taking a look at the Precommit Latency
> metrics on Grafana (link) and saw that the monthly 90th percentile metric
> has been really increasing the past few months, from around 10 minutes to
> currently around 30 minutes.
> >>>
> >>> After doing some light digging I was shown this page (beam load
> statistics) which seems to imply that queue times are shooting up when all
> the test executors are occupied, and it seems this is happening longer and
> more often recently. I also took a look at the commit history for our
> Jenkins tests and I see that new tests have steadily been added.
> >>>
> >>> I wanted to bring this up with the dev@ to ask:
> >>>
> >>> 1. Is this accurate? Can anyone provide insight into the metrics? Does
> anyone know how to double check my assumptions with more concrete metrics?
> >>>
> >>> 2. Does anyone have ideas on how to address this?
> >>>
> >>> Thanks,
> >>> Daniel Oliveira
>


Do we know why gradle scans are not working?

2019-09-24 Thread Valentyn Tymofieiev
For example, https://gradle.com/s/mpfu3wpz2xfwe  says: Your build scan
could not be displayed.


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Robert Bradshaw
Does anyone know how to gather stats on where the time is being spent?
Several times the idea of consolidating many of the (expensive)
validates runner integration tests into a single pipeline, and then
running things individually only if that fails, has come up. I think
that'd be a big win if indeed this is where our time is being spent.

On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira  wrote:
>
> Those ideas all sound good. I especially agree with trying to reduce tests 
> first and then if we've done all we can there and latency is still too high, 
> it means we need more workers. Also in addition to reducing the amount of 
> tests, there's also running less important tests less frequently, 
> particularly when it comes to postcommits since many of those are resource 
> intensive. That would require people with good context around what our many 
> postcommits are used for.
>
> Another idea I thought of is trying to avoid running automated tests outside 
> of peak coding times. Ideally, during the times when we get the greatest 
> amounts of PRs (and therefore precommits) we shouldn't have any postcommits 
> running. If we have both pre and postcommits going at the same time during 
> peak hours, our queue times will shoot up even if the total amount of work 
> doesn't change much.
>
> Btw, you mentioned that this was a problem last year. Do you have any links 
> to discussions about that? It seems like it could be useful.
>
> On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  wrote:
>>
>> Hi Daniel,
>>
>> Generally this looks feasible since jobs wait for new worker to be available 
>> to start.
>>
>> Over time we added more tests and did not deprecate enough, this increases 
>> load on workers. I wonder if we can add something like total runtime of all 
>> running jobs? This will be a safeguard metric that will show amount of time 
>> we actually run jobs. If it increases with same amount of workers, that will 
>> prove that we are overloading them (inverse is not necessarily correct).
>>
>> On addressing this, we can review approaches we took last year and see if 
>> any of them apply. If I do some brainstorming, following ideas come to mind: 
>> add more work force, reduce amount of tests, do better work on filtering out 
>> irrelevant tests, cancel irrelevant jobs (ie: cancel tests if linter fails) 
>> and/or add option for cancelling irrelevant jobs. One more big point can be 
>> effort on deflaking, but we seem to be decent in this area.
>>
>> Regards,
>> Mikhail.
>>
>>
>> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira  
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> A little while ago I was taking a look at the Precommit Latency metrics on 
>>> Grafana (link) and saw that the monthly 90th percentile metric has been 
>>> really increasing the past few months, from around 10 minutes to currently 
>>> around 30 minutes.
>>>
>>> After doing some light digging I was shown this page (beam load statistics) 
>>> which seems to imply that queue times are shooting up when all the test 
>>> executors are occupied, and it seems this is happening longer and more 
>>> often recently. I also took a look at the commit history for our Jenkins 
>>> tests and I see that new tests have steadily been added.
>>>
>>> I wanted to bring this up with the dev@ to ask:
>>>
>>> 1. Is this accurate? Can anyone provide insight into the metrics? Does 
>>> anyone know how to double check my assumptions with more concrete metrics?
>>>
>>> 2. Does anyone have ideas on how to address this?
>>>
>>> Thanks,
>>> Daniel Oliveira


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Daniel Oliveira
Those ideas all sound good. I especially agree with trying to reduce tests
first; if we've done all we can there and latency is still too high, it
means we need more workers. In addition to reducing the number of tests,
there's also running less important tests less frequently, particularly
postcommits, since many of those are resource-intensive. That would require
people with good context around what our many postcommits are used for.

Another idea I thought of is scheduling automated tests to run outside of
peak coding times. Ideally, during the times when we get the greatest
number of PRs (and therefore precommits) we shouldn't have any postcommits
running. If we have both pre- and postcommits going at the same time during
peak hours, our queue times will shoot up even if the total amount of work
doesn't change much.

Btw, you mentioned that this was a problem last year. Do you have any links
to discussions about that? It seems like it could be useful.

On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin  wrote:

> Hi Daniel,
>
> Generally this looks feasible since jobs wait for new worker to be
> available to start.
>
> Over time we added more tests and did not deprecate enough, this increases
> load on workers. I wonder if we can add something like total runtime of all
> running jobs? This will be a safeguard metric that will show amount of time
> we actually run jobs. If it increases with same amount of workers, that
> will prove that we are overloading them (inverse is not necessarily
> correct).
>
> On addressing this, we can review approaches we took last year and see if
> any of them apply. If I do some brainstorming, following ideas come to
> mind: add more work force, reduce amount of tests, do better work on
> filtering out irrelevant tests, cancel irrelevant jobs (ie: cancel tests if
> linter fails) and/or add option for cancelling irrelevant jobs. One more
> big point can be effort on deflaking, but we seem to be decent in this area.
>
> Regards,
> Mikhail.
>
>
> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira 
> wrote:
>
>> Hi everyone,
>>
>> A little while ago I was taking a look at the Precommit Latency metrics
>> on Grafana (link
>> )
>> and saw that the monthly 90th percentile metric has been really increasing
>> the past few months, from around 10 minutes to currently around 30 minutes.
>>
>> After doing some light digging I was shown this page (beam load
>> statistics
>> ) which
>> seems to imply that queue times are shooting up when all the test executors
>> are occupied, and it seems this is happening longer and more often
>> recently. I also took a look at the commit history for our Jenkins tests
>> 
>>  and
>> I see that new tests have steadily been added.
>>
>> I wanted to bring this up with the dev@ to ask:
>>
>> 1. Is this accurate? Can anyone provide insight into the metrics? Does
>> anyone know how to double check my assumptions with more concrete metrics?
>>
>> 2. Does anyone have ideas on how to address this?
>>
>> Thanks,
>> Daniel Oliveira
>>
>


How to use a locally built worker image?

2019-09-24 Thread Brian Hulette
I'm working on a demo cross-language pipeline on a local flink cluster that
relies on my python row coder PR [1]. The PR includes some changes to the
Java worker code, so I need to build a Java SDK container locally and use
that in the pipeline.

Unfortunately, whenever I run the pipeline, the apachebeam/java_sdk:latest
tag is moved off of my locally built image to a newly downloaded image with
a creation date 2 weeks ago, and that image is used instead. It looks like
the reason is we run `docker pull` before running the container [2]. As the
comment says this should be a no-op if the image already exists, but that
doesn't seem to be the case. If I just run `docker pull
apachebeam/java_sdk:latest` on my local machine it downloads the 2 week old
image and happily informs me:

Status: Downloaded newer image for apachebeam/java_sdk:latest

Does anyone know how I can prevent `docker pull` from doing this? I can
unblock myself for now just by commenting out the docker pull command, but
I'd like to understand what is going on here.

Thanks,
Brian

[1] https://github.com/apache/beam/pull/9188
[2]
https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/DockerCommand.java#L80


Re: Collecting feedback for Beam usage

2019-09-24 Thread Mikhail Gryzykhin
I'm with Luke on this. We can add a set of flags to send home stats and
crash dumps if the user agrees. If we keep the code isolated, it will be
easy enough for the user to check what is being sent.

One more heavy-weight option is to also allow the user to configure and
persist what information they are ok with sharing.

--Mikhail


On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:

> Why not add a flag to the SDK that would do the phone home when specified?
>
> From a support perspective it would be useful to know:
> * SDK version
> * Runner
> * SDK provided PTransforms that are used
> * Features like user state/timers/side inputs/splittable dofns/...
> * Graph complexity (# nodes, # branches, ...)
> * Pipeline failed or succeeded
>
> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
> wrote:
>
>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
>> wrote:
>> >
>> > Would people actually click on that link though? I think Kyle has a
>> point that in practice users would only find and click on that link when
>> they're having some kind of issue, especially if the link has "feedback" in
>> it.
>>
>> I think the idea is that we would make the link very light-weight,
>> kind of like a survey (but even easier as it's pre-populated).
>> Basically an opt-in phone-home. If we don't collect any personal data
>> (not even IP/geo, just (say) version + runner, all visible in the
>> URL), no need to guard/anonymize (and this may be sufficient--I don't
>> think we have to worry about spammers and ballot stuffers given the
>> target audience). If we can catch people while they wait for their
>> pipeline to start up (and/or complete), this is a great time to get
>> some feedback.
>>
>> > I agree usage data would be really valuable, but I'm not sure that this
>> approach would get us good data. Is there a way to get download statistics
>> for the different runner artifacts? Maybe that could be a better metric to
>> compare usage.
>>
>> This'd be useful too, but hard to get and very noisy.
>>
>> >
>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
>> >>
>> >> I agree, these are the questions that need to be answered.
>> >> The data can be anonymize and stored as public data in BigQuery or
>> some other place.
>> >>
>> >> The intent is to get the usage statistics so that we can get to know
>> what people are using Flink or Spark etc and not intended for discussion or
>> a help channel.
>> >> I also think that we don't need to monitor this actively as it's more
>> like a survey rather than active channel to get issues resolved.
>> >>
>> >> If we think its useful for the community then we come up with the
>> solution as to how can we do this (similar to how we released the container
>> images).
>> >>
>> >>
>> >>
>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
>> wrote:
>> >>>
>> >>> There are some logistics that would need worked out. For example,
>> Where would the data go? Who would own it?
>> >>>
>> >>> Also, I'm not convinced we need yet another place to discuss Beam
>> when we already have discussed the challenge of simultaneously monitoring
>> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
>> certainly an interesting question, and I'd be curious to know that >= X
>> many people use a certain runner, I'm not sure answers to these questions
>> are as useful for guiding the future of Beam as discussions on the
>> dev/users lists, etc. as the latter likely result in more depth/specific
>> feedback.
>> >>>
>> >>> However, I do think it could be useful in general to include links
>> directly in the console output. For example, maybe something along the
>> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
>> mailing list."
>> >>>
>> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>> >>>
>> >>>
>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
>> wrote:
>> 
>>  Hi,
>> 
>>  At the moment we don't really have a good way to collect any usage
>> statistics for Apache Beam. Like runner used etc. As many of the users
>> don't really have a way to report their usecase.
>>  How about if we create a feedback page where users can add their
>> pipeline details and usecase.
>>  Also, we can start printing the link to this page when user launch
>> the pipeline in the command line.
>>  Example:
>>  $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>> 
>>  Starting pipeline
>>  Please use
>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>>  Pipeline started
>>  ..
>> 
>>  Using a link and not publishing the data automatically will give
>> user control over what they publish and what they don't. We can enhance the
>> text and usage further but the basic idea is to ask for user feeback at
>> each run of the pipeline.
>>  Let me know what you think.
>> 
>> 
>>  Thanks,
>>  Ankur
>>
>


Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
Why not add a flag to the SDK that would do the phone home when specified?

From a support perspective it would be useful to know:
* SDK version
* Runner
* SDK provided PTransforms that are used
* Features like user state/timers/side inputs/splittable dofns/...
* Graph complexity (# nodes, # branches, ...)
* Pipeline failed or succeeded
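
As a rough sketch, an opt-in flag and the payload it might report could look
something like this (the flag and field names are made up, not an existing
Beam API; shown only to make the shape of the data concrete):

    import json
    from apache_beam.options.pipeline_options import PipelineOptions

    class UsageReportOptions(PipelineOptions):
        """Hypothetical opt-in flag; off unless the user passes --report_usage."""
        @classmethod
        def _add_argparse_args(cls, parser):
            parser.add_argument("--report_usage", action="store_true", default=False)

    def usage_payload(sdk_version, runner, transforms, num_nodes, succeeded):
        # Only the support-oriented, non-sensitive fields listed above,
        # never the raw PipelineOptions.
        return json.dumps({
            "sdk_version": sdk_version,
            "runner": runner,
            "sdk_transforms": sorted(transforms),
            "graph_nodes": num_nodes,
            "succeeded": succeeded,
        })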

On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw  wrote:

> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette  wrote:
> >
> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
>
> I think the idea is that we would make the link very light-weight,
> kind of like a survey (but even easier as it's pre-populated).
> Basically an opt-in phone-home. If we don't collect any personal data
> (not even IP/geo, just (say) version + runner, all visible in the
> URL), no need to guard/anonymize (and this may be sufficient--I don't
> think we have to worry about spammers and ballot stuffers given the
> target audience). If we can catch people while they wait for their
> pipeline to start up (and/or complete), this is a great time to get
> some feedback.
>
> > I agree usage data would be really valuable, but I'm not sure that this
> approach would get us good data. Is there a way to get download statistics
> for the different runner artifacts? Maybe that could be a better metric to
> compare usage.
>
> This'd be useful too, but hard to get and very noisy.
>
> >
> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
> >>
> >> I agree, these are the questions that need to be answered.
> >> The data can be anonymize and stored as public data in BigQuery or some
> other place.
> >>
> >> The intent is to get the usage statistics so that we can get to know
> what people are using Flink or Spark etc and not intended for discussion or
> a help channel.
> >> I also think that we don't need to monitor this actively as it's more
> like a survey rather than active channel to get issues resolved.
> >>
> >> If we think its useful for the community then we come up with the
> solution as to how can we do this (similar to how we released the container
> images).
> >>
> >>
> >>
> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>>
> >>> There are some logistics that would need worked out. For example,
> Where would the data go? Who would own it?
> >>>
> >>> Also, I'm not convinced we need yet another place to discuss Beam when
> we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> are as useful for guiding the future of Beam as discussions on the
> dev/users lists, etc. as the latter likely result in more depth/specific
> feedback.
> >>>
> >>> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
> >>>
> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
> wrote:
> 
>  Hi,
> 
>  At the moment we don't really have a good way to collect any usage
> statistics for Apache Beam. Like runner used etc. As many of the users
> don't really have a way to report their usecase.
>  How about if we create a feedback page where users can add their
> pipeline details and usecase.
>  Also, we can start printing the link to this page when user launch
> the pipeline in the command line.
>  Example:
>  $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
> 
>  Starting pipeline
>  Please use
> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>  Pipeline started
>  ..
> 
>  Using a link and not publishing the data automatically will give user
> control over what they publish and what they don't. We can enhance the text
> and usage further but the basic idea is to ask for user feeback at each run
> of the pipeline.
>  Let me know what you think.
> 
> 
>  Thanks,
>  Ankur
>


Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-24 Thread Alexey Romanenko
I don’t see any updates on my calendar.  Does it work for others?

> On 19 Sep 2019, at 17:16, Ismaël Mejía  wrote:
> 
> 25/09 looks ok. I just updated the meeting invitation to the new
> date.I will prepare a mini agenda in the shared minute document in the
> meantime.
> I cannot see the old invitees, can someone please confirm me they see
> the date updated.
> Thanks,
> Ismaël
> 
> On Thu, Sep 19, 2019 at 2:13 PM Etienne Chauchot  wrote:
>> 
>> Hi Rahul and Xinyu,
>> I just added you to the list of guests in the meeting. Time is 5pm GMT +2.
>> That being said, for some reason last meeting scheduled was 08/28. Ismael 
>> initially created the meeting, I do not have the rights to add a new date. 
>> Ismael can you add a date ? I suggest 09/25. WDYT ?
>> 
>> Best
>> Etienne
>> 
>> Le jeudi 19 septembre 2019 à 00:49 +0530, rahul patwari a écrit :
>> 
>> Hi,
>> 
>> I would love to join the call.
>> Can you also share the meeting invitation with me?
>> 
>> Thanks,
>> Rahul
>> 
>> On Wed 18 Sep, 2019, 11:48 PM Xinyu Liu,  wrote:
>> 
>> Alexey and Etienne: I'm very happy to join the sync-up meeting. Please 
>> forward the meeting info to me. I am based in California, US and hopefully 
>> the time will work :).
>> 
>> Thanks,
>> Xinyu
>> 
>> On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot  
>> wrote:
>> 
>> Hi Xinyu,
>> 
>> Thanks for offering help ! My comments are inline:
>> 
>> Le vendredi 13 septembre 2019 à 12:16 -0700, Xinyu Liu a écrit :
>> 
>> Hi, Etienne,
>> 
>> The slides are very informative! Thanks for sharing the details about how 
>> the Beam API are mapped into Spark Structural Streaming.
>> 
>> 
>> Thanks !
>> 
>> We (LinkedIn) are also interested in trying the new SparkRunner to run Beam 
>> pipeline in batch, and contribute to it too. From my understanding, it seems the
>> functionality on batch side is mostly complete and covers quite a large 
>> percentage of the tests (a few missing pieces like state and timer in ParDo 
>> and SDF).
>> 
>> 
>> Correct, it passes 89% of the tests, but there is more than SDF, state and 
>> timer missing, there is also ongoing encoders work that I would like to 
>> commit/push before merging.
>> 
>> If so, is it possible to merge the new runner sooner into master so it's 
>> much easier for us to pull it in (we have an internal fork) and contribute 
>> back?
>> 
>> 
>> Sure, see my other mail on this thread. As Alexey mentioned, please join the 
>> sync meeting we have, the more the merrier !
>> 
>> 
>> Also curious about the scheme part in the runner. Seems we can leverage the 
>> schema-aware work in PCollection and translate from Beam schema to Spark, so 
>> it can be optimized in the planner layer. It will be great to hear back your 
>> plans on that.
>> 
>> 
>> Well, it is not designed yet but, if you remember my talk, we need to store 
>> beam windowing information with the data itself, so ending up having a 
>> dataset . One lead that was discussed is to store it as a 
>> Spark schema such as this:
>> 
>> 1. field1: binary data for beam windowing information (cannot be mapped to 
>> fields because beam windowing info is complex structure)
>> 
>> 2. fields of data as defined in the Beam schema if there is one
>> 
>> 
>> Congrats on this great work!
>> 
>> Thanks !
>> 
>> Best,
>> 
>> Etienne
>> 
>> Thanks,
>> Xinyu
>> 
>> On Wed, Sep 11, 2019 at 6:02 PM Rui Wang  wrote:
>> 
>> Hello Etienne,
>> 
>> Your slide mentioned that streaming mode development is blocked because 
>> Spark lacks supporting multiple-aggregations in its streaming mode but 
>> design is ongoing. Do you have a link or something else to their design 
>> discussion/doc?
>> 
>> 
>> -Rui
>> 
>> On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot  
>> wrote:
>> 
>> Hi Rahul,
>> Sure, and great ! Thanks for proposing !
>> If you want details, here is the presentation I did 30 mins ago at the 
>> apachecon. You will find the video on youtube shortly but in the meantime, 
>> here is my presentation slides.
>> 
>> And here is the structured streaming branch. I'll be happy to review your 
>> PRs, thanks !
>> 
>> https://github.com/apache/beam/tree/spark-runner_structured-streaming
>> 
>> Best
>> Etienne
>> 
>> Le mercredi 11 septembre 2019 à 16:37 +0530, rahul patwari a écrit :
>> 
>> Hi Etienne,
>> 
>> I came to know about the work going on in Structured Streaming Spark Runner 
>> from Apache Beam Wiki - Works in Progress.
>> I have contributed to BeamSql earlier. And I am working on supporting 
>> PCollectionView in BeamSql.
>> 
>> I would love to understand the Runner's side of Apache Beam and contribute 
>> to the Structured Streaming Spark Runner.
>> 
>> Can you please point me in the right direction?
>> 
>> Thanks,
>> Rahul



Re: contributor permission for Beam Jira tickets

2019-09-24 Thread Ismaël Mejía
Done, Welcome!

On Tue, Sep 24, 2019 at 11:17 AM Kishor Joshi  wrote:
>
> Hi,
>
> This is kishor joshi.
> Can someone add me as a contributor for Beam's Jira issue tracker? I would 
> like to create/assign tickets for my work.
>
> My JIRA id is kishorjoshi.
>
> Thanks & regards,
> Kishor
>


Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-24 Thread Vishwas Bm
Hi Ismael,

Can you also add me to this meeting? I would also like to contribute.

Regards,
Vishwas

On 2019/09/19 12:13:40, Etienne Chauchot  wrote: 
> Hi Rahul and Xinyu, I just added you to the list of guests in the meeting.
> Time is 5pm GMT +2. That being said, for some
> reason last meeting scheduled was 08/28. Ismael initially created the
> meeting, I do not have the rights to add a new
> date. Ismael can you add a date ? I suggest 09/25. WDYT ?
> Best,
> Etienne
> Le jeudi 19 septembre 2019 à 00:49 +0530, rahul patwari a écrit :
> > Hi, 
> > I would love to join the call. 
> > Can you also share the meeting invitation with me?
> > 
> > Thanks,
> > Rahul
> > On Wed 18 Sep, 2019, 11:48 PM Xinyu Liu,  wrote:
> > > Alexey and Etienne: I'm very happy to join the sync-up meeting. Please 
> > > forward the meeting info to me. I am based in
> > > California, US and hopefully the time will work :).
> > > Thanks,
> > > Xinyu
> > > On Wed, Sep 18, 2019 at 6:39 AM Etienne Chauchot  
> > > wrote:
> > > > Hi Xinyu,
> > > > Thanks for offering help ! My comments are inline:
> > > > Le vendredi 13 septembre 2019 à 12:16 -0700, Xinyu Liu a écrit :
> > > > > Hi, Etienne,
> > > > > The slides are very informative! Thanks for sharing the details about 
> > > > > how the Beam API are mapped into Spark
> > > > > Structural Streaming. 
> > > > 
> > > > Thanks !
> > > > > We (LinkedIn) are also interested in trying the new SparkRunner to 
> > > > > run Beam pipeline in batch, and contribute to
> > > > > it too. From my understanding, seems the functionality on batch side 
> > > > > is mostly complete and covers quite a large
> > > > > percentage of the tests (a few missing pieces like state and timer in 
> > > > > ParDo and SDF). 
> > > > 
> > > > Correct, it passes 89% of the tests, but there is more than SDF, state 
> > > > and timer missing, there is also ongoing
> > > > encoders work that I would like to commit/push before merging.
> > > > > If so, is it possible to merge the new runner sooner into master so 
> > > > > it's much easier for us to pull it in (we
> > > > > have an internal fork) and contribute back?
> > > > 
> > > > Sure, see my other mail on this thread. As Alexey mentioned, please 
> > > > join the sync meeting we have, the more the
> > > > merrier !
> > > > > Also curious about the scheme part in the runner. Seems we can 
> > > > > leverage the schema-aware work in PCollection and
> > > > > translate from Beam schema to Spark, so it can be optimized in the 
> > > > > planner layer. It will be great to hear back
> > > > > your plans on that.
> > > > 
> > > > Well, it is not designed yet but, if you remember my talk, we need to 
> > > > store beam windowing information with the
> > > > data itself, so ending up having a dataset . One lead 
> > > > that was discussed is to store it as a Spark
> > > > schema such as this:
> > > > 1. field1: binary data for beam windowing information (cannot be mapped 
> > > > to fields  because beam windowing info is
> > > > complex structure)
> > > > 2. fields of data as defined in the Beam schema if there is one 
> > > > 
> > > > > Congrats on this great work!
> > > > Thanks !
> > > > Best,
> > > > Etienne
> > > > > Thanks,
> > > > > Xinyu
> > > > > On Wed, Sep 11, 2019 at 6:02 PM Rui Wang  wrote:
> > > > > > Hello Etienne,
> > > > > > Your slide mentioned that streaming mode development is blocked 
> > > > > > because Spark lacks supporting multiple-
> > > > > > aggregations in its streaming mode but design is ongoing. Do you 
> > > > > > have a link or something else to their design
> > > > > > discussion/doc?
> > > > > > 
> > > > > > 
> > > > > > -Rui  
> > > > > > On Wed, Sep 11, 2019 at 5:10 PM Etienne Chauchot 
> > > > > >  wrote:
> > > > > > > Hi Rahul, sure, and great! Thanks for proposing! If you want
> > > > > > > details, here is the presentation I did 30 mins
> > > > > > > ago at the apachecon. You will find the video on youtube shortly
> > > > > > > but in the meantime, here are my
> > > > > > > presentation slides.
> > > > > > > And here is the structured streaming branch. I'll be happy to
> > > > > > > review your PRs, thanks!
> > > > > > > https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > > > > > > Best,
> > > > > > > Etienne
> > > > > > > Le mercredi 11 septembre 2019 à 16:37 +0530, rahul patwari a 
> > > > > > > écrit :
> > > > > > > > Hi Etienne,
> > > > > > > > 
> > > > > > > > I came to know about the work going on in Structured Streaming 
> > > > > > > > Spark Runner from Apache Beam Wiki - Works
> > > > > > > > in Progress.
> > > > > > > > I have contributed to BeamSql earlier. And I am working on 
> > > > > > > > supporting PCollectionView in BeamSql.
> > > > > > > > 
> > > > > > > > I would love to understand the Runner's side of Apache Beam and 
> > > > > > > > contribute to the Structured Streaming
> > > > > > > > Spark Runner.
> > > > > > > > 
> > > > > > > > Can you please point me in the right direction?
> > > > > > > > 
> > > > > 

contributor permission for Beam Jira tickets

2019-09-24 Thread Kishor Joshi
Hi,
This is Kishor Joshi.
Can someone add me as a contributor for Beam's Jira issue tracker? I would
like to create/assign tickets for my work.

My JIRA id is kishorjoshi.

Thanks & regards,
Kishor