Re: Collecting feedback for Beam usage

2019-09-26 Thread Kenneth Knowles
Ah, I didn't realize pypi was already collecting py2 vs py3. That saves
having to split artifacts.

Kenn

On Thu, Sep 26, 2019 at 5:03 PM Robert Bradshaw  wrote:

> Pypi download statistics are freely available at
> https://pypistats.org/packages/apache-beam . (To answer the original
> question, nearly all Python 2 at this point, but starting to show a
> drop.)
>
> I think the goal is to get more/orthogonal coverage than a twitter
> poll or waiting for users to speak up on the lists. Getting accurate
> stats (without violating many of the principles we all find valuable)
> would be much more difficult, if even possible. In this sense, the
> bias against a large number of production/automated runs doesn't hurt
> the goal of capturing the attention (needed if it's opt-in) of a large
> number of developers.
>
> On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles  wrote:
> >
> > Agreeing with many things here, and adding my own flavor to the points:
> > 1. User's privacy is more important than anything else
> > 2. The goal should be to make things better for users
> > 3. Trading user's opt-in for functionality (like Gradle scans) is not
> acceptable
> > 4. It should be effectively invisible to users who are not interested
> > 5. Ideally, we could find some people with expertise in (a) data
> gathering (b) usability (c) privacy (d) whatever we did not think of
> because it is not our expertise. So that we have confidence that our
> results are meaningful and we have done no harm.
> >
> > Some obvious data biases have been mentioned. Here's some more: a lot of
> Beam usage is probably through automation (such as k8s, cron, Airflow, etc)
> where a user is not present when a pipeline is launched. Logging would do
> nothing in these cases, except in case of a failure being debugged. I would
> guess this is the common case. The more a user is actually using Beam in
> production, the less likely they are watching job startup logs. Probably
> many companies use Beam to build a platform for their own users, so
> analytics may not capture the number of actual users in any meaningful way.
> Etc.
> >
> > Certainly, having a sense of the impact of changes like "deprecate
> Python 2" or "make breaking change to pipeline options for old FlinkRunner"
> would be extremely useful, both to us and to our users. We just need to be
> careful. And we must be ready to accept if this is not possible to learn in
> an OK way.
> >
> > I agree with Brian. Download statistics could be a good start for some
> broad questions. We could consider tailoring our convenience binaries to
> facilitate better data gathering, such as a separate py3 pypi coordinate.
> Download stats on released container images could be a way to do this
> without inconveniencing users.
> >
> > Kenn
> >
> > On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov 
> wrote:
> >>
> >> Creating a central place for collecting Beam usage sounds compelling,
> but we'd have to be careful about several aspects:
> >> - It goes without saying that this can never be on-by-default, even for
> a tiny fraction of pipelines.
> >> - For further privacy protection, including the user's PipelineOptions
> is probably out of the question too: people might be including very
> sensitive data in their PipelineOptions (such as database passwords) and we
> wouldn't want to end up storing that data even due to a user's mistake. The
> only data that can be stored is data that Beam developers can guarantee is
> never sensitive, or data intentionally authored by a human for the purpose
> of reporting it to us (e.g. a hand-typed feedback message).
> >> - If it requires the user manually clicking the link, then it will not
> collect data about automated invocations of any pipelines, whereas likely
> almost all practical invocations are automated - the difference between
> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> >> - Moreover, many practical invocations likely go through an
> intermediate library / product, such as scio or talend. There'd need to be
> a story for library developers to offer this capability to their users.
> >> - The condition "was feedback reported for this pipeline", regardless
> of whether it is reported manually (by clicking the link) or automatically
> (by explicitly enabling some flag), heavily biases the sample - people are
> unlikely to click the link if the pipeline works fine (and almost all
> production pipelines work fine, otherwise they wouldn't be in production),
> and I don't know what considerations would prompt somebody to enable the
> flag for a periodic production pipeline. Meaning, the collected data likely
> can not be reliably used for any aggregation/counting, except for picking
> out interesting individual examples for case studies.
> >> - Measures should be taken to ensure that people don't accidentally
> enable it in their quick-running direct runner unit tests, causing lots of
> traffic.
> >> - I would not dismiss the possibility of spam and attacks.
> >>
> >> 

Re: Collecting feedback for Beam usage

2019-09-26 Thread Robert Bradshaw
Pypi download statistics are freely available at
https://pypistats.org/packages/apache-beam . (To answer the original
question, nearly all Python 2 at this point, but starting to show a
drop.)
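As a hedged sketch of how one might summarize those public stats: pypistats.org exposes per-Python-version download counts as JSON (the exact endpoint and response shape are assumptions here, not verified against the live service). The aggregation below runs against a hard-coded sample payload in that assumed shape.

```python
# Hypothetical sketch: summarizing py2 vs py3 download share from a
# pypistats.org-style JSON payload. A live call might be
# GET https://pypistats.org/api/packages/apache-beam/python_major,
# but the endpoint and field names are assumptions for illustration.
import json
from collections import defaultdict

sample_response = json.dumps({
    "package": "apache-beam",
    "data": [
        {"category": "2.7", "date": "2019-09-25", "downloads": 9000},
        {"category": "3.5", "date": "2019-09-25", "downloads": 400},
        {"category": "3.6", "date": "2019-09-25", "downloads": 600},
        {"category": "null", "date": "2019-09-25", "downloads": 1000},
    ],
})

def major_version_share(payload: str) -> dict:
    """Aggregate downloads by Python major version, skipping downloads
    whose client did not report a Python version at all."""
    totals = defaultdict(int)
    for row in json.loads(payload)["data"]:
        category = row["category"]
        if category == "null":  # no Python version reported
            continue
        totals[category.split(".")[0]] += row["downloads"]
    grand = sum(totals.values())
    return {major: count / grand for major, count in totals.items()}

print(major_version_share(sample_response))  # {'2': 0.9, '3': 0.1}
```

On this sample the Python 2 share dominates, matching the observation above; on real data the same aggregation would show the drop over time.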

I think the goal is to get more/orthogonal coverage than a twitter
poll or waiting for users to speak up on the lists. Getting accurate
stats (without violating many of the principles we all find valuable)
would be much more difficult, if even possible. In this sense, the
bias against a large number of production/automated runs doesn't hurt
the goal of capturing the attention (needed if it's opt-in) of a large
number of developers.

On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles  wrote:
>
> Agreeing with many things here, and adding my own flavor to the points:
> 1. User's privacy is more important than anything else
> 2. The goal should be to make things better for users
> 3. Trading user's opt-in for functionality (like Gradle scans) is not 
> acceptable
> 4. It should be effectively invisible to users who are not interested
> 5. Ideally, we could find some people with expertise in (a) data gathering 
> (b) usability (c) privacy (d) whatever we did not think of because it is not 
> our expertise. So that we have confidence that our results are meaningful and 
> we have done no harm.
>
> Some obvious data biases have been mentioned. Here's some more: a lot of Beam 
> usage is probably through automation (such as k8s, cron, Airflow, etc) where 
> a user is not present when a pipeline is launched. Logging would do nothing 
> in these cases, except in case of a failure being debugged. I would guess 
> this is the common case. The more a user is actually using Beam in 
> production, the less likely they are watching job startup logs. Probably many 
> companies use Beam to build a platform for their own users, so analytics may 
> not capture the number of actual users in any meaningful way. Etc.
>
> Certainly, having a sense of the impact of changes like "deprecate Python 2" 
> or "make breaking change to pipeline options for old FlinkRunner" would be 
> extremely useful, both to us and to our users. We just need to be careful. 
> And we must be ready to accept if this is not possible to learn in an OK way.
>
> I agree with Brian. Download statistics could be a good start for some broad 
> questions. We could consider tailoring our convenience binaries to facilitate 
> better data gathering, such as a separate py3 pypi coordinate. Download stats 
> on released container images could be a way to do this without 
> inconveniencing users.
>
> Kenn
>
> On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov  wrote:
>>
>> Creating a central place for collecting Beam usage sounds compelling, but 
>> we'd have to be careful about several aspects:
>> - It goes without saying that this can never be on-by-default, even for a 
>> tiny fraction of pipelines.
>> - For further privacy protection, including the user's PipelineOptions is 
>> probably out of the question too: people might be including very sensitive 
>> data in their PipelineOptions (such as database passwords) and we wouldn't 
>> want to end up storing that data even due to a user's mistake. The only data 
>> that can be stored is data that Beam developers can guarantee is never 
>> sensitive, or data intentionally authored by a human for the purpose of 
>> reporting it to us (e.g. a hand-typed feedback message).
>> - If it requires the user manually clicking the link, then it will not 
>> collect data about automated invocations of any pipelines, whereas likely 
>> almost all practical invocations are automated - the difference between 
>> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
>> - Moreover, many practical invocations likely go through an intermediate 
>> library / product, such as scio or talend. There'd need to be a story for 
>> library developers to offer this capability to their users.
>> - The condition "was feedback reported for this pipeline", regardless of 
>> whether it is reported manually (by clicking the link) or automatically (by 
>> explicitly enabling some flag), heavily biases the sample - people are 
>> unlikely to click the link if the pipeline works fine (and almost all 
>> production pipelines work fine, otherwise they wouldn't be in production), 
>> and I don't know what considerations would prompt somebody to enable the 
>> flag for a periodic production pipeline. Meaning, the collected data likely 
>> can not be reliably used for any aggregation/counting, except for picking 
>> out interesting individual examples for case studies.
>> - Measures should be taken to ensure that people don't accidentally enable 
>> it in their quick-running direct runner unit tests, causing lots of traffic.
>> - I would not dismiss the possibility of spam and attacks.
>>
>> I'd recommend to start by listing the questions we're hoping to answer using 
>> the collected feedback, and then judging whether the proposed method indeed 
>> allows answering them while 

Re: Collecting feedback for Beam usage

2019-09-24 Thread Kenneth Knowles
Agreeing with many things here, and adding my own flavor to the points:
1. User's privacy is more important than anything else
2. The goal should be to make things better for users
3. Trading user's opt-in for functionality (like Gradle scans) is not
acceptable
4. It should be effectively invisible to users who are not interested
5. Ideally, we could find some people with expertise in (a) data gathering
(b) usability (c) privacy (d) whatever we did not think of because it is
not our expertise. So that we have confidence that our results are
meaningful and we have done no harm.

Some obvious data biases have been mentioned. Here's some more: a lot of
Beam usage is probably through automation (such as k8s, cron, Airflow, etc)
where a user is not present when a pipeline is launched. Logging would do
nothing in these cases, except in case of a failure being debugged. I would
guess this is the common case. The more a user is actually using Beam in
production, the less likely they are watching job startup logs. Probably
many companies use Beam to build a platform for their own users, so
analytics may not capture the number of actual users in any meaningful way.
Etc.

Certainly, having a sense of the impact of changes like "deprecate Python
2" or "make breaking change to pipeline options for old FlinkRunner" would
be extremely useful, both to us and to our users. We just need to be
careful. And we must be ready to accept if this is not possible to learn in
an OK way.

I agree with Brian. Download statistics could be a good start for some
broad questions. We could consider tailoring our convenience binaries to
facilitate better data gathering, such as a separate py3 pypi coordinate.
Download stats on released container images could be a way to do this
without inconveniencing users.

Kenn

On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov 
wrote:

> Creating a central place for collecting Beam usage sounds compelling, but
> we'd have to be careful about several aspects:
> - It goes without saying that this can never be on-by-default, even for a
> tiny fraction of pipelines.
> - For further privacy protection, including the user's PipelineOptions is
> probably out of the question too: people might be including very sensitive
> data in their PipelineOptions (such as database passwords) and we wouldn't
> want to end up storing that data even due to a user's mistake. The only
> data that can be stored is data that Beam developers can guarantee is never
> sensitive, or data intentionally authored by a human for the purpose of
> reporting it to us (e.g. a hand-typed feedback message).
> - If it requires the user manually clicking the link, then it will not
> collect data about automated invocations of any pipelines, whereas likely
> almost all practical invocations are automated - the difference between
> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> - Moreover, many practical invocations likely go through an intermediate
> library / product, such as scio or talend. There'd need to be a story for
> library developers to offer this capability to their users.
> - The condition "was feedback reported for this pipeline", regardless of
> whether it is reported manually (by clicking the link) or automatically (by
> explicitly enabling some flag), heavily biases the sample - people are
> unlikely to click the link if the pipeline works fine (and almost all
> production pipelines work fine, otherwise they wouldn't be in production),
> and I don't know what considerations would prompt somebody to enable the
> flag for a periodic production pipeline. Meaning, the collected data likely
> can not be reliably used for any aggregation/counting, except for picking
> out interesting individual examples for case studies.
> - Measures should be taken to ensure that people don't accidentally enable
> it in their quick-running direct runner unit tests, causing lots of traffic.
> - I would not dismiss the possibility of spam and attacks.
>
> I'd recommend to start by listing the questions we're hoping to answer
> using the collected feedback, and then judging whether the proposed method
> indeed allows answering them while respecting the users' privacy.
>
> On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik  wrote:
>
>> One of the options could be to just display the URL and not to phone
>> home. I would like it so that users can integrate this into their
>> deployment solution so we get regular stats instead of only when a user
>> decides to run a pipeline manually.
>>
>> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
>> wrote:
>>
>>> I think the goal is to lower the barrier of entry. Displaying a URL to
>>> click on while waiting for your pipeline to start up, that contains
>>> all the data explicitly visible, is about as easy as it gets.
>>> Remembering to run a new (probably not as authentic) pipeline with
>>> that flag is less so.
>>>
>>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
>>> wrote:
>>> >
>>> > I'm with Luke on this. We 

Re: Collecting feedback for Beam usage

2019-09-24 Thread Eugene Kirpichov
Creating a central place for collecting Beam usage sounds compelling, but
we'd have to be careful about several aspects:
- It goes without saying that this can never be on-by-default, even for a
tiny fraction of pipelines.
- For further privacy protection, including the user's PipelineOptions is
probably out of the question too: people might be including very sensitive
data in their PipelineOptions (such as database passwords) and we wouldn't
want to end up storing that data even due to a user's mistake. The only
data that can be stored is data that Beam developers can guarantee is never
sensitive, or data intentionally authored by a human for the purpose of
reporting it to us (e.g. a hand-typed feedback message).
- If it requires the user manually clicking the link, then it will not
collect data about automated invocations of any pipelines, whereas likely
almost all practical invocations are automated - the difference between
COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
- Moreover, many practical invocations likely go through an intermediate
library / product, such as scio or talend. There'd need to be a story for
library developers to offer this capability to their users.
- The condition "was feedback reported for this pipeline", regardless of
whether it is reported manually (by clicking the link) or automatically (by
explicitly enabling some flag), heavily biases the sample - people are
unlikely to click the link if the pipeline works fine (and almost all
production pipelines work fine, otherwise they wouldn't be in production),
and I don't know what considerations would prompt somebody to enable the
flag for a periodic production pipeline. Meaning, the collected data likely
can not be reliably used for any aggregation/counting, except for picking
out interesting individual examples for case studies.
- Measures should be taken to ensure that people don't accidentally enable
it in their quick-running direct runner unit tests, causing lots of traffic.
- I would not dismiss the possibility of spam and attacks.

I'd recommend to start by listing the questions we're hoping to answer
using the collected feedback, and then judging whether the proposed method
indeed allows answering them while respecting the users' privacy.
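The "only data Beam developers can guarantee is never sensitive" rule above can be sketched as a strict allowlist filter over the options. Everything below is invented for illustration (the field names and function are not an existing Beam API); the point is that unlisted fields are dropped outright, so a user mistake cannot leak anything we never stored.

```python
# Hypothetical sketch of the allowlist idea: report only fields that are
# guaranteed non-sensitive. Names here are illustrative, not a Beam API.
SAFE_OPTIONS = {"runner", "sdk_version", "streaming"}  # assumed-safe fields

def sanitize_options(all_options: dict) -> dict:
    """Keep only allowlisted keys. Paths, credentials, and connection
    strings are dropped rather than masked, so nothing sensitive is ever
    stored, even by accident."""
    return {k: v for k, v in all_options.items() if k in SAFE_OPTIONS}

opts = {
    "runner": "DirectRunner",
    "sdk_version": "2.15.0",
    "db_password": "hunter2",          # must never leave the user's machine
    "input": "/secret/path/data.csv",  # paths can be sensitive too
}
print(sanitize_options(opts))  # {'runner': 'DirectRunner', 'sdk_version': '2.15.0'}
```

A denylist would invite exactly the mistake described above; an allowlist fails closed.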

On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik  wrote:

> One of the options could be to just display the URL and not to phone home.
> I would like it so that users can integrate this into their deployment
> solution so we get regular stats instead of only when a user decides to run
> a pipeline manually.
>
> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
> wrote:
>
>> I think the goal is to lower the barrier of entry. Displaying a URL to
>> click on while waiting for your pipeline to start up, that contains
>> all the data explicitly visible, is about as easy as it gets.
>> Remembering to run a new (probably not as authentic) pipeline with
>> that flag is less so.
>>
>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
>> wrote:
>> >
> >> > I'm with Luke on this. We can add a set of flags to send home stats and
> crash dumps if the user agrees. If we keep the code isolated, it will be easy
> enough for the user to check what is being sent.
>> >
> >> > One more heavy-weight option is to also allow the user to configure and
> >> persist what information they are OK with sharing.
>> >
>> > --Mikhail
>> >
>> >
>> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
>> >>
>> >> Why not add a flag to the SDK that would do the phone home when
>> specified?
>> >>
>> >> From a support perspective it would be useful to know:
>> >> * SDK version
>> >> * Runner
>> >> * SDK provided PTransforms that are used
>> >> * Features like user state/timers/side inputs/splittable dofns/...
>> >> * Graph complexity (# nodes, # branches, ...)
>> >> * Pipeline failed or succeeded
>> >>
>> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
>> wrote:
>> >>> >
>> >>> > Would people actually click on that link though? I think Kyle has a
>> point that in practice users would only find and click on that link when
>> they're having some kind of issue, especially if the link has "feedback" in
>> it.
>> >>>
>> >>> I think the idea is that we would make the link very light-weight,
>> >>> kind of like a survey (but even easier as it's pre-populated).
>> >>> Basically an opt-in phone-home. If we don't collect any personal data
>> >>> (not even IP/geo, just (say) version + runner, all visible in the
>> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
>> >>> think we have to worry about spammers and ballot stuffers given the
>> >>> target audience). If we can catch people while they wait for their
>> >>> pipeline to start up (and/or complete), this is a great time to get
>> >>> some feedback.
>> >>>
>> >>> > I agree usage data would be really valuable, but I'm not sure that
>> this approach would get us good data. Is there a way to get download

Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
One of the options could be to just display the URL and not to phone home.
I would like it so that users can integrate this into their deployment
solution so we get regular stats instead of only when a user decides to run
a pipeline manually.
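The "display the URL, don't phone home" idea can be sketched in a few lines: pre-populate a feedback link from a couple of non-sensitive fields and print it at launch, transmitting nothing unless the user opens it. The `feedback.beam.org` host comes from Ankur's example further down the thread and is illustrative only.

```python
# Hypothetical sketch: build a pre-populated feedback link and print it at
# pipeline launch. No request is made; the user decides whether to click.
from urllib.parse import urlencode

def feedback_url(sdk_version: str, runner: str) -> str:
    # Only fields the user can see in their own console go into the link.
    query = urlencode({"sdk_version": sdk_version, "runner": runner})
    return f"http://feedback.beam.org?{query}"

print("Please use", feedback_url("2.15.0", "DirectRunner"))
```

A deployment tool could log this same URL on every automated run, addressing the "regular stats" wish without any automatic reporting.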

On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
wrote:

> I think the goal is to lower the barrier of entry. Displaying a URL to
> click on while waiting for your pipeline to start up, that contains
> all the data explicitly visible, is about as easy as it gets.
> Remembering to run a new (probably not as authentic) pipeline with
> that flag is less so.
>
> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
> wrote:
> >
> > I'm with Luke on this. We can add a set of flags to send home stats and
> crash dumps if the user agrees. If we keep the code isolated, it will be easy
> enough for the user to check what is being sent.
> >
> > One more heavy-weight option is to also allow the user to configure and
> persist what information they are OK with sharing.
> >
> > --Mikhail
> >
> >
> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
> >>
> >> Why not add a flag to the SDK that would do the phone home when
> specified?
> >>
> >> From a support perspective it would be useful to know:
> >> * SDK version
> >> * Runner
> >> * SDK provided PTransforms that are used
> >> * Features like user state/timers/side inputs/splittable dofns/...
> >> * Graph complexity (# nodes, # branches, ...)
> >> * Pipeline failed or succeeded
> >>
> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
> wrote:
> >>>
> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
> wrote:
> >>> >
> >>> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
> >>>
> >>> I think the idea is that we would make the link very light-weight,
> >>> kind of like a survey (but even easier as it's pre-populated).
> >>> Basically an opt-in phone-home. If we don't collect any personal data
> >>> (not even IP/geo, just (say) version + runner, all visible in the
> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
> >>> think we have to worry about spammers and ballot stuffers given the
> >>> target audience). If we can catch people while they wait for their
> >>> pipeline to start up (and/or complete), this is a great time to get
> >>> some feedback.
> >>>
> >>> > I agree usage data would be really valuable, but I'm not sure that
> this approach would get us good data. Is there a way to get download
> statistics for the different runner artifacts? Maybe that could be a better
> metric to compare usage.
> >>>
> >>> This'd be useful too, but hard to get and very noisy.
> >>>
> >>> >
> >>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka 
> wrote:
> >>> >>
> >>> >> I agree, these are the questions that need to be answered.
> >>> >> The data can be anonymized and stored as public data in BigQuery or
> some other place.
> >>> >>
> >>> >> The intent is to get usage statistics so that we can learn whether
> people are using Flink or Spark, etc.; it is not intended as a discussion
> or help channel.
> >>> >> I also think that we don't need to monitor this actively as it's
> more like a survey rather than active channel to get issues resolved.
> >>> >>
> >>> >> If we think it's useful for the community, then we can come up with a
> solution for how to do this (similar to how we released the container
> images).
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>> >>>
> >>> >>> There are some logistics that would need to be worked out. For example,
> Where would the data go? Who would own it?
> >>> >>>
> >>> >>> Also, I'm not convinced we need yet another place to discuss Beam
> when we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> are as useful for guiding the future of Beam as discussions on the
> dev/users lists, etc. as the latter likely result in more depth/specific
> feedback.
> >>> >>>
> >>> >>> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
> >>> >>>
> >>> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>> >>>
> >>> >>>
> >>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
> wrote:
> >>> 
> >>>  Hi,
> >>> 
> >>>  At the moment we don't really have a good way to collect any
> usage statistics for Apache Beam (e.g. which runner is used), and many
> users don't really have a way to report their use case.
> >>>  How about if we 

Re: Collecting feedback for Beam usage

2019-09-24 Thread Mikhail Gryzykhin
I'm with Luke on this. We can add a set of flags to send home stats and
crash dumps if the user agrees. If we keep the code isolated, it will be easy
enough for the user to check what is being sent.

One more heavy-weight option is to also allow the user to configure and
persist what information they are OK with sharing.
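This heavier-weight option could look like a small consent file the user edits once and that is consulted before anything is sent. The file location and field names below are invented for illustration; nothing like this exists in Beam.

```python
# Hypothetical sketch of persisted, per-field consent. Strict opt-in: a
# missing or unreadable file means no consent at all.
import json
import os

CONSENT_PATH = os.path.expanduser("~/.beam/feedback_consent.json")

DEFAULT_CONSENT = {
    "share_sdk_version": False,  # everything defaults to "do not share"
    "share_runner": False,
    "share_crash_dumps": False,
}

def load_consent(path: str = CONSENT_PATH) -> dict:
    """Return the user's stored consent choices, falling back to all-False."""
    try:
        with open(path) as f:
            stored = json.load(f)
    except (OSError, ValueError):
        return dict(DEFAULT_CONSENT)
    # Unknown keys are ignored; absent keys stay False.
    return {k: bool(stored.get(k, False)) for k in DEFAULT_CONSENT}
```

Keeping the schema to a handful of booleans makes it easy for the user to audit, which is the whole point of isolating this code.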

--Mikhail


On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:

> Why not add a flag to the SDK that would do the phone home when specified?
>
> From a support perspective it would be useful to know:
> * SDK version
> * Runner
> * SDK provided PTransforms that are used
> * Features like user state/timers/side inputs/splittable dofns/...
> * Graph complexity (# nodes, # branches, ...)
> * Pipeline failed or succeeded
>
> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
> wrote:
>
>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
>> wrote:
>> >
>> > Would people actually click on that link though? I think Kyle has a
>> point that in practice users would only find and click on that link when
>> they're having some kind of issue, especially if the link has "feedback" in
>> it.
>>
>> I think the idea is that we would make the link very light-weight,
>> kind of like a survey (but even easier as it's pre-populated).
>> Basically an opt-in phone-home. If we don't collect any personal data
>> (not even IP/geo, just (say) version + runner, all visible in the
>> URL), no need to guard/anonymize (and this may be sufficient--I don't
>> think we have to worry about spammers and ballot stuffers given the
>> target audience). If we can catch people while they wait for their
>> pipeline to start up (and/or complete), this is a great time to get
>> some feedback.
>>
>> > I agree usage data would be really valuable, but I'm not sure that this
>> approach would get us good data. Is there a way to get download statistics
>> for the different runner artifacts? Maybe that could be a better metric to
>> compare usage.
>>
>> This'd be useful too, but hard to get and very noisy.
>>
>> >
>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
>> >>
>> >> I agree, these are the questions that need to be answered.
>> >> The data can be anonymized and stored as public data in BigQuery or
>> some other place.
>> >>
>> >> The intent is to get usage statistics so that we can learn whether
>> people are using Flink or Spark, etc.; it is not intended as a discussion
>> or help channel.
>> >> I also think that we don't need to monitor this actively as it's more
>> like a survey rather than active channel to get issues resolved.
>> >>
>> >> If we think it's useful for the community, then we can come up with a
>> solution for how to do this (similar to how we released the container
>> images).
>> >>
>> >>
>> >>
>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
>> wrote:
>> >>>
>> >>> There are some logistics that would need to be worked out. For example,
>> Where would the data go? Who would own it?
>> >>>
>> >>> Also, I'm not convinced we need yet another place to discuss Beam
>> when we already have discussed the challenge of simultaneously monitoring
>> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
>> certainly an interesting question, and I'd be curious to know that >= X
>> many people use a certain runner, I'm not sure answers to these questions
>> are as useful for guiding the future of Beam as discussions on the
>> dev/users lists, etc. as the latter likely result in more depth/specific
>> feedback.
>> >>>
>> >>> However, I do think it could be useful in general to include links
>> directly in the console output. For example, maybe something along the
>> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
>> mailing list."
>> >>>
>> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>> >>>
>> >>>
>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
>> wrote:
>> 
>>  Hi,
>> 
>>  At the moment we don't really have a good way to collect any usage
>> statistics for Apache Beam (e.g. which runner is used), and many users
>> don't really have a way to report their use case.
>>  How about if we create a feedback page where users can add their
>> pipeline details and use case.
>>  Also, we can start printing the link to this page when a user launches
>> the pipeline on the command line.
>>  Example:
>>  $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>> 
>>  Starting pipeline
>>  Please use
>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>>  Pipeline started
>>  ..
>> 
>>  Using a link and not publishing the data automatically will give
>> user control over what they publish and what they don't. We can enhance the
>> text and usage further, but the basic idea is to ask for user feedback at
>> each run of the pipeline.
>>  Let me know what you think.
>> 
>> 
>>  Thanks,
>>  Ankur
>>
>


Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
Why not add a flag to the SDK that would do the phone home when specified?

From a support perspective it would be useful to know:
* SDK version
* Runner
* SDK provided PTransforms that are used
* Features like user state/timers/side inputs/splittable dofns/...
* Graph complexity (# nodes, # branches, ...)
* Pipeline failed or succeeded
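The opt-in flag plus the field list above can be sketched as follows. The flag name, report fields, and values are illustrative assumptions, not an existing Beam option; the point is that the report contains exactly these fields and nothing else, and that nothing is assembled unless the flag is set.

```python
# Hypothetical sketch of an explicit opt-in usage-report flag. Neither
# --report_usage nor UsageReport exists in Beam; both are for illustration.
import argparse
from dataclasses import dataclass, asdict

@dataclass
class UsageReport:
    sdk_version: str
    runner: str
    transforms_used: list   # SDK-provided PTransforms seen in the graph
    features_used: list     # state, timers, side inputs, splittable DoFns, ...
    num_nodes: int          # graph complexity
    num_branches: int
    succeeded: bool

def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--report_usage", action="store_true",
                        help="Opt in to sending an anonymous usage report.")
    # parse_known_args so unrelated pipeline options pass through untouched
    return parser.parse_known_args(argv)[0]

args = parse_args(["--report_usage"])
if args.report_usage:  # nothing is collected unless explicitly requested
    report = UsageReport("2.15.0", "FlinkRunner",
                         ["ReadFromText", "GroupByKey"],
                         ["side_inputs"], 12, 3, True)
    print(asdict(report))
```

Keeping the report a flat dataclass of primitives makes "what is being sent" trivially auditable, which addresses Mikhail's isolation point later in the thread.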

On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw  wrote:

> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette  wrote:
> >
> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
>
> I think the idea is that we would make the link very light-weight,
> kind of like a survey (but even easier as it's pre-populated).
> Basically an opt-in phone-home. If we don't collect any personal data
> (not even IP/geo, just (say) version + runner, all visible in the
> URL), no need to guard/anonymize (and this may be sufficient--I don't
> think we have to worry about spammers and ballot stuffers given the
> target audience). If we can catch people while they wait for their
> pipeline to start up (and/or complete), this is a great time to get
> some feedback.
>
> > I agree usage data would be really valuable, but I'm not sure that this
> approach would get us good data. Is there a way to get download statistics
> for the different runner artifacts? Maybe that could be a better metric to
> compare usage.
>
> This'd be useful too, but hard to get and very noisy.
>
> >
> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
> >>
> >> I agree, these are the questions that need to be answered.
> >> The data can be anonymized and stored as public data in BigQuery or some
> other place.
> >>
> >> The intent is to get usage statistics so that we can learn what
> people are using (Flink, Spark, etc.); it is not intended as a discussion or
> a help channel.
> >> I also think that we don't need to monitor this actively, as it's more
> like a survey than an active channel for getting issues resolved.
> >>
> >> If we think it's useful for the community, then we can come up with a
> solution for how to do this (similar to how we released the container
> images).
> >>
> >>
> >>
> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>>
> >>> There are some logistics that would need to be worked out. For example:
> where would the data go? Who would own it?
> >>>
> >>> Also, I'm not convinced we need yet another place to discuss Beam when
> we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> are as useful for guiding the future of Beam as discussions on the
> dev/users lists, etc. as the latter likely result in more depth/specific
> feedback.
> >>>
> >>> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
> >>>
> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
> wrote:
> 
>  Hi,
> 
>  At the moment we don't really have a good way to collect any usage
> statistics for Apache Beam, like which runner is used, etc., and many users
> don't really have a way to report their use case.
>  How about we create a feedback page where users can add their
> pipeline details and use case?
>  Also, we can start printing a link to this page when a user launches
> the pipeline from the command line.
>  Example:
>  $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
> 
>  Starting pipeline
>  Please use
> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>  Pipeline started
>  ..
> 
>  Using a link and not publishing the data automatically will give users
> control over what they publish and what they don't. We can enhance the text
> and usage further, but the basic idea is to ask for user feedback at each run
> of the pipeline.
>  Let me know what you think.
> 
> 
>  Thanks,
>  Ankur
>


Re: Collecting feedback for Beam usage

2019-09-23 Thread Robert Bradshaw
On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette  wrote:
>
> Would people actually click on that link though? I think Kyle has a point 
> that in practice users would only find and click on that link when they're 
> having some kind of issue, especially if the link has "feedback" in it.

I think the idea is that we would make the link very light-weight,
kind of like a survey (but even easier as it's pre-populated).
Basically an opt-in phone-home. If we don't collect any personal data
(not even IP/geo, just (say) version + runner, all visible in the
URL), no need to guard/anonymize (and this may be sufficient--I don't
think we have to worry about spammers and ballot stuffers given the
target audience). If we can catch people while they wait for their
pipeline to start up (and/or complete), this is a great time to get
some feedback.

> I agree usage data would be really valuable, but I'm not sure that this 
> approach would get us good data. Is there a way to get download statistics 
> for the different runner artifacts? Maybe that could be a better metric to 
> compare usage.

This'd be useful too, but hard to get and very noisy.

>
> On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
>>
>> I agree, these are the questions that need to be answered.
>> The data can be anonymized and stored as public data in BigQuery or some
>> other place.
>>
>> The intent is to get usage statistics so that we can learn what
>> people are using (Flink, Spark, etc.); it is not intended as a discussion or a
>> help channel.
>> I also think that we don't need to monitor this actively, as it's more like
>> a survey than an active channel for getting issues resolved.
>>
>> If we think it's useful for the community, then we can come up with a solution
>> for how to do this (similar to how we released the container images).
>>
>>
>>
>> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver  wrote:
>>>
>>> There are some logistics that would need to be worked out. For example:
>>> where would the data go? Who would own it?
>>>
>>> Also, I'm not convinced we need yet another place to discuss Beam when we 
>>> already have discussed the challenge of simultaneously monitoring mailing 
>>> lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly 
>>> an interesting question, and I'd be curious to know that >= X many people 
>>> use a certain runner, I'm not sure answers to these questions are as useful 
>>> for guiding the future of Beam as discussions on the dev/users lists, etc. 
>>> as the latter likely result in more depth/specific feedback.
>>>
>>> However, I do think it could be useful in general to include links directly 
>>> in the console output. For example, maybe something along the lines of "Oh 
>>> no, your Flink pipeline crashed! Check Jira/file a bug/ask the mailing 
>>> list."
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>
>>>
>>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:

 Hi,

 At the moment we don't really have a good way to collect any usage
 statistics for Apache Beam, like which runner is used, etc., and many users
 don't really have a way to report their use case.
 How about we create a feedback page where users can add their pipeline
 details and use case?
 Also, we can start printing a link to this page when a user launches the
 pipeline from the command line.
 Example:
 $ python my_pipeline.py --runner DirectRunner --input /tmp/abc

 Starting pipeline
 Please use http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
 Pipeline started
 ..

 Using a link and not publishing the data automatically will give users
 control over what they publish and what they don't. We can enhance the
 text and usage further, but the basic idea is to ask for user feedback at
 each run of the pipeline.
 Let me know what you think.


 Thanks,
 Ankur


Re: Collecting feedback for Beam usage

2019-09-23 Thread Chad Dombrova
A survey would be a good place to start.  This came up in the
python2-sunsetting thread as well: we don't know what versions of python
people are using with Beam, which makes it difficult to answer the question
of support.

-chad


On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:

> I agree, these are the questions that need to be answered.
> The data can be anonymized and stored as public data in BigQuery or some
> other place.
>
> The intent is to get usage statistics so that we can learn what
> people are using (Flink, Spark, etc.); it is not intended as a discussion or a
> help channel.
> I also think that we don't need to monitor this actively, as it's more like
> a survey than an active channel for getting issues resolved.
>
> If we think it's useful for the community, then we can come up with a solution
> for how to do this (similar to how we released the container images).
>
>
>
> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver  wrote:
>
>> There are some logistics that would need to be worked out. For example:
>> where would the data go? Who would own it?
>>
>> Also, I'm not convinced we need yet another place to discuss Beam when we
>> already have discussed the challenge of simultaneously monitoring mailing
>> lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly
>> an interesting question, and I'd be curious to know that >= X many people
>> use a certain runner, I'm not sure answers to these questions are as useful
>> for guiding the future of Beam as discussions on the dev/users lists, etc.
>> as the latter likely result in more depth/specific feedback.
>>
>> However, I do think it could be useful in general to include links
>> directly in the console output. For example, maybe something along the
>> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
>> mailing list."
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:
>>
>>> Hi,
>>>
>>> At the moment we don't really have a good way to collect any usage
>>> statistics for Apache Beam, like which runner is used, etc., and many users
>>> don't really have a way to report their use case.
>>> How about we create a feedback page where users can add their
>>> pipeline details and use case?
>>> Also, we can start printing a link to this page when a user launches the
>>> pipeline from the command line.
>>> Example:
>>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>>>
>>> Starting pipeline
>>> Please use
>>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>>> Pipeline started
>>> ..
>>>
>>> Using a link and not publishing the data automatically will give users
>>> control over what they publish and what they don't. We can enhance the text
>>> and usage further, but the basic idea is to ask for user feedback at each run
>>> of the pipeline.
>>> Let me know what you think.
>>>
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Re: Collecting feedback for Beam usage

2019-09-23 Thread Brian Hulette
Would people actually click on that link though? I think Kyle has a point
that in practice users would only find and click on that link when they're
having some kind of issue, especially if the link has "feedback" in it.

I agree usage data would be really valuable, but I'm not sure that this
approach would get us good data. Is there a way to get download statistics
for the different runner artifacts? Maybe that could be a better metric to
compare usage.

On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:

> I agree, these are the questions that need to be answered.
> The data can be anonymized and stored as public data in BigQuery or some
> other place.
>
> The intent is to get usage statistics so that we can learn what
> people are using (Flink, Spark, etc.); it is not intended as a discussion or a
> help channel.
> I also think that we don't need to monitor this actively, as it's more like
> a survey than an active channel for getting issues resolved.
>
> If we think it's useful for the community, then we can come up with a solution
> for how to do this (similar to how we released the container images).
>
>
>
> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver  wrote:
>
>> There are some logistics that would need to be worked out. For example:
>> where would the data go? Who would own it?
>>
>> Also, I'm not convinced we need yet another place to discuss Beam when we
>> already have discussed the challenge of simultaneously monitoring mailing
>> lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly
>> an interesting question, and I'd be curious to know that >= X many people
>> use a certain runner, I'm not sure answers to these questions are as useful
>> for guiding the future of Beam as discussions on the dev/users lists, etc.
>> as the latter likely result in more depth/specific feedback.
>>
>> However, I do think it could be useful in general to include links
>> directly in the console output. For example, maybe something along the
>> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
>> mailing list."
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:
>>
>>> Hi,
>>>
>>> At the moment we don't really have a good way to collect any usage
>>> statistics for Apache Beam, like which runner is used, etc., and many users
>>> don't really have a way to report their use case.
>>> How about we create a feedback page where users can add their
>>> pipeline details and use case?
>>> Also, we can start printing a link to this page when a user launches the
>>> pipeline from the command line.
>>> Example:
>>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>>>
>>> Starting pipeline
>>> Please use
>>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>>> Pipeline started
>>> ..
>>>
>>> Using a link and not publishing the data automatically will give users
>>> control over what they publish and what they don't. We can enhance the text
>>> and usage further, but the basic idea is to ask for user feedback at each run
>>> of the pipeline.
>>> Let me know what you think.
>>>
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Re: Collecting feedback for Beam usage

2019-09-23 Thread Ankur Goenka
I agree, these are the questions that need to be answered.
The data can be anonymized and stored as public data in BigQuery or some
other place.

The intent is to get usage statistics so that we can learn what
people are using (Flink, Spark, etc.); it is not intended as a discussion or a
help channel.
I also think that we don't need to monitor this actively, as it's more like
a survey than an active channel for getting issues resolved.

If we think it's useful for the community, then we can come up with a solution
for how to do this (similar to how we released the container images).



On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver  wrote:

> There are some logistics that would need to be worked out. For example:
> where would the data go? Who would own it?
>
> Also, I'm not convinced we need yet another place to discuss Beam when we
> already have discussed the challenge of simultaneously monitoring mailing
> lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly
> an interesting question, and I'd be curious to know that >= X many people
> use a certain runner, I'm not sure answers to these questions are as useful
> for guiding the future of Beam as discussions on the dev/users lists, etc.
> as the latter likely result in more depth/specific feedback.
>
> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> At the moment we don't really have a good way to collect any usage
>> statistics for Apache Beam, like which runner is used, etc., and many users
>> don't really have a way to report their use case.
>> How about we create a feedback page where users can add their pipeline
>> details and use case?
>> Also, we can start printing a link to this page when a user launches the
>> pipeline from the command line.
>> Example:
>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>>
>> Starting pipeline
>> Please use
>> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>> Pipeline started
>> ..
>>
>> Using a link and not publishing the data automatically will give users
>> control over what they publish and what they don't. We can enhance the text
>> and usage further, but the basic idea is to ask for user feedback at each run
>> of the pipeline.
>> Let me know what you think.
>>
>>
>> Thanks,
>> Ankur
>>
>


Re: Collecting feedback for Beam usage

2019-09-20 Thread Kyle Weaver
There are some logistics that would need to be worked out. For example:
where would the data go? Who would own it?

Also, I'm not convinced we need yet another place to discuss Beam when we
already have discussed the challenge of simultaneously monitoring mailing
lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly
an interesting question, and I'd be curious to know that >= X many people
use a certain runner, I'm not sure answers to these questions are as useful
for guiding the future of Beam as discussions on the dev/users lists, etc.
as the latter likely result in more depth/specific feedback.

However, I do think it could be useful in general to include links directly
in the console output. For example, maybe something along the lines of "Oh
no, your Flink pipeline crashed! Check Jira/file a bug/ask the mailing
list."
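A minimal sketch of that kind of failure hint follows; the helper name is invented, and the link targets are placeholders for whatever destinations the community settles on:

```python
import sys

# Hypothetical help destinations; the real links would need to be agreed on.
HELP_LINKS = {
    "bug": "https://issues.apache.org/jira/projects/BEAM",
    "list": "https://beam.apache.org/community/contact-us/",
}

def print_failure_hint(runner, error):
    """Print a short, actionable message when a pipeline crashes."""
    print(f"Oh no, your {runner} pipeline crashed: {error}", file=sys.stderr)
    print(f"Check Jira / file a bug: {HELP_LINKS['bug']}", file=sys.stderr)
    print(f"Or ask the mailing list: {HELP_LINKS['list']}", file=sys.stderr)

print_failure_hint("Flink", RuntimeError("checkpoint timeout"))
```

Unlike a generic feedback link, this only fires on failure, which is exactly when the user is looking at the console output.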

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka  wrote:

> Hi,
>
> At the moment we don't really have a good way to collect any usage
> statistics for Apache Beam, like which runner is used, etc., and many users
> don't really have a way to report their use case.
> How about we create a feedback page where users can add their pipeline
> details and use case?
> Also, we can start printing a link to this page when a user launches the
> pipeline from the command line.
> Example:
> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
>
> Starting pipeline
> Please use
> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
> Pipeline started
> ..
>
> Using a link and not publishing the data automatically will give users
> control over what they publish and what they don't. We can enhance the text
> and usage further, but the basic idea is to ask for user feedback at each run
> of the pipeline.
> Let me know what you think.
>
>
> Thanks,
> Ankur
>


Collecting feedback for Beam usage

2019-09-20 Thread Ankur Goenka
Hi,

At the moment we don't really have a good way to collect any usage
statistics for Apache Beam, like which runner is used, etc., and many users
don't really have a way to report their use case.
How about we create a feedback page where users can add their pipeline
details and use case?
Also, we can start printing a link to this page when a user launches the
pipeline from the command line.
Example:
$ python my_pipeline.py --runner DirectRunner --input /tmp/abc

Starting pipeline
Please use http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
Pipeline started
..

Using a link and not publishing the data automatically will give users
control over what they publish and what they don't. We can enhance the text
and usage further, but the basic idea is to ask for user feedback at each run
of the pipeline.
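One way to build such a pre-populated link; the feedback endpoint is the illustrative one from the example above, and this only sketches the URL construction:

```python
from urllib.parse import urlencode

FEEDBACK_BASE = "http://feedback.beam.org"  # placeholder endpoint from the example

def feedback_url(options):
    """Build a pre-populated feedback link from user-visible pipeline options."""
    # Sort keys so the printed link is stable across runs.
    query = urlencode(sorted(options.items()))
    return f"{FEEDBACK_BASE}?{query}" if options else FEEDBACK_BASE

print("Please use", feedback_url({"runner": "DirectRunner", "input": "/tmp/abc"}))
# prints: Please use http://feedback.beam.org?input=%2Ftmp%2Fabc&runner=DirectRunner
```

Because the parameters are visible in the URL itself, a user can inspect or trim them before following the link, which keeps the control in their hands.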
Let me know what you think.


Thanks,
Ankur