Re: Collecting feedback for Beam usage

Kenneth Knowles Thu, 26 Sep 2019 17:21:30 -0700

Ah, I didn't realize pypi was already collecting py2 vs py3. That saves
having to split artifacts.
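
For reference, a rough sketch of how the py2 vs py3 split could be computed from pypistats. It assumes the site's JSON API exposes a `python_major` endpoint returning rows with `category`, `date`, and `downloads` fields. The helper names are made up for illustration, and the network call is kept separate so the aggregation can be tried offline.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_python_major(package):
    """Fetch per-Python-major download rows (assumed pypistats endpoint)."""
    url = f"https://pypistats.org/api/packages/{package}/python_major"
    with urlopen(url) as resp:
        return json.load(resp)


def share_by_major(payload):
    """Return each Python major version's share of total downloads.

    Assumes payload["data"] is a list of rows shaped like
    {"category": "2", "date": "2019-09-01", "downloads": 123}.
    """
    totals = defaultdict(int)
    for row in payload["data"]:
        totals[row["category"]] += row["downloads"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {major: count / grand_total for major, count in totals.items()}


# Usage (requires network access):
#   shares = share_by_major(fetch_python_major("apache-beam"))
```

These are download counts, not users, so the shares only hint at the trend.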


Kenn

On Thu, Sep 26, 2019 at 5:03 PM Robert Bradshaw <rober...@google.com> wrote:

> Pypi download statistics are freely available at
> https://pypistats.org/packages/apache-beam . (To answer the original
> question, nearly all Python 2 at this point, but starting to show a
> drop.)
>
> I think the goal is to get more/orthogonal coverage than a twitter
> poll or waiting for users to speak up on the lists. Getting accurate
> stats (without violating many of the principles we all find valuable)
> would be much more difficult, if even possible. In this sense, the
> bias against a large number of production/automated runs doesn't hurt
> the goal of capturing the attention (needed if it's opt-in) of a large
> number of developers.
>
> On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > Agreeing with many things here and my own flavor to the points:
> > 1. Users' privacy is more important than anything else
> > 2. The goal should be to make things better for users
> > 3. Trading users' opt-in for functionality (like Gradle scans) is not
> acceptable
> > 4. It should be effectively invisible to users who are not interested
> > 5. Ideally, we could find some people with expertise in (a) data
> gathering (b) usability (c) privacy (d) whatever we did not think of
> because it is not our expertise. So that we have confidence that our
> results are meaningful and we have done no harm.
> >
> > Some obvious data biases have been mentioned. Here's some more: a lot of
> Beam usage is probably through automation (such as k8s, cron, Airflow, etc)
> where a user is not present when a pipeline is launched. Logging would do
> nothing in these cases, except in case of a failure being debugged. I would
> guess this is the common case. The more a user is actually using Beam in
> production, the less likely they are watching job startup logs. Probably
> many companies use Beam to build a platform for their own users, so
> analytics may not capture the number of actual users in any meaningful way.
> Etc.
> >
> > Certainly, having a sense of the impact of changes like "deprecate
> Python 2" or "make breaking change to pipeline options for old FlinkRunner"
> would be extremely useful, both to us and to our users. We just need to be
> careful. And we must be ready to accept if this is not possible to learn in
> an OK way.
> >
> > I agree with Brian. Download statistics could be a good start for some
> broad questions. We could consider tailoring our convenience binaries to
> facilitate better data gathering, such as a separate py3 pypi coordinate.
> Download stats on released container images could be a way to do this
> without inconveniencing users.
> >
> > Kenn
> >
> > On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
> >>
> >> Creating a central place for collecting Beam usage sounds compelling,
> but we'd have to be careful about several aspects:
> >> - It goes without saying that this can never be on-by-default, even for
> a tiny fraction of pipelines.
> >> - For further privacy protection, including the user's PipelineOptions
> is probably out of the question too: people might be including very
> sensitive data in their PipelineOptions (such as database passwords) and we
> wouldn't want to end up storing that data even due to a user's mistake. The
> only data that can be stored is data that Beam developers can guarantee is
> never sensitive, or data intentionally authored by a human for the purpose
> of reporting it to us (e.g. a hand-typed feedback message).
> >> - If it requires the user manually clicking the link, then it will not
> collect data about automated invocations of any pipelines, whereas likely
> almost all practical invocations are automated - the difference between
> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> >> - Moreover, many practical invocations likely go through an
> intermediate library / product, such as scio or talend. There'd need to be
> a story for library developers to offer this capability to their users.
> >> - The condition "was feedback reported for this pipeline", regardless
> of whether it is reported manually (by clicking the link) or automatically
> (by explicitly enabling some flag), heavily biases the sample - people are
> unlikely to click the link if the pipeline works fine (and almost all
> production pipelines work fine, otherwise they wouldn't be in production),
> and I don't know what considerations would prompt somebody to enable the
> flag for a periodic production pipeline. Meaning, the collected data likely
> can not be reliably used for any aggregation/counting, except for picking
> out interesting individual examples for case studies.
> >> - Measures should be taken to ensure that people don't accidentally
> enable it in their quick-running direct runner unit tests, causing lots of
> traffic.
> >> - I would not dismiss the possibility of spam and attacks.
> >>
> >> I'd recommend starting by listing the questions we're hoping to answer
> using the collected feedback, and then judging whether the proposed method
> indeed allows answering them while respecting the users' privacy.
> >>
> >> On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>
> >>> One of the options could be to just display the URL and not to phone
> home. I would like it so that users can integrate this into their
> deployment solution so we get regular stats instead of only when a user
> decides to run a pipeline manually.
> >>>
> >>> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>>>
> >>>> I think the goal is to lower the barrier of entry. Displaying a URL to
> >>>> click on while waiting for your pipeline to start up, that contains
> >>>> all the data explicitly visible, is about as easy as it gets.
> >>>> Remembering to run a new (probably not as authentic) pipeline with
> >>>> that flag is less so.
> >>>>
> >>>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin <mig...@google.com>
> wrote:
> >>>> >
> >>>> > I'm with Luke on this. We can add a set of flags to send home stats
> and crash dumps if the user agrees. If we keep the code isolated, it will be
> easy enough for the user to check what is being sent.
> >>>> >
> >>>> > One more heavy-weight option is to also allow the user to configure and
> persist what information they are OK with sharing.
> >>>> >
> >>>> > --Mikhail
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik <lc...@google.com>
> wrote:
> >>>> >>
> >>>> >> Why not add a flag to the SDK that would do the phone home when
> specified?
> >>>> >>
> >>>> >> From a support perspective it would be useful to know:
> >>>> >> * SDK version
> >>>> >> * Runner
> >>>> >> * SDK provided PTransforms that are used
> >>>> >> * Features like user state/timers/side inputs/splittable dofns/...
> >>>> >> * Graph complexity (# nodes, # branches, ...)
> >>>> >> * Pipeline failed or succeeded
> >>>> >>
> >>>> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw <
> rober...@google.com> wrote:
> >>>> >>>
> >>>> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette <
> bhule...@google.com> wrote:
> >>>> >>> >
> >>>> >>> > Would people actually click on that link though? I think Kyle
> has a point that in practice users would only find and click on that link
> when they're having some kind of issue, especially if the link has
> "feedback" in it.
> >>>> >>>
> >>>> >>> I think the idea is that we would make the link very light-weight,
> >>>> >>> kind of like a survey (but even easier as it's pre-populated).
> >>>> >>> Basically an opt-in phone-home. If we don't collect any personal
> data
> >>>> >>> (not even IP/geo, just (say) version + runner, all visible in the
> >>>> >>> URL), no need to guard/anonymize (and this may be sufficient--I
> don't
> >>>> >>> think we have to worry about spammers and ballot stuffers given
> the
> >>>> >>> target audience). If we can catch people while they wait for their
> >>>> >>> pipeline to start up (and/or complete), this is a great time to
> get
> >>>> >>> some feedback.
> >>>> >>>
> >>>> >>> > I agree usage data would be really valuable, but I'm not sure
> that this approach would get us good data. Is there a way to get download
> statistics for the different runner artifacts? Maybe that could be a better
> metric to compare usage.
> >>>> >>>
> >>>> >>> This'd be useful too, but hard to get and very noisy.
> >>>> >>>
> >>>> >>> >
> >>>> >>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka <goe...@google.com>
> wrote:
> >>>> >>> >>
> >>>> >>> >> I agree, these are the questions that need to be answered.
> >>>> >>> >> The data can be anonymized and stored as public data in
> BigQuery or some other place.
> >>>> >>> >>
> >>>> >>> >> The intent is to get usage statistics so that we can get
> to know whether people are using Flink, Spark, etc. It is not intended as a
> discussion or help channel.
> >>>> >>> >> I also think that we don't need to monitor this actively as
> it's more like a survey than an active channel to get issues resolved.
> >>>> >>> >>
> >>>> >>> >> If we think it's useful for the community, then we can come up with
> a solution for how to do this (similar to how we released the
> container images).
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver <
> kcwea...@google.com> wrote:
> >>>> >>> >>>
> >>>> >>> >>> There are some logistics that would need to be worked out. For
> example, where would the data go? Who would own it?
> >>>> >>> >>>
> >>>> >>> >>> Also, I'm not convinced we need yet another place to discuss
> Beam when we already have discussed the challenge of simultaneously
> monitoring mailing lists, Stack Overflow, Slack, etc. While "how do you use
> Beam" is certainly an interesting question, and I'd be curious to know that
> >= X many people use a certain runner, I'm not sure answers to these
> questions are as useful for guiding the future of Beam as discussions on
> the dev/users lists, etc. as the latter likely result in more
> depth/specific feedback.
> >>>> >>> >>>
> >>>> >>> >>> However, I do think it could be useful in general to include
> links directly in the console output. For example, maybe something along
> the lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask
> the mailing list."
> >>>> >>> >>>
> >>>> >>> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>>> >>> >>>
> >>>> >>> >>>
> >>>> >>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka <
> goe...@google.com> wrote:
> >>>> >>> >>>>
> >>>> >>> >>>> Hi,
> >>>> >>> >>>>
> >>>> >>> >>>> At the moment we don't really have a good way to collect any
> usage statistics for Apache Beam, such as the runner used, because many
> users don't really have a way to report their use case.
> >>>> >>> >>>> How about we create a feedback page where users can add
> their pipeline details and use case?
> >>>> >>> >>>> Also, we can start printing the link to this page when a user
> launches the pipeline from the command line.
> >>>> >>> >>>> Example:
> >>>> >>> >>>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
> >>>> >>> >>>>
> >>>> >>> >>>> Starting pipeline
> >>>> >>> >>>> Please use http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
> >>>> >>> >>>> Pipeline started
> >>>> >>> >>>> ......
> >>>> >>> >>>>
> >>>> >>> >>>> Using a link and not publishing the data automatically will
> give users control over what they publish and what they don't. We can
> enhance the text and usage further, but the basic idea is to ask for user
> feedback at each run of the pipeline.
> >>>> >>> >>>> Let me know what you think.
> >>>> >>> >>>>
> >>>> >>> >>>>
> >>>> >>> >>>> Thanks,
> >>>> >>> >>>> Ankur
>
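
A minimal sketch of the pre-populated feedback link discussed in the thread, restricted to an allowlist so that arbitrary pipeline options (paths, passwords) never end up in the URL. The base URL is the illustrative one from the original proposal, not a real service. The field names `sdk_version` and `runner` and the function name are assumptions made for this example, and it uses ordinary query parameters rather than the `args=` form shown in the original message.

```python
from urllib.parse import urlencode

# Illustrative endpoint from the original proposal; not a real service.
FEEDBACK_BASE_URL = "http://feedback.beam.org"

# Hypothetical allowlist: only fields that can never carry sensitive data.
SAFE_FIELDS = ("sdk_version", "runner")


def build_feedback_url(options):
    """Build a feedback URL containing only allowlisted option values."""
    safe = {key: str(options[key]) for key in SAFE_FIELDS if key in options}
    return FEEDBACK_BASE_URL + "?" + urlencode(safe)


if __name__ == "__main__":
    # Everything that would be sent is visible in the printed link.
    print("Please use", build_feedback_url({
        "sdk_version": "2.15.0",
        "runner": "DirectRunner",
        "input": "/tmp/abc",        # dropped: not on the allowlist
        "db_password": "hunter2",   # dropped: not on the allowlist
    }))
```

An allowlist rather than a blocklist is used because, as noted in the thread, only data that can be guaranteed never to be sensitive should be included.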
