Ah, I didn't realize pypi was already collecting py2 vs py3. That saves having to split artifacts.

Kenn

On Thu, Sep 26, 2019 at 5:03 PM Robert Bradshaw <rober...@google.com> wrote:

> Pypi download statistics are freely available at
> https://pypistats.org/packages/apache-beam . (To answer the original
> question, nearly all Python 2 at this point, but starting to show a
> drop.)
>
> I think the goal is to get more/orthogonal coverage than a twitter
> poll or waiting for users to speak up on the lists. Getting accurate
> stats (without violating many of the principles we all find valuable)
> would be much more difficult, if even possible. In this sense, the
> bias against a large number of production/automated runs doesn't hurt
> the goal of capturing the attention (needed if it's opt-in) of a large
> number of developers.
>
> On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > Agreeing with many things here and my own flavor to the points:
> > 1. User's privacy is more important than anything else
> > 2. The goal should be to make things better for users
> > 3. Trading user's opt-in for functionality (like Gradle scans) is not acceptable
> > 4. It should be effectively invisible to users who are not interested
> > 5. Ideally, we could find some people with expertise in (a) data gathering (b) usability (c) privacy (d) whatever we did not think of because it is not our expertise. So that we have confidence that our results are meaningful and we have done no harm.
> >
> > Some obvious data biases have been mentioned. Here's some more: a lot of Beam usage is probably through automation (such as k8s, cron, Airflow, etc) where a user is not present when a pipeline is launched. Logging would do nothing in these cases, except in case of a failure being debugged. I would guess this is the common case. The more a user is actually using Beam in production, the less likely they are watching job startup logs.
> > Probably many companies use Beam to build a platform for their own users, so analytics may not capture the number of actual users in any meaningful way. Etc.
> >
> > Certainly, having a sense of the impact of changes like "deprecate Python 2" or "make breaking change to pipeline options for old FlinkRunner" would be extremely useful, both to us and to our users. We just need to be careful. And we must be ready to accept if this is not possible to learn in an OK way.
> >
> > I agree with Brian. Download statistics could be a good start for some broad questions. We could consider tailoring our convenience binaries to facilitate better data gathering, such as a separate py3 pypi coordinate. Download stats on released container images could be a way to do this without inconveniencing users.
> >
> > Kenn
> >
> > On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov <kirpic...@google.com> wrote:
> >>
> >> Creating a central place for collecting Beam usage sounds compelling, but we'd have to be careful about several aspects:
> >> - It goes without saying that this can never be on-by-default, even for a tiny fraction of pipelines.
> >> - For further privacy protection, including the user's PipelineOptions is probably out of the question too: people might be including very sensitive data in their PipelineOptions (such as database passwords) and we wouldn't want to end up storing that data even due to a user's mistake. The only data that can be stored is data that Beam developers can guarantee is never sensitive, or data intentionally authored by a human for the purpose of reporting it to us (e.g. a hand-typed feedback message).
> >> - If it requires the user manually clicking the link, then it will not collect data about automated invocations of any pipelines, whereas likely almost all practical invocations are automated - the difference between COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> >> - Moreover, many practical invocations likely go through an intermediate library / product, such as scio or talend. There'd need to be a story for library developers to offer this capability to their users.
> >> - The condition "was feedback reported for this pipeline", regardless of whether it is reported manually (by clicking the link) or automatically (by explicitly enabling some flag), heavily biases the sample - people are unlikely to click the link if the pipeline works fine (and almost all production pipelines work fine, otherwise they wouldn't be in production), and I don't know what considerations would prompt somebody to enable the flag for a periodic production pipeline. Meaning, the collected data likely cannot be reliably used for any aggregation/counting, except for picking out interesting individual examples for case studies.
> >> - Measures should be taken to ensure that people don't accidentally enable it in their quick-running direct runner unit tests, causing lots of traffic.
> >> - I would not dismiss the possibility of spam and attacks.
> >>
> >> I'd recommend starting by listing the questions we're hoping to answer using the collected feedback, and then judging whether the proposed method indeed allows answering them while respecting the users' privacy.
> >>
> >> On Tue, Sep 24, 2019 at 1:49 PM Lukasz Cwik <lc...@google.com> wrote:
> >>>
> >>> One of the options could be to just display the URL and not to phone home. I would like it so that users can integrate this into their deployment solution so we get regular stats instead of only when a user decides to run a pipeline manually.
> >>>
> >>> On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw <rober...@google.com> wrote:
> >>>>
> >>>> I think the goal is to lower the barrier of entry. Displaying a URL to
> >>>> click on while waiting for your pipeline to start up, that contains
> >>>> all the data explicitly visible, is about as easy as it gets.
> >>>> Remembering to run a new (probably not as authentic) pipeline with
> >>>> that flag is less so.
> >>>>
> >>>> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin <mig...@google.com> wrote:
> >>>> >
> >>>> > I'm with Luke on this. We can add a set of flags to send home stats and crash dumps if the user agrees. If we keep the code isolated, it will be easy enough for the user to check what is being sent.
> >>>> >
> >>>> > One more heavy-weight option is to also allow the user to configure and persist what information he is OK with sharing.
> >>>> >
> >>>> > --Mikhail
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik <lc...@google.com> wrote:
> >>>> >>
> >>>> >> Why not add a flag to the SDK that would do the phone home when specified?
> >>>> >>
> >>>> >> From a support perspective it would be useful to know:
> >>>> >> * SDK version
> >>>> >> * Runner
> >>>> >> * SDK provided PTransforms that are used
> >>>> >> * Features like user state/timers/side inputs/splittable dofns/...
> >>>> >> * Graph complexity (# nodes, # branches, ...)
> >>>> >> * Pipeline failed or succeeded
> >>>> >>
> >>>> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw <rober...@google.com> wrote:
> >>>> >>>
> >>>> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette <bhule...@google.com> wrote:
> >>>> >>> >
> >>>> >>> > Would people actually click on that link though? I think Kyle has a point that in practice users would only find and click on that link when they're having some kind of issue, especially if the link has "feedback" in it.
> >>>> >>>
> >>>> >>> I think the idea is that we would make the link very light-weight,
> >>>> >>> kind of like a survey (but even easier as it's pre-populated).
> >>>> >>> Basically an opt-in phone-home.
> >>>> >>> If we don't collect any personal data
> >>>> >>> (not even IP/geo, just (say) version + runner, all visible in the
> >>>> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
> >>>> >>> think we have to worry about spammers and ballot stuffers given the
> >>>> >>> target audience). If we can catch people while they wait for their
> >>>> >>> pipeline to start up (and/or complete), this is a great time to get
> >>>> >>> some feedback.
> >>>> >>>
> >>>> >>> > I agree usage data would be really valuable, but I'm not sure that this approach would get us good data. Is there a way to get download statistics for the different runner artifacts? Maybe that could be a better metric to compare usage.
> >>>> >>>
> >>>> >>> This'd be useful too, but hard to get and very noisy.
> >>>> >>>
> >>>> >>> >
> >>>> >>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka <goe...@google.com> wrote:
> >>>> >>> >>
> >>>> >>> >> I agree, these are the questions that need to be answered.
> >>>> >>> >> The data can be anonymized and stored as public data in BigQuery or some other place.
> >>>> >>> >>
> >>>> >>> >> The intent is to get usage statistics so that we can get to know whether people are using Flink, Spark, etc.; it is not intended for discussion or as a help channel.
> >>>> >>> >> I also think that we don't need to monitor this actively, as it's more like a survey than an active channel to get issues resolved.
> >>>> >>> >>
> >>>> >>> >> If we think it's useful for the community, then we can come up with a solution for how to do this (similar to how we released the container images).
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >>
> >>>> >>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver <kcwea...@google.com> wrote:
> >>>> >>> >>>
> >>>> >>> >>> There are some logistics that would need to be worked out. For example, where would the data go? Who would own it?
> >>>> >>> >>>
> >>>> >>> >>> Also, I'm not convinced we need yet another place to discuss Beam when we already have discussed the challenge of simultaneously monitoring mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is certainly an interesting question, and I'd be curious to know that >= X many people use a certain runner, I'm not sure answers to these questions are as useful for guiding the future of Beam as discussions on the dev/users lists, etc. as the latter likely result in more depth/specific feedback.
> >>>> >>> >>>
> >>>> >>> >>> However, I do think it could be useful in general to include links directly in the console output. For example, maybe something along the lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the mailing list."
> >>>> >>> >>>
> >>>> >>> >>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> >>>> >>> >>>
> >>>> >>> >>>
> >>>> >>> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka <goe...@google.com> wrote:
> >>>> >>> >>>>
> >>>> >>> >>>> Hi,
> >>>> >>> >>>>
> >>>> >>> >>>> At the moment we don't really have a good way to collect any usage statistics for Apache Beam, such as the runner used, since many users don't really have a way to report their use case.
> >>>> >>> >>>> How about we create a feedback page where users can add their pipeline details and use case?
> >>>> >>> >>>> Also, we can start printing the link to this page when a user launches the pipeline from the command line.
> >>>> >>> >>>> Example:
> >>>> >>> >>>> $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
> >>>> >>> >>>>
> >>>> >>> >>>> Starting pipeline
> >>>> >>> >>>> Please use http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
> >>>> >>> >>>> Pipeline started
> >>>> >>> >>>> ......
> >>>> >>> >>>>
> >>>> >>> >>>> Using a link and not publishing the data automatically will give the user control over what they publish and what they don't.
> >>>> >>> >>>> We can enhance the text and usage further but the basic idea is to ask for user feedback at each run of the pipeline.
> >>>> >>> >>>> Let me know what you think.
> >>>> >>> >>>>
> >>>> >>> >>>>
> >>>> >>> >>>> Thanks,
> >>>> >>> >>>> Ankur
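
For illustration only, here is a minimal Python sketch of the link-printing idea as it evolved in the thread: the link is printed, never sent, and it carries only allow-listed fields (SDK version and runner) rather than the full PipelineOptions. Nothing here is an existing Beam API. The function name `build_feedback_link`, the `SAFE_FIELDS` allow-list, and the sample option values are hypothetical, and the endpoint is simply the example URL from the original proposal, not a real service.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, copied from the example in the original proposal.
FEEDBACK_URL = "http://feedback.beam.org"

# Assumed allow-list: only fields that can never carry user data.
SAFE_FIELDS = ("sdk_version", "runner")


def build_feedback_link(options):
    """Return a feedback URL containing only allow-listed fields."""
    safe = {key: options[key] for key in SAFE_FIELDS if key in options}
    return FEEDBACK_URL + "?" + urlencode(safe)


if __name__ == "__main__":
    # Sample values only; "input" and "db_password" must not leak into the link.
    options = {
        "sdk_version": "2.15.0",
        "runner": "DirectRunner",
        "input": "/tmp/abc",
        "db_password": "hunter2",
    }
    # The link is only printed; nothing is sent unless the user opens it.
    print("Optional feedback link:", build_feedback_link(options))
```

An allow-list is used instead of a deny-list so that an option nobody anticipated (such as a password passed as a flag) is dropped by default.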