Re: Collecting feedback for Beam usage

2019-09-26 Thread Kenneth Knowles
Ah, I didn't realize pypi was already collecting py2 vs py3. That saves
having to split artifacts.

Kenn

On Thu, Sep 26, 2019 at 5:03 PM Robert Bradshaw  wrote:

> Pypi download statistics are freely available at
> https://pypistats.org/packages/apache-beam . (To answer the original
> question, nearly all Python 2 at this point, but starting to show a
> drop.)
>
> I think the goal is to get more/orthogonal coverage than a twitter
> poll or waiting for users to speak up on the lists. Getting accurate
> stats (without violating many of the principles we all find valuable)
> would be much more difficult, if even possible. In this sense, the
> bias against a large number of production/automated runs doesn't hurt
> the goal of capturing the attention (needed if it's opt-in) of a large
> number of developers.
>
> On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles  wrote:
> >
> > Agreeing with many things here and my own flavor to the points:
> > 1. User's privacy is more important than anything else
> > 2. The goal should be to make things better for users
> > 3. Trading user's opt-in for functionality (like Gradle scans) is not
> > acceptable
> > 4. It should be effectively invisible to users who are not interested
> > 5. Ideally, we could find some people with expertise in (a) data
> > gathering (b) usability (c) privacy (d) whatever we did not think of
> > because it is not our expertise. So that we have confidence that our
> > results are meaningful and we have done no harm.
> >
> > Some obvious data biases have been mentioned. Here's some more: a lot of
> > Beam usage is probably through automation (such as k8s, cron, Airflow, etc)
> > where a user is not present when a pipeline is launched. Logging would do
> > nothing in these cases, except in case of a failure being debugged. I would
> > guess this is the common case. The more a user is actually using Beam in
> > production, the less likely they are watching job startup logs. Probably
> > many companies use Beam to build a platform for their own users, so
> > analytics may not capture the number of actual users in any meaningful way.
> > Etc.
> >
> > Certainly, having a sense of the impact of changes like "deprecate
> > Python 2" or "make breaking change to pipeline options for old FlinkRunner"
> > would be extremely useful, both to us and to our users. We just need to be
> > careful. And we must be ready to accept if this is not possible to learn in
> > an OK way.
> >
> > I agree with Brian. Download statistics could be a good start for some
> > broad questions. We could consider tailoring our convenience binaries to
> > facilitate better data gathering, such as a separate py3 pypi coordinate.
> > Download stats on released container images could be a way to do this
> > without inconveniencing users.
> >
> > Kenn
> >
> > On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov  wrote:
> >>
> >> Creating a central place for collecting Beam usage sounds compelling,
> >> but we'd have to be careful about several aspects:
> >> - It goes without saying that this can never be on-by-default, even for
> >> a tiny fraction of pipelines.
> >> - For further privacy protection, including the user's PipelineOptions
> >> is probably out of the question too: people might be including very
> >> sensitive data in their PipelineOptions (such as database passwords) and we
> >> wouldn't want to end up storing that data even due to a user's mistake. The
> >> only data that can be stored is data that Beam developers can guarantee is
> >> never sensitive, or data intentionally authored by a human for the purpose
> >> of reporting it to us (e.g. a hand-typed feedback message).
> >> - If it requires the user manually clicking the link, then it will not
> >> collect data about automated invocations of any pipelines, whereas likely
> >> almost all practical invocations are automated - the difference between
> >> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
> >> - Moreover, many practical invocations likely go through an
> >> intermediate library / product, such as scio or talend. There'd need to be
> >> a story for library developers to offer this capability to their users.
> >> - The condition "was feedback reported for this pipeline", regardless
> >> of whether it is reported manually (by clicking the link) or automatically
> >> (by explicitly enabling some flag), heavily biases the sample - people are
> >> unlikely to click the link if the pipeline works fine (and almost all
> >> production pipelines work fine, otherwise they wouldn't be in production),
> >> and I don't know what considerations would prompt somebody to enable the
> >> flag for a periodic production pipeline. Meaning, the collected data likely
> >> can not be reliably used for any aggregation/counting, except for picking
> >> out interesting individual examples for case studies.
> >> - Measures should be taken to ensure that people don't accidentally
> >> enable it in their quick-running direct runner unit tests, causing lots of
> >> traffic.
> >> - I would not dismiss the possibility of spam and attacks.
> >>
> >>

Re: Collecting feedback for Beam usage

2019-09-26 Thread Robert Bradshaw
Pypi download statistics are freely available at
https://pypistats.org/packages/apache-beam . (To answer the original
question, nearly all Python 2 at this point, but starting to show a
drop.)
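
If anyone wants to pull these numbers programmatically, pypistats also
exposes a JSON API. A minimal sketch (assuming Java 11+ for
java.net.http, and the /api/packages/<pkg>/python_major endpoint, which
breaks downloads down by Python major version):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BeamDownloadStats {
      public static void main(String[] args) throws Exception {
        // Fetch apache-beam downloads broken down by Python major
        // version, as reported by pypistats.org.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(
                "https://pypistats.org/api/packages/apache-beam/python_major"))
            .header("Accept", "application/json")
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // Raw JSON: one entry per (date, major version) with a count.
        System.out.println(response.body());
      }
    }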

I think the goal is to get more/orthogonal coverage than a twitter
poll or waiting for users to speak up on the lists. Getting accurate
stats (without violating many of the principles we all find valuable)
would be much more difficult, if even possible. In this sense, the
bias against a large number of production/automated runs doesn't hurt
the goal of capturing the attention (needed if it's opt-in) of a large
number of developers.

On Tue, Sep 24, 2019 at 9:15 PM Kenneth Knowles  wrote:
>
> Agreeing with many things here and my own flavor to the points:
> 1. User's privacy is more important than anything else
> 2. The goal should be to make things better for users
> 3. Trading user's opt-in for functionality (like Gradle scans) is not 
> acceptable
> 4. It should be effectively invisible to users who are not interested
> 5. Ideally, we could find some people with expertise in (a) data gathering 
> (b) usability (c) privacy (d) whatever we did not think of because it is not 
> our expertise. So that we have confidence that our results are meaningful and 
> we have done no harm.
>
> Some obvious data biases have been mentioned. Here's some more: a lot of Beam 
> usage is probably through automation (such as k8s, cron, Airflow, etc) where 
> a user is not present when a pipeline is launched. Logging would do nothing 
> in these cases, except in case of a failure being debugged. I would guess 
> this is the common case. The more a user is actually using Beam in 
> production, the less likely they are watching job startup logs. Probably many 
> companies use Beam to build a platform for their own users, so analytics may 
> not capture the number of actual users in any meaningful way. Etc.
>
> Certainly, having a sense of the impact of changes like "deprecate Python 2" 
> or "make breaking change to pipeline options for old FlinkRunner" would be 
> extremely useful, both to us and to our users. We just need to be careful. 
> And we must be ready to accept if this is not possible to learn in an OK way.
>
> I agree with Brian. Download statistics could be a good start for some broad 
> questions. We could consider tailoring our convenience binaries to facilitate 
> better data gathering, such as a separate py3 pypi coordinate. Download stats 
> on released container images could be a way to do this without 
> inconveniencing users.
>
> Kenn
>
> On Tue, Sep 24, 2019 at 4:46 PM Eugene Kirpichov  wrote:
>>
>> Creating a central place for collecting Beam usage sounds compelling, but 
>> we'd have to be careful about several aspects:
>> - It goes without saying that this can never be on-by-default, even for a 
>> tiny fraction of pipelines.
>> - For further privacy protection, including the user's PipelineOptions is 
>> probably out of the question too: people might be including very sensitive 
>> data in their PipelineOptions (such as database passwords) and we wouldn't 
>> want to end up storing that data even due to a user's mistake. The only data 
>> that can be stored is data that Beam developers can guarantee is never 
>> sensitive, or data intentionally authored by a human for the purpose of 
>> reporting it to us (e.g. a hand-typed feedback message).
>> - If it requires the user manually clicking the link, then it will not 
>> collect data about automated invocations of any pipelines, whereas likely 
>> almost all practical invocations are automated - the difference between 
>> COUNT(DISTINCT) and COUNT(*), as far as pipelines go.
>> - Moreover, many practical invocations likely go through an intermediate 
>> library / product, such as scio or talend. There'd need to be a story for 
>> library developers to offer this capability to their users.
>> - The condition "was feedback reported for this pipeline", regardless of 
>> whether it is reported manually (by clicking the link) or automatically (by 
>> explicitly enabling some flag), heavily biases the sample - people are 
>> unlikely to click the link if the pipeline works fine (and almost all 
>> production pipelines work fine, otherwise they wouldn't be in production), 
>> and I don't know what considerations would prompt somebody to enable the 
>> flag for a periodic production pipeline. Meaning, the collected data likely 
>> can not be reliably used for any aggregation/counting, except for picking 
>> out interesting individual examples for case studies.
>> - Measures should be taken to ensure that people don't accidentally enable 
>> it in their quick-running direct runner unit tests, causing lots of traffic.
>> - I would not dismiss the possibility of spam and attacks.
>>
>> I'd recommend to start by listing the questions we're hoping to answer using 
>> the collected feedback, and then judging whether the proposed method indeed 
>> allows answering them while 

Re: No filesystem found for scheme s3 using FileIO.write()

2019-09-26 Thread Koprivica,Preston Blake
Hi Max,

I was able to test that change you suggested. I went ahead and commented on
the Jira:
https://issues.apache.org/jira/browse/BEAM-8303?focusedCommentId=16939002&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16939002
Unless others object, I'm going to move any further discussion to the Jira
to keep a consistent record there.

-Preston

On 9/25/19, 5:27 PM, "Koprivica,Preston Blake"  wrote:

Hi Max,

I assume you were asking me. I'm working on building the project from
source and will then inject the code that you suggested. I'll let you know
if I have any success.

On 9/25/19, 4:05 PM, "Maximilian Michels"  wrote:

I wrote the same reply as you, but then deleted it again before sending
;) Given Preston's full description and a bit of Flink context, it is
pretty much impossible for this to have anything to do with service files.

The issue comes from a coder FileIO uses. The coder depends on the S3
file system (really any custom file system). If we use a native Flink
transformation, e.g. Union, we never get to load the FileSystems code in
the current classloader. However, the coder depends on the FileSystems
initialization code (which by the way has to be run "manually" because
it depends on the pipeline options), so it will error.

Note that FileIO usually consists of a cascade of different transforms
for which parts may execute on different machines. That's why we see
this error during `decode` on a remote host which had not been
initialized yet by one of the other instances of the initialization
code. Probably that is the receiving side of a Reshuffle.

Just to prove this theory, do you mind building Beam and testing your
pipeline with the following line added before line 75?

https://github.com/apache/beam/blob/04dc3c3b14ab780e9736d5f769c6bf2a27a390bb/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/types/CoderTypeInformation.java#L75

FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

We can later pass in the pipeline options to initialize this properly.
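
If rebuilding Beam is inconvenient, a user-side workaround along the same
lines might also help - a rough, untested sketch that forces the
registration on every worker through a DoFn's @Setup hook (the class name
is just illustrative):

    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;

    class EnsureFileSystemsFn<T> extends DoFn<T, T> {
      @Setup
      public void setup() {
        // Runs on each worker before elements are processed, so hosts
        // that only execute a downstream stage still register the s3
        // scheme before any coder calls decode.
        FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element());
      }
    }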

If this is too much, I'll probably give it a try tomorrow using your
example code.

Cheers,
Max

On 25.09.19 11:15, Kenneth Knowles wrote:
> Are you building a fat jar? I don't know if this is your issue. I don't
> know Flink's operation in any close detail, and I'm not sure how it
> relates to what Max has described. But it is a common cause of this kind
> of error.
>
> The registration of the filesystem is here:
> https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystemRegistrar.java#L32
> This results in the built jar for beam-sdks-java-io-amazon-web-services
> having a file META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
> with the line org.apache.beam.sdk.io.aws.s3.S3FileSystemRegistrar
>
> The file META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
> exists in many dependencies, including the core SDK. The default for
> many fat jar tools (maven shade and gradle shadow) is that they
> nondeterministically clobber each other, and you have to add a line like
> "mergeServiceFiles" to your configuration to keep all the registrations.
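>
> As a quick check of what actually survived the merge, something like the
> following (a minimal sketch, class name just illustrative) prints every
> registrar visible on the classpath; if the service files were clobbered,
> S3FileSystemRegistrar will be missing from the output:
>
>   import java.util.ServiceLoader;
>   import org.apache.beam.sdk.io.FileSystemRegistrar;
>
>   public class RegistrarCheck {
>     public static void main(String[] args) {
>       // The same ServiceLoader lookup Beam's FileSystems class performs
>       // at startup to discover registered file systems.
>       for (FileSystemRegistrar r :
>           ServiceLoader.load(FileSystemRegistrar.class)) {
>         System.out.println(r.getClass().getName());
>       }
>     }
>   }
>
> Run it with the fat jar on the classpath to see what was kept.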
>
> Kenn
>
> On Wed, Sep 25, 2019 at 8:36 AM Maximilian Michels  wrote:
>
> Hey Preston,
>
> I just wrote a reply on the user mailing list. Copying the reply here
> just in case:
>
>
> Your observation seems to be correct. There is an issue with the file
> system registration.
>
> The two types of errors you are seeing, as well as the successful run,
> are just due to the different structure of the generated transforms. The
> Flink scheduler will distribute them differently, which results in some
> pipelines being 

Re: contributor permission in jira and hello

2019-09-26 Thread Canaan Silberberg
There is a PR for this here now: https://github.com/apache/beam/pull/9524 ...if
anyone has a moment to review.

On Tue, Aug 20, 2019 at 1:35 PM Pablo Estrada  wrote:

> Ah this is a great feature. Thanks for looking into it!
>
> On Tue, Aug 20, 2019 at 12:44 AM Ismaël Mejía  wrote:
>
>> Hello Canaan,
>>
>> Welcome! You were added to the contributors role and the ticket was
>> assigned to you too. Now you can also self assign JIRAs if you want to
>> contribute in other areas.
>>
>>
>> On Mon, Aug 19, 2019 at 10:01 PM Canaan Silberberg  wrote:
>> >
>> > Hi all
>> >
>> > I'm working with beam's BigQueryIO over at Etsy Inc. and we're
>> interested in this feature:
>> https://issues.apache.org/jira/browse/BEAM-876
>> >
>> > ...which I see is unassigned. I'd like to implement and contribute it.
>> Could I have permission to self-assign it in Jira? (my jira user name is
>> ziel)
>>
>


Re: Do we know why gradle scans are not working?

2019-09-26 Thread Lukasz Cwik
The summary so far is that the spotless plugin is somehow interfering with
what is being published, in a way that breaks the build scan.

I tried an older version of the spotless plugin and also the latest version
and neither worked. I then tried to remove the spotless plugin[1] and was
able to successfully publish a build scan[2].

Still waiting on an update in the forum about how to keep both spotless and
build scans working.

1:
https://github.com/lukecwik/incubator-beam/commit/1f026c9b94e3f8afc0cd472350b77c1eeef30a83
2: https://gradle.com/s/665zarmdoixh6


On Wed, Sep 25, 2019 at 8:57 AM Lukasz Cwik  wrote:

> I reached out on the Gradle forum
> https://discuss.gradle.org/t/your-build-scan-could-not-be-displayed-what-does-this-mean/33302
>
> On Wed, Sep 25, 2019 at 8:49 AM Łukasz Gajowy  wrote:
>
>> FWIW, I tried doing it locally and observed the same behavior. It works
>> in my other private projects. This is all I know for now.
>>
>> Łukasz
>>
>> On Tue, Sep 24, 2019 at 22:43 Lukasz Cwik  wrote:
>>
>>> Not to my knowledge. Maybe something is down.
>>>
>>> Have you tried running a gradle build locally with --scan?
>>>
>>> On Tue, Sep 24, 2019 at 1:03 PM Valentyn Tymofieiev  wrote:
>>>
 For example, https://gradle.com/s/mpfu3wpz2xfwe says: Your build scan
 could not be displayed.

>>>