I agree with the benefits of auto-creating buckets from an ease-of-use
perspective. My counterargument is that the auto-created buckets may not
have the right settings for the users. A bucket has multiple settings, some
required (name, storage class) and some optional (ACL policy, encryption,
retention policy, labels). As the number of options increases, our chances
of having a good enough default go down. For example, if a user wants to
enable CMEK mode for encryption, they will enable it for their sources and
sinks, and will instruct the Dataflow runner to encrypt its in-flight data.
Creating a default (non-encrypted) temp bucket for this user would be
against the user's intentions. We would not be able to create an encrypted
bucket either, because we would not know what encryption keys to use for
such a bucket. Our options would be to either not create a bucket at all,
or fail if a temporary bucket was not specified and CMEK mode is enabled.
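
To make the CMEK case concrete, here is a minimal sketch of the "fail
rather than create" option. The function and parameter names
(resolve_temp_location, kms_key, default_bucket) are hypothetical and are
not part of any Beam or Dataflow API; this only illustrates the decision,
not an actual implementation.

```python
from typing import Optional


def resolve_temp_location(temp_location: Optional[str],
                          kms_key: Optional[str],
                          default_bucket: str) -> str:
    """Pick a temp location, refusing a default bucket when CMEK is set.

    All names here are illustrative assumptions, not real Beam options.
    """
    if temp_location:
        # The user made an explicit choice; respect it.
        return temp_location
    if kms_key:
        # We cannot know which key an auto-created bucket should use,
        # so fail instead of silently creating one with default settings.
        raise ValueError(
            'A temp location must be specified when a CMEK key is set.')
    # No CMEK requested: falling back to a default bucket is acceptable.
    return 'gs://%s/tmp' % default_bucket
```

With this shape, a user who sets a key but forgets the temp location gets
an error at submission time instead of an unexpected bucket.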

There is a similar issue with the region flag. If unspecified, it defaults
to us-central1. This is convenient for new users, but by not making that
flag required we expose a larger proportion of Dataflow users to events in
that specific region.

Robert's suggestion of having a flag to opt in to a default set of GCP
convenience flags sounds reasonable. At least users will explicitly
acknowledge that certain things are auto-managed for them.
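
As a rough sketch of what such an opt-in could look like: the flag name
--use_gcp_defaults below is made up for illustration and is not an existing
Beam option, and the parsing is plain argparse rather than Beam's
PipelineOptions. Without the opt-in, missing values are an error; with it,
the region falls back to us-central1.

```python
import argparse


def parse_gcp_options(argv):
    """Parse options, applying defaults only when the user opts in."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--region', default=None)
    parser.add_argument('--temp_location', default=None)
    # Hypothetical flag name; not an existing Beam pipeline option.
    parser.add_argument('--use_gcp_defaults', action='store_true')
    args = parser.parse_args(argv)

    missing = [name for name in ('region', 'temp_location')
               if getattr(args, name) is None]
    if missing and not args.use_gcp_defaults:
        raise ValueError(
            'Missing required options: %s. Pass --use_gcp_defaults to '
            'accept auto-managed values.' % ', '.join(missing))
    if args.region is None:
        # Explicitly acknowledged default region.
        args.region = 'us-central1'
    # temp_location is left as None here; choosing or creating a bucket
    # is outside the scope of this sketch.
    return args
```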

On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:

> Another idea would be to put default bucket preferences in a .beamrc file
> so you don't have to remember to pass it every time (this could also
> contain other default flag values).
>

IMO, the first question is whether auto-creation based on some
unconfigurable defaults would happen or not. Once we agree on that, having
an rc file vs flags vs supporting both would be a UX question.


> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>> <chamik...@google.com> wrote:
>> >
>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kcwea...@google.com>
>> wrote:
>> >>
>> >> I agree with David that at least clearer log statements should be
>> added.
>> >>
>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>> existing flags (including many SDK-specific flags) would make it difficult
>> to implement. In addition, uniform argument names wouldn't necessarily
>> ensure uniform implementation.
>> >>
>> >> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com
>> >>
>> >>
>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>> >>>
>> >>> Java SDK creates one regional bucket per project and region
>> combination.
>> >>> So it's not a lot of buckets - no need to auto-clean.
>> >
>> >
>> > Agree that cleanup is not a big issue if we are only creating a single
>> bucket per project and region. I assume we are creating temporary folders
>> for each pipeline with the same region and project so that they don't
>> conflict (which we clean up).
>> > As others mentioned we should clearly document this (including the
>> naming of the bucket) and produce a log during pipeline creation.
>> >
>> >>>
>> >>>
>> >>> I agree with Robert that having fewer flags is better.
>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies
>> launching?
>> >>>
>> >>> So instead of:
>> >>> mvn compile exec:java -Dexec.mainClass=<class>
>> -Dexec.args="--runner=DataflowRunner --project=<project>
>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>> >>> or
>> >>> python -m <module> --runner DataflowRunner --project <project>
>> --temp_location gs://<bucket>/tmp/ <user flags>
>> >
>> > Interesting, probably this should be extended to a generalized CLI for
>> Beam that can be easily installed to execute Beam pipelines ?
>>
>> This is starting to get somewhat off-topic from the original question,
>> but I'm not sure the benefits of providing a wrapper to the end user
>> would outweigh the costs of having to learn the wrapper. For Python
>> developers, python -m module, or even python path/to/script.py is
>> pretty standard. Java is a bit harder, because one needs to coordinate
>> a build as well, but I don't know how a "./beam java ..." script would
>> gloss over whether one is using maven, gradle, ant, or just has a pile
>> of pre-compiled jars (and would probably have to know a bit about the
>> project layout as well to invoke the right commands).
>>
>
