I agree with the benefits of auto-creating buckets from an ease-of-use perspective. My counterargument is that the auto-created buckets may not have the right settings for users. A bucket has multiple settings, some required (name, storage class) and some optional (ACL policy, encryption, retention policy, labels). As the number of options increases, our chances of having a good-enough default go down. For example, if a user wants to enable CMEK mode for encryption, they will enable it for their sources and sinks, and will instruct the Dataflow runner to encrypt its in-flight data. Creating a default (non-encrypted) temp bucket for this user would go against the user's intentions. We would not be able to create an encrypted bucket either, because we would not know which encryption keys to use for such a bucket. Our options would be to either not create a bucket at all, or to fail if a temporary bucket was not specified and CMEK mode is enabled.
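To make the second option concrete, here is a minimal sketch of the check, not an actual Beam or Dataflow implementation. The function name, the exception, the parameter names, and the default bucket naming scheme are all hypothetical and chosen only for illustration.

```python
from typing import Optional


class MissingTempLocationError(ValueError):
    """Raised when no temp location is given and a default bucket is unsafe."""


def resolve_temp_location(
    temp_location: Optional[str],
    kms_key: Optional[str],
    project: str,
    region: str,
) -> str:
    """Return the temp location to use, or fail when CMEK rules out a default.

    All names here are hypothetical; this only illustrates the decision logic.
    """
    # An explicitly specified location always wins.
    if temp_location:
        return temp_location

    # With CMEK enabled we cannot know which key a default bucket should use,
    # so fail instead of silently creating a non-encrypted bucket.
    if kms_key:
        raise MissingTempLocationError(
            "CMEK mode is enabled; please specify a temporary bucket explicitly."
        )

    # Otherwise fall back to a default bucket (hypothetical naming scheme).
    return f"gs://example-default-{project}-{region}/tmp"
```

With this shape, a user who passes a key but no temp location gets an explicit error at submission time rather than a default bucket they did not ask for.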
There is a similar issue with the region flag. If unspecified, it defaults to us-central1. This is convenient for new users, but not making that flag required will expose a larger proportion of Dataflow users to events in that specific region.

Robert's suggestion of having a flag to opt in to a default set of GCP convenience flags sounds reasonable. At least users will explicitly acknowledge that certain things are auto-managed for them.

On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <eh...@google.com> wrote:

> Another idea would be to put default bucket preferences in a .beamrc file
> so you don't have to remember to pass it every time (this could also
> contain other default flag values).
>

IMO, the first question is whether auto-creation based on some unconfigurable defaults would happen or not. Once we agree on that, having an rc file vs flags vs supporting both would be a UX question.

>
> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <rober...@google.com> wrote:
>
>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath <chamik...@google.com> wrote:
>> >
>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <kcwea...@google.com> wrote:
>> >>
>> >> I agree with David that at least clearer log statements should be added.
>> >>
>> >> Udi, that's an interesting idea, but I imagine the sheer number of existing flags (including many SDK-specific flags) would make it difficult to implement. In addition, uniform argument names wouldn't necessarily ensure uniform implementation.
>> >>
>> >> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> >>
>> >>
>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <eh...@google.com> wrote:
>> >>>
>> >>> Java SDK creates one regional bucket per project and region combination.
>> >>> So it's not a lot of buckets - no need to auto-clean.
>> >
>> >
>> > Agree that cleanup is not a big issue if we are only creating a single bucket per project and region.
>> > I assume we are creating temporary folders for each pipeline with the same region and project so that they don't conflict (which we clean up).
>> > As others mentioned, we should clearly document this (including the naming of the bucket) and produce a log during pipeline creation.
>> >
>> >>>
>> >>>
>> >>> I agree with Robert that having fewer flags is better.
>> >>> Perhaps what we need is a unifying interface for SDKs that simplifies launching?
>> >>>
>> >>> So instead of:
>> >>> mvn compile exec:java -Dexec.mainClass=<class> -Dexec.args="--runner=DataflowRunner --project=<project> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>> >>> or
>> >>> python -m <module> --runner DataflowRunner --project <project> --temp_location gs://<bucket>/tmp/ <user flags>
>> >
>> > Interesting, probably this should be extended to a generalized CLI for Beam that can be easily installed to execute Beam pipelines?
>>
>> This is starting to get somewhat off-topic from the original question,
>> but I'm not sure the benefits of providing a wrapper to the end user
>> would outweigh the costs of having to learn the wrapper. For Python
>> developers, python -m module, or even python -m path/to/script.py is
>> pretty standard. Java is a bit harder, because one needs to coordinate
>> a build as well, but I don't know how a "./beam java ..." script would
>> gloss over whether one is using maven, gradle, ant, or just has a pile
>> of pre-compiled jars (and would probably have to know a bit about the
>> project layout as well to invoke the right commands).
>>