I do not have a strong opinion about this either way. I think this is fundamentally a UX tradeoff between making it easier to get started and potentially creating unwanted/misconfigured items. I do not have data on which option most users would prefer. I believe either option would be fine as long as we are clear in our messaging, logs, and errors.
On Mon, Jun 22, 2020 at 1:48 PM Luke Cwik <[email protected]> wrote:

> I think creating the bucket makes sense since it is an improvement in the
> user experience and simplifies first-time users' setup needs. We should be
> clear in telling users that we are doing this on their behalf.
>
> On Mon, Jun 22, 2020 at 1:26 PM Pablo Estrada <[email protected]> wrote:
>
>> Hi everyone,
>> I've gotten around to making this change, and Udi has been gracious
>> enough to review it [1].
>>
>> I figured we have not fully answered the larger question of whether we
>> would truly like to make this change. Here are some thoughts giving me
>> pause:
>>
>> 1. Appropriate defaults - We are not sure we can select appropriate
>> defaults on behalf of users. (We are erroring out in the case of KMS keys,
>> but what about other properties?)
>> 2. Users have been using Beam's Python SDK the way it is for a long time
>> now: supplying temp_location when running on Dataflow, without a problem.
>> 3. This has billing implications that users may not be fully aware of.
>>
>> The behavior in [1] matches the behavior of the Java SDK (create a bucket
>> when none is supplied AND running on Dataflow), but it still doesn't solve
>> the problem of ReadFromBQ/WriteToBQ from non-Dataflow runners (this can be
>> done in a follow-up change using the default-bucket functionality).
>>
>> My bias in this case is: if it isn't broken, why fix it? I do not know of
>> anyone complaining about the required temp_location flag on Dataflow.
>>
>> I think we can create a default bucket when dealing with BQ outside of
>> Dataflow, but for Dataflow, I think we don't need to fix what's not broken.
>> What do others think?
>>
>> Best,
>> -P.
>>
>> [1] https://github.com/apache/beam/pull/11982
>>
>> On Tue, Jul 23, 2019 at 5:02 PM Ahmet Altay <[email protected]> wrote:
>>
>>> I agree with the benefits of auto-creating buckets from an ease-of-use
>>> perspective.
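To make the proposal above concrete, here is a minimal sketch of the decision logic the thread describes (create a default bucket only when none is supplied AND the pipeline runs on Dataflow, with one bucket per project-and-region pair, as in the Java SDK). The function names and the exact bucket naming scheme are illustrative assumptions, not the actual Beam implementation:

```python
from typing import Optional

def needs_default_bucket(runner: str, temp_location: Optional[str]) -> bool:
    # Matches the behavior described in the thread: only auto-create when
    # running on Dataflow and no temp_location was supplied.
    return runner == "DataflowRunner" and not temp_location

def default_bucket_name(project_number: int, region: str) -> str:
    # Illustrative naming: one bucket per (project, region) pair, so repeated
    # runs reuse the same bucket instead of accumulating new ones.
    return f"dataflow-staging-{region}-{project_number}"
```

A real implementation would then call the GCS API to create the bucket if it does not already exist, and log clearly that it did so on the user's behalf.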
>>> My counterargument is that the auto-created buckets may not
>>> have the right settings for the users. A bucket has multiple settings,
>>> some required (name, storage class) and some optional (ACL policy,
>>> encryption, retention policy, labels). As the number of options increases,
>>> our chances of having a good-enough default go down. For example, if a
>>> user wants to enable CMEK mode for encryption, they will enable it for
>>> their sources and sinks, and will instruct the Dataflow runner to encrypt
>>> its in-flight data. Creating a default (non-encrypted) temp bucket for
>>> this user would be against the user's intentions. We would not be able to
>>> create a bucket either, because we would not know what encryption keys to
>>> use for such a bucket. Our options would be to either not create a bucket
>>> at all, or fail if a temporary bucket was not specified and CMEK mode is
>>> enabled.
>>>
>>> There is a similar issue with the region flag. If unspecified, it
>>> defaults to us-central1. This is convenient for new users, but not making
>>> that flag required will expose a larger proportion of Dataflow users to
>>> events in that specific region.
>>>
>>> Robert's suggestion of having a flag to opt in to a default set of GCP
>>> convenience flags sounds reasonable. At least users will explicitly
>>> acknowledge that certain things are auto-managed for them.
>>>
>>> On Tue, Jul 23, 2019 at 4:28 PM Udi Meiri <[email protected]> wrote:
>>>
>>>> Another idea would be to put default bucket preferences in a .beamrc
>>>> file so you don't have to remember to pass it every time (this could
>>>> also contain other default flag values).
>>>
>>> IMO, the first question is whether auto-creation based on some
>>> unconfigurable defaults would happen or not. Once we agree on that,
>>> having an rc file vs. flags vs. supporting both would be a UX question.
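The "fail fast" option described above (refuse to auto-create a bucket when CMEK is enabled, rather than silently create an unencrypted one) could be sketched as a simple validation step. The function and parameter names here are hypothetical, chosen only to illustrate the check:

```python
from typing import Optional

def validate_temp_location(temp_location: Optional[str],
                           dataflow_kms_key: Optional[str]) -> None:
    # If the user supplied a bucket, trust their configuration as-is.
    if temp_location:
        return
    # A CMEK key with no explicit bucket is the dangerous case: we cannot
    # know which key an auto-created bucket should use, so fail loudly
    # instead of creating a non-encrypted default.
    if dataflow_kms_key:
        raise ValueError(
            "Cannot create a default temp bucket when a CMEK key is set; "
            "please supply --temp_location explicitly.")
    # Otherwise, fall through to default-bucket creation.
```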
>>>> On Tue, Jul 23, 2019 at 1:43 PM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> On Tue, Jul 23, 2019 at 10:26 PM Chamikara Jayalath
>>>>> <[email protected]> wrote:
>>>>> >
>>>>> > On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <[email protected]>
>>>>> > wrote:
>>>>> >>
>>>>> >> I agree with David that at least clearer log statements should be
>>>>> >> added.
>>>>> >>
>>>>> >> Udi, that's an interesting idea, but I imagine the sheer number of
>>>>> >> existing flags (including many SDK-specific flags) would make it
>>>>> >> difficult to implement. In addition, uniform argument names wouldn't
>>>>> >> necessarily ensure uniform implementation.
>>>>> >>
>>>>> >> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
>>>>> >>
>>>>> >> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <[email protected]> wrote:
>>>>> >>>
>>>>> >>> The Java SDK creates one regional bucket per project and region
>>>>> >>> combination. So it's not a lot of buckets - no need to auto-clean.
>>>>> >
>>>>> > Agree that cleanup is not a big issue if we are only creating a
>>>>> > single bucket per project and region. I assume we are creating
>>>>> > temporary folders for each pipeline with the same region and project
>>>>> > so that they don't conflict (which we clean up).
>>>>> > As others mentioned, we should clearly document this (including the
>>>>> > naming of the bucket) and produce a log during pipeline creation.
>>>>> >
>>>>> >>> I agree with Robert that having fewer flags is better.
>>>>> >>> Perhaps what we need is a unifying interface for SDKs that
>>>>> >>> simplifies launching?
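The per-pipeline temporary folders Chamikara assumes (so that concurrent jobs sharing the single per-project-and-region bucket don't conflict, and each job's prefix can be cleaned up on its own) could look roughly like this. The path layout is an assumption for illustration, not Beam's actual scheme:

```python
import uuid

def pipeline_temp_dir(bucket: str, job_name: str) -> str:
    # A unique subdirectory per pipeline run: concurrent jobs in the same
    # shared bucket get disjoint prefixes, and cleanup after the job only
    # needs to delete objects under this one prefix.
    return f"gs://{bucket}/tmp/{job_name}-{uuid.uuid4().hex}/"
```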
>>>>> >>> So instead of:
>>>>> >>>
>>>>> >>> mvn compile exec:java -Dexec.mainClass=<class> \
>>>>> >>>   -Dexec.args="--runner=DataflowRunner --project=<project> \
>>>>> >>>   --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>>>> >>>
>>>>> >>> or
>>>>> >>>
>>>>> >>> python -m <module> --runner DataflowRunner --project <project> \
>>>>> >>>   --temp_location gs://<bucket>/tmp/ <user flags>
>>>>> >
>>>>> > Interesting, probably this should be extended to a generalized CLI
>>>>> > for Beam that can be easily installed to execute Beam pipelines?
>>>>>
>>>>> This is starting to get somewhat off-topic from the original question,
>>>>> but I'm not sure the benefits of providing a wrapper to the end user
>>>>> would outweigh the costs of having to learn the wrapper. For Python
>>>>> developers, python -m <module>, or even python path/to/script.py, is
>>>>> pretty standard. Java is a bit harder, because one needs to coordinate
>>>>> a build as well, but I don't know how a "./beam java ..." script would
>>>>> gloss over whether one is using Maven, Gradle, Ant, or just has a pile
>>>>> of pre-compiled jars (and would probably have to know a bit about the
>>>>> project layout as well to invoke the right commands).
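Robert's point about the hypothetical "./beam java ..." wrapper can be illustrated with a small sketch: the wrapper's very first job would be guessing the build tool from marker files, and each guess implies an entirely different command line. The wrapper itself is purely hypothetical; only the marker filenames are standard:

```python
import os
from typing import Optional

# Marker file -> the (elided) command a wrapper would have to construct.
# Note the gap Robert identifies: a directory of pre-compiled jars has no
# marker at all, so the wrapper simply cannot tell what to do.
BUILD_MARKERS = [
    ("pom.xml", "mvn compile exec:java ..."),
    ("build.gradle", "gradle run ..."),
    ("build.xml", "ant run ..."),
]

def guess_build_command(project_dir: str) -> Optional[str]:
    for marker, command in BUILD_MARKERS:
        if os.path.exists(os.path.join(project_dir, marker)):
            return command
    return None  # pre-compiled jars or an unknown layout: no way to tell
```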
