On Tue, Jul 23, 2019 at 1:10 PM Kyle Weaver <[email protected]> wrote:
> I agree with David that at least clearer log statements should be added.
>
> Udi, that's an interesting idea, but I imagine the sheer number of
> existing flags (including many SDK-specific flags) would make it
> difficult to implement. In addition, uniform argument names wouldn't
> necessarily ensure uniform implementation.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | [email protected]
>
> On Tue, Jul 23, 2019 at 11:56 AM Udi Meiri <[email protected]> wrote:
>
>> The Java SDK creates one regional bucket per project and region combination
>> <https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>.
>> So it's not a lot of buckets - no need to auto-clean.
>>

Agreed that cleanup is not a big issue if we are only creating a single
bucket per project and region. I assume we are creating temporary folders
for each pipeline with the same region and project so that they don't
conflict (and those we do clean up). As others mentioned, we should
clearly document this (including the naming of the bucket) and produce a
log message during pipeline creation.

>> I agree with Robert that having fewer flags is better.
>> Perhaps what we need is a unifying interface for SDKs that simplifies
>> launching?
>>
>> So instead of:
>>
>> mvn compile exec:java -Dexec.mainClass=<class>
>> -Dexec.args="--runner=DataflowRunner --project=<project>
>> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner
>>
>> or
>>
>> python -m <module> --runner DataflowRunner --project <project>
>> --temp_location gs://<bucket>/tmp/ <user flags>
>>

Interesting - perhaps this should be extended to a generalized CLI for
Beam that can be easily installed to execute Beam pipelines?
Thanks,
Cham

>> We could have:
>>
>> ./beam java run <class> --runner=DataflowRunner <user flags>
>> ./beam python run <module> --runner=DataflowRunner <user flags>
>>
>> where GCP project and temp_location are optional.
>>
>> On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <[email protected]> wrote:
>>
>>> I would go for #1 since it's a better user experience, especially for
>>> new users who don't understand every step involved in
>>> staging/deploying. It's just another (unnecessary) mental concept they
>>> shouldn't have to be aware of. Anything that gets closer to only
>>> providing the `--runner` flag without any additional flags (by
>>> default, but configurable if necessary) is a good thing in my opinion.
>>>
>>> AutoML already auto-creates a GCS bucket (not configurable, with a
>>> global name, which has its own downsides). Other products are already
>>> doing this to simplify the user experience. I think as long as there's
>>> an explicit logging statement it should be fine:
>>>
>>> If the bucket was not specified and was created: "No --temp_location
>>> specified, created gs://..."
>>>
>>> If the bucket was not specified and was found: "No --temp_location
>>> specified, found gs://..."
>>>
>>> If the bucket was specified, the logging could be omitted since it's
>>> already explicit from the command line arguments.
>>>
>>> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <[email protected]> wrote:
>>>
>>>> Do we clean up auto-created GCS buckets?
>>>>
>>>> If there's no good way to clean up, I think it might be better to
>>>> make this opt-in.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> I think having a single, default, auto-created temporary bucket per
>>>>> project for use in GCP (when running on Dataflow, or running
>>>>> elsewhere but using GCS, such as for this BQ load files example),
>>>>> though not ideal, is the best user experience.
>>>>> If we don't want to be automatically creating such things for users
>>>>> by default, another option would be a single flag that opts in to
>>>>> such auto-creation (which could include other resources in the
>>>>> future).
>>>>>
>>>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <[email protected]> wrote:
>>>>> >
>>>>> > Hello all,
>>>>> > I recently worked on a transform to load data into BigQuery by
>>>>> > writing files to GCS and issuing Load File jobs to BQ. I did this
>>>>> > for the Python SDK[1].
>>>>> >
>>>>> > This transform requires the user to provide a GCS bucket to write
>>>>> > the files:
>>>>> >
>>>>> > If the user provides a bucket to the transform, the SDK will use
>>>>> > that bucket.
>>>>> > If the user does not provide a bucket:
>>>>> > - When running on Dataflow, the SDK will borrow the temp_location
>>>>> >   of the pipeline.
>>>>> > - When running on other runners, the pipeline will fail.
>>>>> >
>>>>> > The Java SDK has had functionality for file loads into BQ for a
>>>>> > long time; in particular, when users do not provide a bucket, it
>>>>> > attempts to create a default bucket[2], and this bucket is used as
>>>>> > the temp_location (which is then used by the BQ File Loads
>>>>> > transform).
>>>>> >
>>>>> > I don't really like creating GCS buckets on behalf of users. In
>>>>> > Java, the outcome is that users do not have to pass a
>>>>> > --tempLocation parameter when submitting jobs to Dataflow - which
>>>>> > is a nice convenience, but I'm not sure that this is in line with
>>>>> > users' expectations.
>>>>> >
>>>>> > Currently, the options are:
>>>>> >
>>>>> > 1. Adding support for bucket auto-creation to the Python SDK.
>>>>> > 2. Deprecating support for bucket auto-creation in the Java SDK,
>>>>> >    and printing a warning.
>>>>> >
>>>>> > I am personally inclined toward #1. But what do others think?
>>>>> >
>>>>> > Best
>>>>> > -P.
>>>>> >
>>>>> > [1] https://github.com/apache/beam/pull/7892
>>>>> > [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
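
[Editor's note: the "one bucket per project and region" behavior plus the
two log lines David proposed could be sketched roughly as below. This is
not Beam's actual implementation - the function name, the bucket naming
scheme, and the FakeGcsClient are all hypothetical; a real version would
use the GCS API and the Java SDK's actual naming convention in
GcpOptions.java linked above.]

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")


def get_or_create_temp_location(client, project, region, temp_location=None):
    """Return a GCS temp location, creating a default bucket if needed.

    `client` is anything exposing lookup_bucket(name) and
    create_bucket(name, location=...); a fake is used below for
    illustration instead of a real google-cloud-storage client.
    """
    if temp_location:
        # User was explicit; no log message needed.
        return temp_location
    # Hypothetical naming scheme: one bucket per project/region pair.
    name = "beam-temp-%s-%s" % (project, region.lower())
    if client.lookup_bucket(name) is None:
        client.create_bucket(name, location=region)
        logging.info("No --temp_location specified, created gs://%s", name)
    else:
        logging.info("No --temp_location specified, found gs://%s", name)
    return "gs://%s/tmp" % name


class FakeGcsClient:
    """In-memory stand-in for a GCS client (illustration only)."""

    def __init__(self):
        self.buckets = {}

    def lookup_bucket(self, name):
        return self.buckets.get(name)

    def create_bucket(self, name, location=None):
        self.buckets[name] = {"location": location}
        return self.buckets[name]


client = FakeGcsClient()
# First pipeline in this project/region creates the bucket; later
# pipelines find and reuse it, so at most one bucket exists per pair.
first = get_or_create_temp_location(client, "my-project", "US-CENTRAL1")
second = get_or_create_temp_location(client, "my-project", "US-CENTRAL1")
assert first == second == "gs://beam-temp-my-project-us-central1/tmp"
```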

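[Editor's note: Udi's proposed `./beam <sdk> run` wrapper could be
sketched as a thin dispatcher that assembles the underlying launch
command for each SDK. This is purely illustrative - no such tool exists
in Beam, and a real version would also need to handle SDK-specific
details such as the -Pdataflow-runner Maven profile and default
project/temp_location discovery.]

```python
import argparse


def build_command(sdk, entry_point, passthrough):
    """Assemble the per-SDK launch command from the unified CLI args."""
    if sdk == "java":
        # Wraps the Maven invocation quoted in the thread.
        return ["mvn", "compile", "exec:java",
                "-Dexec.mainClass=" + entry_point,
                "-Dexec.args=" + " ".join(passthrough)]
    if sdk == "python":
        return ["python", "-m", entry_point] + list(passthrough)
    raise ValueError("unsupported SDK: " + sdk)


def parse(argv):
    parser = argparse.ArgumentParser(prog="beam")
    parser.add_argument("sdk", choices=["java", "python"])
    parser.add_argument("verb", choices=["run"])
    parser.add_argument("entry_point",
                        help="main class (Java) or module (Python)")
    # Unknown flags (--runner, user flags, ...) pass through untouched.
    known, passthrough = parser.parse_known_args(argv)
    return build_command(known.sdk, known.entry_point, passthrough)


# `./beam python run my_pipeline --runner=DataflowRunner`
cmd = parse(["python", "run", "my_pipeline", "--runner=DataflowRunner"])
assert cmd == ["python", "-m", "my_pipeline", "--runner=DataflowRunner"]
```

Kyle's objection above still applies: the hard part is not the
dispatching but mapping the many SDK-specific flag spellings onto one
uniform set.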