The Java SDK creates one regional bucket per project and region combination
<https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L316-L318>,
so it's not a lot of buckets - no need to auto-clean.
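To illustrate why the bucket count stays bounded: because the default name is a pure function of the project and region, re-running pipelines reuses the same bucket rather than creating new ones. The sketch below is illustrative only - the name format and helpers are assumptions, not Beam's actual implementation (see the linked GcpOptions code for the real logic):

```python
def default_bucket_name(project_number: int, region: str) -> str:
    # Illustrative: a deterministic bucket name per (project, region).
    # The real naming scheme lives in GcpOptions in the Java SDK.
    return f"dataflow-staging-{region.lower()}-{project_number}"


def buckets_needed(project_number: int, regions: list) -> set:
    # One bucket per distinct (project, region) pair - the total stays
    # small no matter how many pipelines are launched.
    return {default_bucket_name(project_number, r) for r in regions}
```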
I agree with Robert that having fewer flags is better. Perhaps what we need
is a unifying interface for SDKs that simplifies launching? So instead of:

mvn compile exec:java -Dexec.mainClass=<class> -Dexec.args="--runner=DataflowRunner --project=<project> --gcpTempLocation=gs://<bucket>/tmp <user flags>" -Pdataflow-runner

or

python -m <module> --runner DataflowRunner --project <project> --temp_location gs://<bucket>/tmp/ <user flags>

We could have:

./beam java run <class> --runner=DataflowRunner <user flags>
./beam python run <module> --runner=DataflowRunner <user flags>

where GCP project and temp_location are optional.

On Tue, Jul 23, 2019 at 10:31 AM David Cavazos <dcava...@google.com> wrote:

> I would go for #1 since it's a better user experience, especially for new
> users who don't understand every step involved in staging/deploying. It's
> just another (unnecessary) mental concept they don't have to be aware of.
> Anything that makes it closer to only providing the `--runner` flag without
> any additional flags (by default, but configurable if necessary) is a good
> thing in my opinion.
>
> AutoML already auto-creates a GCS bucket (not configurable, with a global
> name, which has its own downfalls). Other products are already doing this
> to simplify the user experience. I think as long as there's an explicit
> logging statement it should be fine.
>
> If the bucket was not specified and was created: "No --temp_location
> specified, created gs://..."
>
> If the bucket was not specified and was found: "No --temp_location
> specified, found gs://..."
>
> If the bucket was specified, the logging could be omitted since it's
> already explicit from the command line arguments.
>
> On Tue, Jul 23, 2019 at 10:25 AM Chamikara Jayalath <chamik...@google.com>
> wrote:
>
>> Do we clean up auto-created GCS buckets?
>>
>> If there's no good way to clean up, I think it might be better to make
>> this opt-in.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Jul 23, 2019 at 3:25 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> I think having a single, default, auto-created temporary bucket per
>>> project for use in GCP (when running on Dataflow, or running elsewhere
>>> but using GCS, such as for this BQ load files example), though not
>>> ideal, is the best user experience. If we don't want to be
>>> automatically creating such things for users by default, another
>>> option would be a single flag that opts in to such auto-creation
>>> (which could include other resources in the future).
>>>
>>> On Tue, Jul 23, 2019 at 1:08 AM Pablo Estrada <pabl...@google.com>
>>> wrote:
>>> >
>>> > Hello all,
>>> > I recently worked on a transform to load data into BigQuery by
>>> > writing files to GCS and issuing Load File jobs to BQ. I did this for
>>> > the Python SDK [1].
>>> >
>>> > This option requires the user to provide a GCS bucket to write the
>>> > files:
>>> >
>>> > - If the user provides a bucket to the transform, the SDK will use
>>> >   that bucket.
>>> > - If the user does not provide a bucket:
>>> >   - When running on Dataflow, the SDK will borrow the temp_location
>>> >     of the pipeline.
>>> >   - When running on other runners, the pipeline will fail.
>>> >
>>> > The Java SDK has had functionality for file loads into BQ for a long
>>> > time; in particular, when users do not provide a bucket, it attempts
>>> > to create a default bucket [2], and this bucket is used as
>>> > temp_location (which is then used by the BQ File Loads transform).
>>> >
>>> > I do not really like creating GCS buckets on behalf of users. In
>>> > Java, the outcome is that users do not have to pass a --tempLocation
>>> > parameter when submitting jobs to Dataflow - which is a nice
>>> > convenience, but I'm not sure that this is in line with users'
>>> > expectations.
>>> >
>>> > Currently, the options are:
>>> >
>>> > 1. Adding support for bucket auto-creation to the Python SDK.
>>> > 2. Deprecating support for bucket auto-creation in the Java SDK, and
>>> >    printing a warning.
>>> >
>>> > I am personally inclined toward #1. But what do others think?
>>> >
>>> > Best
>>> > -P.
>>> >
>>> > [1] https://github.com/apache/beam/pull/7892
>>> > [2] https://github.com/apache/beam/blob/5b3807be717277e3e6880a760b036fecec3bc95d/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L294-L343
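The bucket-resolution rules Pablo describes (explicit bucket wins; Dataflow borrows temp_location; other runners fail), combined with the logging David suggests, could be sketched in plain Python. Names and signatures here are illustrative, not the actual SDK code:

```python
import logging
from typing import Optional


def resolve_temp_bucket(user_bucket: Optional[str],
                        runner: str,
                        temp_location: Optional[str]) -> str:
    """Illustrative sketch of the fallback rules discussed in this thread.

    1. An explicitly provided bucket always wins (no log needed - it is
       already visible on the command line).
    2. On Dataflow, borrow the pipeline's temp_location.
    3. Otherwise fail, since there is nothing to fall back on.
    """
    if user_bucket:
        return user_bucket
    if runner == "DataflowRunner" and temp_location:
        logging.info("No bucket specified, borrowing temp_location %s",
                     temp_location)
        return temp_location
    raise ValueError(
        "A GCS bucket is required: pass one explicitly or set "
        "--temp_location when running on Dataflow.")
```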
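The unified `./beam <sdk> run` launcher proposed earlier in the thread could be prototyped as a thin wrapper that translates the unified invocation into each SDK's native launch command (and could fill in project/temp_location defaults in one place). Everything below is a sketch of the idea, not an existing Beam tool:

```python
import sys


def build_command(sdk: str, target: str, args: list) -> list:
    """Sketch of './beam <sdk> run <target> <flags>'.

    Maps the unified invocation onto the SDK-specific command line;
    default flags (project, temp_location) would be injected here.
    """
    if sdk == "java":
        return ["mvn", "compile", "exec:java",
                f"-Dexec.mainClass={target}",
                f"-Dexec.args={' '.join(args)}",
                "-Pdataflow-runner"]
    if sdk == "python":
        return [sys.executable, "-m", target] + args
    raise ValueError(f"unknown SDK: {sdk}")
```

A real launcher would hand the result to subprocess and layer in the opt-in bucket auto-creation discussed above, so the convenience lives in one tool instead of each SDK.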