[
https://issues.apache.org/jira/browse/BEAM-11905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363784#comment-17363784
]
Beam JIRA Bot commented on BEAM-11905:
--------------------------------------
This issue was marked "stale-P2" and has not received a public comment in 14
days. It is now automatically moved to P3. If you are still affected by it, you
can comment and move it back to P2.
> GCP DataFlow not cleaning up GCP BigQuery temporary datasets
> ------------------------------------------------------------
>
> Key: BEAM-11905
> URL: https://issues.apache.org/jira/browse/BEAM-11905
> Project: Beam
> Issue Type: Improvement
> Components: io-py-gcp
> Affects Versions: 2.27.0
> Environment: GCP DataFlow
> Reporter: Ying Wang
> Priority: P3
>
> I'm running a number of GCP DataFlow jobs to transform tables in GCP
> BigQuery, and they create temporary datasets that are not deleted when the
> jobs complete successfully. The DataFlow jobs are launched from Airflow /
> GCP Cloud Composer.
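>
> For context, here is a minimal sketch of the kind of pipeline involved; the
> project, bucket, and table names are hypothetical placeholders, not the actual
> job's values. Reading from BigQuery with a query is the step that creates the
> temporary export dataset and table:
>
> ```python
> # Minimal, hypothetical sketch -- project/bucket/table names are placeholders.
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
>
> options = PipelineOptions(
>     runner='DataflowRunner',
>     project='my-project',                 # placeholder
>     region='us-central1',                 # placeholder
>     temp_location='gs://my-bucket/tmp',   # placeholder
> )
>
> with beam.Pipeline(options=options) as p:
>     rows = p | 'Read' >> beam.io.ReadFromBigQuery(
>         query='SELECT * FROM `my-project.my_dataset.my_table`',  # placeholder
>         use_standard_sql=True)
>     # ... transforms and a write back to BigQuery follow here
> ```
>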
> The Composer environment's Airflow UI does not reveal anything useful. In the
> GCP DataFlow console, when I open a job named $BATCH_JOB marked "Status:
> Succeeded" and "SDK version: 2.27.0", drill into a step within that job (and a
> stage within that step), open the Logs panel, filter for "LogLevel: Error",
> and click on a log message, I get this:
>
> ```
> Error message from worker: Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
>     self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
>     desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
>     for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
>     schema, metadata_list = self._export_files(bq)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
>     bq.wait_for_bq_job(job_ref)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
>     job_reference.jobId, job.status.errorResult))
> RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
> ```
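>
> For reference, leftover temporary datasets like the `temp_dataset_<hash>` one
> named in the error above could be cleaned up manually with the
> google-cloud-bigquery client. This is only an illustrative sketch, assuming
> default credentials; the project id is the one that appears in the error
> message:
>
> ```python
> # Illustrative cleanup sketch -- assumes the google-cloud-bigquery client
> # library and default credentials; the project id comes from the error above.
> from google.cloud import bigquery
>
> client = bigquery.Client(project='motorefi-analytics')
>
> for ds in client.list_datasets():
>     if ds.dataset_id.startswith('temp_dataset_'):
>         # delete_contents removes any leftover temp tables inside the dataset
>         client.delete_dataset(ds.reference, delete_contents=True, not_found_ok=True)
>         print(f'Deleted {ds.dataset_id}')
> ```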
>
> I could provide the equivalent REST description of the batch job, but I'm not
> sure whether it would be helpful or whether it contains sensitive information.
>
> I'm not sure whether Beam v2.27.0 is affected by
> https://issues.apache.org/jira/browse/BEAM-6514 or
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am
> using the Python 3.7 SDK v2.27.0 and not the Java SDK.
>
> I'd appreciate any help with this issue.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)