Ying Wang created BEAM-11905:
--------------------------------
Summary: GCP DataFlow not cleaning up GCP BigQuery temporary datasets
Key: BEAM-11905
URL: https://issues.apache.org/jira/browse/BEAM-11905
Project: Beam
Issue Type: Bug
Components: beam-community
Affects Versions: 2.27.0
Environment: GCP DataFlow
Reporter: Ying Wang
Assignee: Aizhamal Nurmamat kyzy
I'm running a number of GCP DataFlow jobs (launched via Airflow on GCP Cloud
Composer) that transform tables in GCP BigQuery. The jobs create a number of
temporary BigQuery datasets that are not deleted when a job completes
successfully.
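In the meantime I've been cleaning these up by hand. Below is a minimal sketch
of the cleanup I have in mind; it assumes the google-cloud-bigquery client
library, a placeholder project ID, and that the leftover datasets all carry the
"temp_dataset_" prefix seen in the error further down (that prefix check is my
assumption, not documented behavior):
```python
# Hypothetical cleanup sketch: delete leftover temporary BigQuery datasets.
# ASSUMPTIONS: google-cloud-bigquery is installed, "my-project" is a
# placeholder, and leftovers use the "temp_dataset_" prefix from the error
# below. Verify the matches before deleting anything.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

for item in client.list_datasets():
    if item.dataset_id.startswith("temp_dataset_"):
        # delete_contents=True also drops any temp tables left inside.
        client.delete_dataset(item.reference, delete_contents=True)
        print(f"Deleted {item.dataset_id}")
```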
The Airflow UI in the Composer environment does not reveal anything. In the GCP
DataFlow console, when I open a job named $BATCH_JOB (marked "Status:
Succeeded" and "SDK version: 2.27.0"), drill into a step and then a stage
within that step, open the Logs window, filter for "LogLevel: Error", and click
on a log message, I get this:
```
Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
    self._split_task)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
    desired_bundle_size)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
    for split in source.split(desired_bundle_size):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
    schema, metadata_list = self._export_files(bq)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
    bq.wait_for_bq_job(job_ref)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
    job_reference.jobId, job.status.errorResult))
RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
```
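For context, the reads in these jobs go through the export-based BigQuery
source path shown in the traceback. The pattern is roughly the following (a
sketch, not my exact pipeline; the query string is a placeholder):
```python
# Sketch of the export-based BigQuery read pattern that exercises the
# split() / _export_files() code path in the traceback above.
# ASSUMPTION: "SELECT ..." is a placeholder, not the actual query.
import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | "ReadFromBQ" >> beam.io.Read(
        beam.io.BigQuerySource(query="SELECT ...", use_standard_sql=True)
    )
```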
I could provide the equivalent REST description of the batch job, but I'm not
sure whether it would be helpful or whether it contains sensitive information.
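If that job description does turn out to be useful, this is roughly how it
could be pulled from the Dataflow REST API (a sketch, not output from this job;
it assumes google-api-python-client with application-default credentials, and
PROJECT/REGION/JOB_ID are placeholders):
```python
# Hypothetical sketch: fetch a Dataflow job description via the v1b3 REST API.
# ASSUMPTIONS: google-api-python-client is installed, application-default
# credentials are configured, and PROJECT/REGION/JOB_ID are placeholders.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
job = (
    dataflow.projects()
    .locations()
    .jobs()
    .get(projectId="PROJECT", location="REGION", jobId="JOB_ID")
    .execute()
)
print(job["currentState"])  # e.g. "JOB_STATE_DONE" for a succeeded batch job
```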
I'm not sure whether Beam v2.27.0 is affected by
https://issues.apache.org/jira/browse/BEAM-6514 or
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am
using the Python 3.7 SDK v2.27.0, not the Java SDK.
I'd appreciate any help with this issue.