Ying Wang created BEAM-11905:
--------------------------------

             Summary: GCP DataFlow not cleaning up GCP BigQuery temporary 
datasets
                 Key: BEAM-11905
                 URL: https://issues.apache.org/jira/browse/BEAM-11905
             Project: Beam
          Issue Type: Bug
          Components: beam-community
    Affects Versions: 2.27.0
         Environment: GCP DataFlow
            Reporter: Ying Wang
            Assignee: Aizhamal Nurmamat kyzy


I'm running a number of GCP DataFlow jobs that transform tables within GCP 
BigQuery, and the jobs are creating temporary BigQuery datasets that are not 
deleted when a job completes successfully. The jobs are launched via Airflow / 
GCP Cloud Composer.
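
For reference, the jobs are roughly of this shape (a simplified sketch with 
placeholder project, region, bucket, and table names, and a trivial stand-in 
transform, not the real pipeline). They read via ReadFromBigQuery, which, as I 
understand it, is the read path that creates the temp_dataset_* datasets:

```python
# Simplified sketch of the kind of job involved; all names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `my-project.my_dataset.source_table`",  # placeholder
            use_standard_sql=True,
        )
        | "Transform" >> beam.Map(lambda row: row)  # stand-in for the real transform
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.dest_table",      # placeholder
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```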

The Airflow UI for the Composer environment doesn't reveal anything. In the 
GCP DataFlow console, if I open a job named $BATCH_JOB that is marked 
"Status: Succeeded" and "SDK version: 2.27.0", drill into a step within that 
job (and a stage within that step), open the Logs window, filter for 
"LogLevel: Error", and click on a log message, I see this:

 

```
Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
    self._split_task)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
    desired_bundle_size)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
    for split in source.split(desired_bundle_size):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
    schema, metadata_list = self._export_files(bq)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
    bq.wait_for_bq_job(job_ref)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
    job_reference.jobId, job.status.errorResult))
RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
```

 

I could provide the equivalent REST description of the batch job, but I'm not 
sure whether it would be helpful or whether it contains sensitive information.

 

I'm not sure whether Beam v2.27.0 is affected by 
https://issues.apache.org/jira/browse/BEAM-6514 or 
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am 
using the Python 3.7 SDK v2.27.0 and not the Java SDK.
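
As a stopgap, I'm deleting the leftover datasets by hand with something like 
the following (a rough sketch using the google-cloud-bigquery client; the 
project ID is a placeholder, and the temp_dataset_ prefix comes from the 
error message above):

```python
# Rough cleanup sketch: delete Beam's leftover temporary BigQuery datasets.
# Assumes the google-cloud-bigquery client library; project ID is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

for dataset in client.list_datasets():
    # The leftover datasets are named temp_dataset_<hex>, as in the error above.
    if dataset.dataset_id.startswith("temp_dataset_"):
        client.delete_dataset(
            dataset.reference, delete_contents=True, not_found_ok=True
        )
        print(f"Deleted {dataset.dataset_id}")
```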

 

Appreciate any help with this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
