[ https://issues.apache.org/jira/browse/BEAM-11905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363784#comment-17363784 ]

Beam JIRA Bot commented on BEAM-11905:
--------------------------------------

This issue was marked "stale-P2" and has not received a public comment in 14 
days. It is now automatically moved to P3. If you are still affected by it, you 
can comment and move it back to P2.

> GCP DataFlow not cleaning up GCP BigQuery temporary datasets
> ------------------------------------------------------------
>
>                 Key: BEAM-11905
>                 URL: https://issues.apache.org/jira/browse/BEAM-11905
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-py-gcp
>    Affects Versions: 2.27.0
>         Environment: GCP DataFlow
>            Reporter: Ying Wang
>            Priority: P3
>
> I'm running a number of GCP DataFlow jobs that transform tables within GCP 
> BigQuery, and they're creating temporary BigQuery datasets that are not 
> deleted even when the job completes successfully. The DataFlow jobs are 
> launched using Airflow / GCP Cloud Composer.
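>  
> For context, the jobs essentially do the following (a minimal sketch, not the 
> real DAG; the project, query, table, and bucket names are placeholders). As 
> far as I understand, it is this export-based ReadFromBigQuery path that 
> creates the temp_dataset_* datasets in the first place:
>  
> ```python
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
>  
> # Placeholder options; the real jobs are launched from Airflow / Cloud Composer.
> options = PipelineOptions(
>     runner='DataflowRunner',
>     project='my-project',                # placeholder
>     region='us-central1',                # placeholder
>     temp_location='gs://my-bucket/tmp',  # placeholder
> )
>  
> with beam.Pipeline(options=options) as p:
>     (
>         p
>         # Reading with a query stages the results through a temporary
>         # dataset/table before export, which is what gets left behind.
>         | 'ReadSource' >> beam.io.ReadFromBigQuery(
>             query='SELECT * FROM `my-project.dataset.source_table`',  # placeholder
>             use_standard_sql=True)
>         | 'Transform' >> beam.Map(lambda row: row)  # stand-in for the real transform
>         | 'WriteResult' >> beam.io.WriteToBigQuery(
>             'my-project:dataset.output_table',  # placeholder
>             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
>             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
>     )
> ```
>  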
> The Composer environment Airflow UI does not reveal anything. When I go into 
> GCP DataFlow, open a job named $BATCH_JOB that is marked "Status: Succeeded" 
> and "SDK version: 2.27.0", click into a step within that job (and a stage 
> within that step), then open the Logs panel, filter for "LogLevel: Error", 
> and click on a log message, I get this:
>  
> ```
> Error message from worker: Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
>     self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
>     desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
>     for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
>     schema, metadata_list = self._export_files(bq)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
>     bq.wait_for_bq_job(job_ref)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
>     job_reference.jobId, job.status.errorResult))
> RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
> ```
>  
> I could provide the equivalent REST description of the batch job, but I'm not 
> sure whether it would be helpful or whether it contains sensitive information.
>  
> I'm not sure whether Beam v2.27.0 is affected by 
> https://issues.apache.org/jira/browse/BEAM-6514 or 
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am 
> using the Python 3.7 SDK v2.27.0 and not the Java SDK.
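>  
> In the meantime, the workaround I'm considering is a small cleanup step (e.g. 
> an extra Airflow task) that deletes the leftover datasets by prefix. This is 
> only a sketch, assuming the google-cloud-bigquery client library and the 
> temp_dataset_<hex> naming visible in the error above:
>  
> ```python
> from google.cloud import bigquery
>  
> def delete_leftover_temp_datasets(project_id, prefix='temp_dataset_', dry_run=True):
>     """Delete temporary BigQuery datasets left behind by Beam/Dataflow jobs."""
>     client = bigquery.Client(project=project_id)
>     for dataset in client.list_datasets(project=project_id):
>         if not dataset.dataset_id.startswith(prefix):
>             continue
>         if dry_run:
>             print(f'Would delete {dataset.dataset_id}')
>         else:
>             # delete_contents=True also removes the temp tables inside.
>             client.delete_dataset(dataset.reference,
>                                   delete_contents=True,
>                                   not_found_ok=True)
>  
> delete_leftover_temp_datasets('my-project')  # placeholder project id
> ```
>  
> In practice this would also need an age filter (via client.get_dataset(...).created) 
> so it doesn't remove the temporary dataset of a job that is still running.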
>  
> Appreciate any help with this issue.


