[ 
https://issues.apache.org/jira/browse/BEAM-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884850#comment-16884850
 ] 

Valentyn Tymofieiev commented on BEAM-6202:
-------------------------------------------

This issue contributes to postsubmit flakiness, cc: [~udim].

Example manifestation of this error that we may be able to fix in the SDK:

The poll_for_job_completion thread [1] encounters a 503 error:

{noformat}
16:55:27 root: INFO: 2019-07-14T23:38:13.223Z: JOB_MESSAGE_DETAILED: Workers 
have started successfully.
16:55:27 root: DEBUG: Response returned status 503, retrying
16:55:27 root: DEBUG: Retrying request to url 
https://dataflow.googleapis.com/v1b3/projects/apache-beam-testing/locations/us-central1/jobs/2019-07-14_16_36_07-9641014797750228421?alt=json
 after exception HttpError accessing 
<https://dataflow.googleapis.com/v1b3/projects/apache-beam-testing/locations/us-central1/jobs/2019-07-14_16_36_07-9641014797750228421?alt=json>:
 response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 
'application/json; charset=UTF-8', 'date': 'Sun, 14 Jul 2019 23:39:25 GMT', 
'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 
'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 
'transfer-encoding': 'chunked', 'status': '503', 'content-length': '102', 
'-content-encoding': 'gzip'}>, content <{
16:55:27   "error": {
16:55:27     "code": 503,
16:55:27     "message": "Deadline exceeded",
16:55:27     "status": "UNAVAILABLE"
16:55:27   }
16:55:27 }
16:55:27 >
{noformat}

Based on the console logs, execution continues and finishes successfully:

{noformat}
16:55:27 root: INFO: 2019-07-14T23:39:12.556Z: JOB_MESSAGE_BASIC: Worker 
configuration: n1-standard-1 in us-central1-b.
16:55:27 root: INFO: 2019-07-14T23:41:30.701Z: JOB_MESSAGE_BASIC: Executing 
BigQuery import job "dataflow_job_9064238672948471335". You can check its 
status with the bq tool: "bq show -j --project_id=apache-beam-testing 
dataflow_job_9064238672948471335".
16:55:27 root: INFO: 2019-07-14T23:41:41.319Z: JOB_MESSAGE_BASIC: BigQuery 
import job "dataflow_job_9064238672948471335" done.
16:55:27 root: INFO: 2019-07-14T23:41:42.059Z: JOB_MESSAGE_BASIC: Finished 
operation create/Read+write/WriteToBigQuery/NativeWrite
16:55:27 root: INFO: 2019-07-14T23:41:42.137Z: JOB_MESSAGE_DEBUG: Executing 
success step success1
16:55:27 root: INFO: 2019-07-14T23:41:42.280Z: JOB_MESSAGE_DETAILED: Cleaning 
up.
16:55:27 root: INFO: 2019-07-14T23:41:42.337Z: JOB_MESSAGE_DEBUG: Starting 
worker pool teardown.
16:55:27 root: INFO: 2019-07-14T23:41:42.371Z: JOB_MESSAGE_BASIC: Stopping 
worker pool...
{noformat}

However, the Dataflow runner believes that the job is in a failing state and 
attempts to cancel it. By that time, the job has already succeeded: 

{noformat}
16:55:27 root: WARNING: Cancel failed because job 
2019-07-14_16_36_07-9641014797750228421 is already terminated in state DONE.
{noformat}

In this case, we could retry the 503 error a few more times with exponential 
backoff. 
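
A minimal sketch of what retrying a 503 with exponential backoff could look 
like. This is not the SDK's actual code: TransientServiceError, 
call_with_backoff, and all parameter names and defaults are hypothetical and 
chosen for illustration only. The real polling code would need to catch the 
HTTP error type raised by its client library instead.

```python
import random
import time


class TransientServiceError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code):
        super().__init__('HTTP %d' % status_code)
        self.status_code = status_code


def call_with_backoff(fn, max_retries=5, initial_delay_secs=1.0, factor=2.0,
                      max_delay_secs=60.0, retryable_statuses=(503,),
                      sleep=time.sleep):
    """Calls fn(), retrying retryable errors with exponential backoff.

    All names and defaults here are assumptions made for this sketch.
    """
    delay = initial_delay_secs
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientServiceError as e:
            # Give up on non-retryable statuses or when retries are exhausted.
            if (e.status_code not in retryable_statuses
                    or attempt == max_retries):
                raise
            # Sleep for a jittered fraction of the current delay, then grow
            # the delay geometrically up to the cap.
            sleep(delay * random.uniform(0.5, 1.0))
            delay = min(delay * factor, max_delay_secs)
```

With a wrapper like this around the job status request, a single 
"Deadline exceeded" response would delay the poll instead of ending the 
polling thread, and only a persistent failure would propagate to the caller.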

[1] 
https://github.com/apache/beam/blob/f7cbf88f550c8918b99a13af4182d6efa07cd2b5/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L1315

> Gracefully handle exceptions when waiting for Dataflow job completion.
> ----------------------------------------------------------------------
>
>                 Key: BEAM-6202
>                 URL: https://issues.apache.org/jira/browse/BEAM-6202
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Robert Bradshaw
>            Priority: Major
>
> If there is an error when trying to contact the dataflow service in Python's 
> Dataflow.poll_for_job_completion, we may exit the thread prematurely. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
