[
https://issues.apache.org/jira/browse/BEAM-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884850#comment-16884850
]
Valentyn Tymofieiev commented on BEAM-6202:
-------------------------------------------
This issue contributes to postsubmit flakiness, cc: [~udim].
An example manifestation of this error that we may be able to fix in the SDK:
the poll_for_job_completion thread [1] encounters a 503:
{noformat}
16:55:27 root: INFO: 2019-07-14T23:38:13.223Z: JOB_MESSAGE_DETAILED: Workers
have started successfully.
16:55:27 root: DEBUG: Response returned status 503, retrying
16:55:27 root: DEBUG: Retrying request to url
https://dataflow.googleapis.com/v1b3/projects/apache-beam-testing/locations/us-central1/jobs/2019-07-14_16_36_07-9641014797750228421?alt=json
after exception HttpError accessing
<https://dataflow.googleapis.com/v1b3/projects/apache-beam-testing/locations/us-central1/jobs/2019-07-14_16_36_07-9641014797750228421?alt=json>:
response: <{'vary': 'Origin, X-Origin, Referer', 'content-type':
'application/json; charset=UTF-8', 'date': 'Sun, 14 Jul 2019 23:39:25 GMT',
'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0',
'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff',
'transfer-encoding': 'chunked', 'status': '503', 'content-length': '102',
'-content-encoding': 'gzip'}>, content <{
16:55:27 "error": {
16:55:27 "code": 503,
16:55:27 "message": "Deadline exceeded",
16:55:27 "status": "UNAVAILABLE"
16:55:27 }
16:55:27 }
16:55:27 >
{noformat}
Based on console logs, execution continues, and finishes successfully:
{noformat}
16:55:27 root: INFO: 2019-07-14T23:39:12.556Z: JOB_MESSAGE_BASIC: Worker
configuration: n1-standard-1 in us-central1-b.
16:55:27 root: INFO: 2019-07-14T23:41:30.701Z: JOB_MESSAGE_BASIC: Executing
BigQuery import job "dataflow_job_9064238672948471335". You can check its
status with the bq tool: "bq show -j --project_id=apache-beam-testing
dataflow_job_9064238672948471335".
16:55:27 root: INFO: 2019-07-14T23:41:41.319Z: JOB_MESSAGE_BASIC: BigQuery
import job "dataflow_job_9064238672948471335" done.
16:55:27 root: INFO: 2019-07-14T23:41:42.059Z: JOB_MESSAGE_BASIC: Finished
operation create/Read+write/WriteToBigQuery/NativeWrite
16:55:27 root: INFO: 2019-07-14T23:41:42.137Z: JOB_MESSAGE_DEBUG: Executing
success step success1
16:55:27 root: INFO: 2019-07-14T23:41:42.280Z: JOB_MESSAGE_DETAILED: Cleaning
up.
16:55:27 root: INFO: 2019-07-14T23:41:42.337Z: JOB_MESSAGE_DEBUG: Starting
worker pool teardown.
16:55:27 root: INFO: 2019-07-14T23:41:42.371Z: JOB_MESSAGE_BASIC: Stopping
worker pool...
{noformat}
However, the Dataflow runner believes that the job is in a failing state and
attempts to cancel it; by this time the job has already succeeded:
{noformat}
16:55:27 root: WARNING: Cancel failed because job
2019-07-14_16_36_07-9641014797750228421 is already terminated in state DONE.
{noformat}
In this case we could retry the 503 error a few more times with exponential
backoff.
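A minimal sketch of what such a retry could look like, using only the standard library. This is not the SDK's actual code: HttpStatusError and call_with_backoff are hypothetical names made up for illustration, and the attempt count, initial delay and growth factor are assumed values, not tuned ones.

```python
import time


class HttpStatusError(Exception):
    """Hypothetical stand-in for an HTTP client error carrying a status code."""

    def __init__(self, status_code, message=""):
        super().__init__(message)
        self.status_code = status_code


def call_with_backoff(fn, max_attempts=5, initial_delay_secs=1.0,
                      factor=2.0, retryable=(503,), sleep=time.sleep):
    """Call fn(), retrying retryable HTTP statuses with exponential backoff.

    All defaults here are assumptions chosen for illustration.
    """
    delay = initial_delay_secs
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except HttpStatusError as e:
            # Re-raise non-retryable errors immediately, and give up
            # once the final attempt has failed.
            if e.status_code not in retryable or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= factor  # Wait longer before each subsequent attempt.
```

The polling thread would wrap its job status request in something like call_with_backoff, so that a transient "Deadline exceeded" does not get treated as a job failure. The sleep parameter is injectable only so the behavior can be exercised without real waiting.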
[1]
https://github.com/apache/beam/blob/f7cbf88f550c8918b99a13af4182d6efa07cd2b5/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L1315
> Gracefully handle exceptions when waiting for Dataflow job completion.
> ----------------------------------------------------------------------
>
> Key: BEAM-6202
> URL: https://issues.apache.org/jira/browse/BEAM-6202
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Robert Bradshaw
> Priority: Major
>
> If there is an error when trying to contact the dataflow service in Python's
> Dataflow.poll_for_job_completion, we may exit the thread prematurely.