[
https://issues.apache.org/jira/browse/BEAM-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163529#comment-16163529
]
Reuven Lax commented on BEAM-2858:
----------------------------------
I just reproduced this and verified it does not cause data loss. The load job
fails (on query) with a 409. The message is:
Error encountered during job execution:
Not found: URI
gs://bigquery_beam_testing_regional/temp/BigQueryWriteTemp/c7fb6a3d06fa4ceab662f83488cc6d31/c5db57f8-9cc0-4cad-9a7e-9c56cb572177
However, this is still a critical bug. Streaming jobs are blocked forever
because the job fails on every retry. Batch jobs will retry several times and
then fail.
> temp file garbage collection in BigQuery sink should be in a separate DoFn
> --------------------------------------------------------------------------
>
> Key: BEAM-2858
> URL: https://issues.apache.org/jira/browse/BEAM-2858
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-gcp
> Affects Versions: 2.1.0
> Reporter: Reuven Lax
> Assignee: Reuven Lax
> Fix For: 2.2.0
>
> Attachments: delete_file_diff.txt
>
>
> Currently the WriteTables transform deletes the set of input files as soon as
> the load() job completes. However, this is incorrect: if the task fails
> partway through deleting the files (e.g., if the worker crashes), the task
> will be retried. If the write disposition is WRITE_TRUNCATE, bad things can
> result.
> The resulting behavior depends on what BigQuery does when one of the input
> files is missing (because we had previously deleted it). In the best case,
> BigQuery fails the load; the step then keeps failing until the runner
> finally fails the entire job. If, however, BigQuery ignores the missing
> file, the load will overwrite the previously-written table with the smaller
> set of files and the job will succeed. This is the worst-case scenario, as
> it results in data loss.
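The failure mode described above can be sketched without Beam at all. The
following simulation uses hypothetical helper names (`load_with_truncate`,
`buggy_load_then_delete`) and an in-memory dict standing in for GCS; it is an
illustration of the retry-ordering problem, not the actual WriteTables code:

```python
# Illustrative simulation (assumed names, not Beam's WriteTables):
# deleting temp files inside the same retryable step as the load means a
# crash mid-delete leaves the retry with missing inputs.

def load_with_truncate(table, temp_files, available):
    """Simulate a WRITE_TRUNCATE load: replace `table` with the rows from
    every temp file, failing if any file is missing (best-case BQ behavior)."""
    missing = [f for f in temp_files if f not in available]
    if missing:
        raise FileNotFoundError(missing)
    table.clear()
    table.extend(available[f] for f in temp_files)

def buggy_load_then_delete(table, temp_files, available, crash_at):
    """Load and delete inside one retryable step; crash partway through
    deletion, as a worker might."""
    load_with_truncate(table, temp_files, available)
    for i, f in enumerate(temp_files):
        if i == crash_at:
            raise RuntimeError("worker crashed mid-delete")
        del available[f]

files = ["gs://tmp/a", "gs://tmp/b"]
store = {"gs://tmp/a": "row-a", "gs://tmp/b": "row-b"}
table = []

# First attempt: the load commits, then the worker dies mid-delete.
try:
    buggy_load_then_delete(table, files, store, crash_at=1)
except RuntimeError:
    pass  # the runner retries the whole step

# Retry: one input file is already gone, so the load fails again --
# exactly the "fails on every retry" behavior reported in BEAM-2858.
try:
    load_with_truncate(table, files, store)
    outcome = "silent data loss"  # worst case: BQ ignores the missing file
except FileNotFoundError:
    outcome = "load fails on every retry"

# Moving deletion into a separate, downstream cleanup step means it only
# starts after the load step has fully committed, so retrying either step
# alone is harmless.
```

Running the sketch shows the retry hitting `FileNotFoundError`, while the
table keeps the rows from the first successful load, matching the observation
that the bug blocks the job rather than losing data when BigQuery rejects the
missing URI.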
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)