[
https://issues.apache.org/jira/browse/BEAM-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157905#comment-16157905
]
Chamikara Jayalath commented on BEAM-2858:
------------------------------------------
I tried this by manually deleting the first file of the gcsURIs passed in at the
following location:
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L157
I have attached the diff.
> temp file garbage collection in BigQuery sink should be in a separate DoFn
> --------------------------------------------------------------------------
>
> Key: BEAM-2858
> URL: https://issues.apache.org/jira/browse/BEAM-2858
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-gcp
> Affects Versions: 2.1.0
> Reporter: Reuven Lax
> Assignee: Chamikara Jayalath
> Fix For: 2.2.0
>
> Attachments: delete_file_diff.txt
>
>
> Currently the WriteTables transform deletes the set of input files as soon as
> the load() job completes. However this is incorrect - if the task fails
> partially through deleting files (e.g. if the worker crashes), the task will
> be retried. If the write disposition is WRITE_TRUNCATE, bad things could
> result.
> The resulting behavior will depend on what BQ does if one of the input files is
> missing (because we had previously deleted it). In the best case, BQ will
> fail the load. In this case the step will keep failing until the runner
> finally fails the entire job. If however BQ ignores the missing file, the
> load will overwrite the previously-written table with the smaller set of
> files and the job will succeed. This is the worst-case scenario, as it will
> result in data loss.
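
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch and does not use Beam or BigQuery: the class, the in-memory `files` map, the `table` list, and all method names are hypothetical stand-ins. The `load` method assumes the worst case from the description (missing input files are silently skipped), which is an assumption made for illustration, not confirmed BigQuery behavior.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TempFileCleanupSketch {
  // Hypothetical stand-in for GCS: file name -> rows stored in that file.
  static final Map<String, List<String>> files = new HashMap<>();
  // Hypothetical stand-in for the destination table.
  static List<String> table = new ArrayList<>();

  // Restores three one-row temp files and an empty table.
  static void reset() {
    files.clear();
    files.put("f1", List.of("a"));
    files.put("f2", List.of("b"));
    files.put("f3", List.of("c"));
    table = new ArrayList<>();
  }

  // WRITE_TRUNCATE-style load: replaces the table with the rows of all listed files.
  // Assumption (worst case): files that no longer exist are silently skipped.
  static void load(List<String> uris) {
    List<String> rows = new ArrayList<>();
    for (String uri : uris) {
      List<String> content = files.get(uri);
      if (content != null) {
        rows.addAll(content);
      }
    }
    table = rows;
  }

  // Deletes files one by one, simulating a worker crash after failAfter deletions.
  static void deleteFiles(List<String> uris, int failAfter) {
    int deleted = 0;
    for (String uri : uris) {
      if (deleted == failAfter) {
        throw new IllegalStateException("simulated worker crash");
      }
      files.remove(uri);
      deleted++;
    }
  }

  // Load and cleanup in one retried unit: a crash during cleanup re-runs the load.
  static int loadAndDeleteInSameStep(List<String> uris) {
    try {
      load(uris);
      deleteFiles(uris, 1); // crashes after deleting one file
    } catch (IllegalStateException e) {
      load(uris); // retry of the whole step now sees fewer files
      deleteFiles(uris, Integer.MAX_VALUE);
    }
    return table.size();
  }

  // Cleanup as its own step: a crash during cleanup retries only the cleanup.
  static int loadThenDeleteInSeparateStep(List<String> uris) {
    load(uris); // first step completes before any file is deleted
    try {
      deleteFiles(uris, 1);
    } catch (IllegalStateException e) {
      deleteFiles(uris, Integer.MAX_VALUE); // the load is not re-run
    }
    return table.size();
  }

  public static void main(String[] args) {
    List<String> uris = List.of("f1", "f2", "f3");
    reset();
    System.out.println("same step: " + loadAndDeleteInSameStep(uris) + " rows");
    reset();
    System.out.println("separate step: " + loadThenDeleteInSeparateStep(uris) + " rows");
  }
}
```

Under these assumptions the program prints "same step: 2 rows" and "separate step: 3 rows": retrying the combined step silently truncates the table to the surviving files, while retrying only the cleanup leaves the loaded table intact.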