[
https://issues.apache.org/jira/browse/BEAM-14364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530946#comment-17530946
]
Darren Norton commented on BEAM-14364:
--------------------------------------
> Happy to look at a PR to improve the performance when you have one. But we
> have to make sure that this works for all cases and does not change the
> behavior of the connector in a backwards incompatible way.
I'm working on optimizing the performance of a Dataflow job as an Apache Beam +
Dataflow user, so I probably won't be the one to open a PR for this specific
issue.
> I believe not having a way to handle/unblock streaming without recreating the
> table (assuming that the table points to something that you can actually
> create) can cause major issues. Draining a job doesn't work too, so you may
> have to forcibly cancel and it can cause data loss.
This is exactly what happened when I deployed the job to my team's development
environment, and why we opened the support case with Svetak.
> are you interested in looking into whether we can expand this functionality
> to include bundle failures ?
Just to clarify: when BigQueryIO writes a record to a non-existent table using
STREAMING_INSERTS with a createDisposition of CREATE_NEVER, does that cause a
bundle failure? I'm mostly a user, so I don't have the familiarity to know why
changing the retry policy didn't stop the infinite retries, but my
understanding is that the bundle failure caused Dataflow to reprocess the
elements in that bundle. Is that accurate?
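For context, here is roughly the shape of the write I have in mind. This is a
sketch only: the transform name, the WRITE_APPEND disposition, and the
retryTransientErrors() policy are assumptions rather than the exact
configuration of our job.

{code:java}
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

class StreamingInsertsSketch {

  // `rows` and `destinations` are assumed to be built elsewhere in the pipeline.
  static PCollection<TableRow> writeAndCollectFailures(
      PCollection<TableRow> rows, DynamicDestinations<TableRow, String> destinations) {

    WriteResult result =
        rows.apply(
            "WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to(destinations)
                .withMethod(Method.STREAMING_INSERTS)
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                // The retry policy mentioned above; per the report, changing it did
                // not stop the retries once the destination table 404'd.
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    // The expectation: rows that BigQuery rejects land here instead of failing
    // the bundle and being reprocessed indefinitely.
    return result.getFailedInserts();
  }
}
{code}

The question above is whether a 404 on the destination table ever reaches
getFailedInserts() with a setup like this, or whether it always surfaces as a
bundle failure.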
> 404s in BigQueryIO don't get output to Failed Inserts PCollection
> -----------------------------------------------------------------
>
> Key: BEAM-14364
> URL: https://issues.apache.org/jira/browse/BEAM-14364
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Reporter: Svetak Vihaan Sundhar
> Assignee: Svetak Vihaan Sundhar
> Priority: P1
> Attachments: ErrorsInPrototypeJob.PNG
>
>
> Given that BigQueryIO is configured to use createDisposition(CREATE_NEVER),
> and the DynamicDestinations class returns "null" for a schema,
> and the table for that destination does not exist in BigQuery,
> when I stream records to BigQuery for that table, then the write should fail
> and the failed rows should appear on the output PCollection for Failed
> Inserts (via getFailedInserts()).
>
> Almost all of the time the table exists beforehand, but given that new
> tables can be created, we want this behavior to be non-explosive to the job.
> However, what we are seeing is that processing completely stops in those
> pipelines and the jobs eventually run out of memory. I feel the appropriate
> action when BigQuery returns a 404 for the table would be to submit those
> failed TableRows to the output PCollection and continue processing as
> normal.
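For illustration, the setup in the "Given" above corresponds to a
DynamicDestinations roughly like the sketch below. The class name, the project
and dataset, and the "table_name" field are hypothetical, not taken from the
reported job.

{code:java}
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Routes each record to a table named by one of its fields and supplies no
// schema, matching the "Given" in the description above.
class PerRecordDestinations extends DynamicDestinations<TableRow, String> {

  @Override
  public String getDestination(ValueInSingleWindow<TableRow> element) {
    // The destination is derived from the record itself, so new tables can appear.
    return (String) element.getValue().get("table_name");
  }

  @Override
  public TableDestination getTable(String destination) {
    // With CREATE_NEVER, if this table does not exist BigQuery answers the
    // streaming insert with a 404 instead of the table being created.
    return new TableDestination(
        "my-project:my_dataset." + destination, "Per-record destination table");
  }

  @Override
  public TableSchema getSchema(String destination) {
    // No schema: none is needed when tables are never created by the sink.
    return null;
  }
}
{code}

With createDisposition(CREATE_NEVER) and STREAMING_INSERTS, pointing such a
destination at a missing table is what triggers the 404 behavior described
above.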