Oskar Firlej created BEAM-14383:
-----------------------------------
Summary: Improve "FailedRows" errors returned by
beam.io.WriteToBigQuery
Key: BEAM-14383
URL: https://issues.apache.org/jira/browse/BEAM-14383
Project: Beam
Issue Type: Improvement
Components: io-py-gcp
Reporter: Oskar Firlej
`WriteToBigQuery` returns `errors` when it tries to insert rows that do not
match the BigQuery table schema. `errors` is a dictionary that contains a
single `FailedRows` key. `FailedRows` is a list of tuples where each tuple has
two elements: the BigQuery table name and the row that did not match the schema.
This can be verified by running the BigQueryIO deadletter pattern:
https://beam.apache.org/documentation/patterns/bigqueryio/
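A minimal sketch of that pattern, to show where `FailedRows` comes from. The
table spec, schema, and input rows are placeholders, not part of this issue:

```python
import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

with beam.Pipeline() as pipeline:
    # Placeholder input: the second row will not match the INTEGER column.
    rows = pipeline | 'Create' >> beam.Create([
        {'name': 'ok', 'year': 2022},
        {'name': 'bad', 'year': 'not-an-int'},
    ])

    # WriteToBigQuery returns a result whose 'FailedRows' entry is a
    # PCollection of (table, row) tuples for rows that failed to insert.
    errors = rows | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        'my-project:my_dataset.my_table',    # placeholder table spec
        schema='name:STRING,year:INTEGER',   # placeholder schema
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        insert_retry_strategy=RetryStrategy.RETRY_NEVER)

    _ = (
        errors['FailedRows']
        | 'PrintFailed' >> beam.Map(lambda err: print('Failed row: {}'.format(err))))
```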
Using this approach I can print the failed rows in a pipeline. When the job
runs, the logger simultaneously prints out the reason why the rows were
invalid. That reason should also be included in the tuple, in addition to the
BigQuery table name and the raw row. This way a downstream transform could
process both the invalid row and the reason it is invalid.
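To illustrate the intent, here is a sketch of what a downstream step could do
if `FailedRows` tuples were extended to `(table, row, reason)` as proposed.
The three-element tuple, the sample reason string, and the dead-letter record
format are hypothetical:

```python
import json
import apache_beam as beam

def to_deadletter_record(failed):
    # Hypothetical tuple shape proposed by this issue: (table, row, reason).
    table, row, reason = failed
    return json.dumps({'table': table, 'row': row, 'reason': reason})

with beam.Pipeline() as pipeline:
    # Stand-in for errors['FailedRows'] with the proposed third element.
    failed_rows = pipeline | 'CreateSample' >> beam.Create([
        ('my-project:my_dataset.my_table',
         {'name': 'bad', 'year': 'not-an-int'},
         'invalid value for INTEGER column year'),
    ])
    _ = (
        failed_rows
        | 'ToDeadletter' >> beam.Map(to_deadletter_record)
        | 'Print' >> beam.Map(print))
```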
During my research I found a couple of alternative solutions, but I think they
are more complex than they need to be. That is why I explored the Beam source
code and found that this would be an easy and simple change.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)