Oskar Firlej created BEAM-14383:
-----------------------------------

             Summary: Improve "FailedRows" errors returned by 
beam.io.WriteToBigQuery
                 Key: BEAM-14383
                 URL: https://issues.apache.org/jira/browse/BEAM-14383
             Project: Beam
          Issue Type: Improvement
          Components: io-py-gcp
            Reporter: Oskar Firlej


The `WriteToBigQuery` transform returns `errors` when it tries to insert rows 
that do not match the BigQuery table schema. `errors` is a dictionary that 
contains one `FailedRows` key. `FailedRows` is a list of tuples where each 
tuple has two elements: the BigQuery table name and the row that didn't match 
the schema.

This can be verified by running the BigQueryIO dead-letter pattern:
https://beam.apache.org/documentation/patterns/bigqueryio/
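
For reference, a minimal sketch of that pattern as it works today (the 
project, dataset, table and schema below are placeholders, not taken from the 
pattern docs):

    import apache_beam as beam
    from apache_beam.io.gcp.bigquery_tools import RetryStrategy

    with beam.Pipeline() as p:
        rows = p | 'Create' >> beam.Create([
            {'name': 'ok_row', 'value': 1},
            {'name': 'bad_row', 'value': 'not an int'},  # will not match the schema
        ])

        errors = rows | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',      # assumed table spec
            schema='name:STRING,value:INTEGER',
            method='STREAMING_INSERTS',
            insert_retry_strategy=RetryStrategy.RETRY_NEVER)

        # Today each element of errors['FailedRows'] is a (table, row) tuple;
        # the reason for the failed insert only shows up in the worker logs.
        _ = (errors['FailedRows']
             | 'PrintFailedRows' >> beam.Map(
                 lambda err: print('table: {}, row: {}'.format(err[0], err[1]))))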

Using this approach I can print the failed rows in a pipeline. When the job 
runs, the logger simultaneously prints out the reason why the rows were 
invalid, but that reason is not part of the `FailedRows` output. The reason 
should be included in the tuple in addition to the BigQuery table and the raw 
row, so that the next pipeline step can process both the invalid row and the 
reason it is invalid.
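
If the reason were added as a third tuple element, a downstream step could 
route it together with the row, for example into a dead-letter table. This is 
only a hypothetical sketch of the proposed shape, not the current API; the 
dead-letter table and its schema are made up for illustration:

    # Hypothetical: each FailedRows element becomes a (table, row, reason) tuple.
    def to_deadletter_record(failed):
        table, row, reason = failed
        return {'table': table, 'row': str(row), 'reason': str(reason)}

    _ = (errors['FailedRows']
         | 'FormatDeadLetter' >> beam.Map(to_deadletter_record)
         | 'WriteDeadLetter' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.deadletter',   # assumed dead-letter table
             schema='table:STRING,row:STRING,reason:STRING',
             method='STREAMING_INSERTS'))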

During my research I found a couple of alternative solutions, but I think 
they are more complex than they need to be. That is why I explored the Beam 
source code and found that this would be an easy, simple change.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
