reuvenlax commented on issue #23291: URL: https://github.com/apache/beam/issues/23291#issuecomment-1322579939
This is an interesting situation. The contract with the Storage API is that the descriptor passed in when you open the connection is compatible with the actual BQ table schema, and this descriptor is derived from the schema returned by getSchema. In this case, BigQuery fails when we establish the connection (due to the schema mismatch), before we've sent a single record, which is why nothing goes to the failedInserts collection.

In general, this seems like an unsupported feature. The schema returned by getSchema must be compatible with the actual BQ table schema. If it is not, various things could go wrong in strange ways.

Can you explain how things get out of sync here? Adding new REQUIRED columns to a table is not allowed by BigQuery, so this isn't simply a case where the schema service is returning an old value for the schema.

Another note: if you are calling an external service in getSchema, make sure the value is well cached locally. getSchema is potentially called on every record, so an uncached lookup could cause major performance issues in your pipeline.
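To illustrate the caching advice for getSchema, here is a minimal sketch of a local, per-destination cache with a time-to-live. `SchemaCache`, its `loader` function, and the injectable `clock` are hypothetical names introduced for illustration; they are not part of Beam or BigQueryIO. The sketch assumes the schema can be looked up by a string destination key and that a slightly stale value is acceptable for the chosen TTL.

```java
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Beam class): memoizes schema lookups per destination. */
final class SchemaCache<S> {

  /** A cached schema plus the clock reading at which it was loaded. */
  private static final class Entry<S> {
    final S schema;
    final long loadedAtNanos;

    Entry(S schema, long loadedAtNanos) {
      this.schema = schema;
      this.loadedAtNanos = loadedAtNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry<S>> entries = new ConcurrentHashMap<>();
  private final Function<String, S> loader; // e.g. a call to an external schema service
  private final long ttlNanos;
  private final LongSupplier clock;

  SchemaCache(Function<String, S> loader, Duration ttl) {
    this(loader, ttl, System::nanoTime);
  }

  /** The clock is injectable so that expiry can be exercised deterministically in tests. */
  SchemaCache(Function<String, S> loader, Duration ttl, LongSupplier clock) {
    this.loader = loader;
    this.ttlNanos = ttl.toNanos();
    this.clock = clock;
  }

  /** Returns the cached schema, invoking the loader only on a miss or after the TTL expires. */
  S get(String destination) {
    long now = clock.getAsLong();
    Entry<S> entry =
        entries.compute(
            destination,
            (key, old) ->
                (old == null || now - old.loadedAtNanos >= ttlNanos)
                    ? new Entry<>(loader.apply(key), now)
                    : old);
    return entry.schema;
  }
}
```

In a pipeline, getSchema would delegate to something like `cache.get(tableName)` instead of calling the external service directly. Because DynamicDestinations instances are serialized and shipped to workers, a cache like this would likely need to live in a static or lazily initialized transient field rather than a plain instance field; that part is left out of the sketch.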
