jgill02 opened a new issue, #26316: URL: https://github.com/apache/beam/issues/26316
### What happened?

Beam Go SDK 2.46, avroio package (with the Dataflow runner).

I believe that [this](https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/io/avroio/avroio.go#L104) line causes an edge-case bug when a column in a .avro file (using OCF) contains nil values. For reference, I encountered this by running a BigQuery `EXPORT DATA OPTIONS` (to Avro) command and then reading the resulting .avro file from GCS in a Dataflow job. Specifically, the snippet below:

```
val := reflect.New(f.Type.T).Interface()
for ar.Scan() {
	var i any
	i, err = ar.Read()
	if err != nil {
		log.Errorf(ctx, "error reading avro row: %v", err)
		continue
	}

	// marshal interface to bytes
	var b []byte
	b, err = json.Marshal(i)
	if err != nil {
		log.Errorf(ctx, "error unmarshalling avro data: %v", err)
		return
	}

	switch reflect.New(f.Type.T).Interface().(type) {
	case *string:
		emit(string(b))
	default:
		if err = json.Unmarshal(b, val); err != nil {
			log.Errorf(ctx, "error unmashalling avro to type: %v", err)
			return
		}
		emit(reflect.ValueOf(val).Elem().Interface())
	}
}
```

`val` is declared outside the `for ar.Scan()` loop. If one record in the .avro file (e.g. "row 1") has a column (e.g. "foo") with a value (e.g. "bar"), and a subsequent record (e.g. "row 2") has nil (i.e. no data) for "foo", then the final `emit(reflect.ValueOf(val).Elem().Interface())` statement emits "bar" for "foo" in "row 2", when {} (an empty value) should be emitted for that column instead.

I believe this is because `val` is a pointer to a single value that is reused on every iteration of the `ar.Scan()` / `ar.Read()` loop. Once an iteration fills some of its fields with data, a later datum that is nil for a given column does not overwrite that column with a nil value, so the previous datum's value persists into the final emit statement.
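The stale-value behaviour described above can be reproduced outside Beam with only `encoding/json`. The following is a minimal sketch, not the avroio code itself: the `Row` type, the `decodeReused` / `decodeFresh` helpers, and the hand-written JSON records are all hypothetical stand-ins, and the records are simplified rather than the exact JSON an Avro reader would produce. It relies on the documented `json.Unmarshal` rule that a JSON `null` has no effect on a non-pointer, non-map, non-slice, non-interface Go value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Row is a hypothetical target type; the issue does not show the real one.
type Row struct {
	Foo string `json:"foo"`
}

// decodeReused mirrors the shape of the loop in the issue: a single value is
// allocated before the loop and reused as the Unmarshal target for each record.
func decodeReused(records []string) ([]Row, error) {
	val := new(Row)
	var out []Row
	for _, r := range records {
		if err := json.Unmarshal([]byte(r), val); err != nil {
			return nil, err
		}
		out = append(out, *val)
	}
	return out, nil
}

// decodeFresh allocates a new zero value for every record, so nothing can
// carry over from a previous iteration.
func decodeFresh(records []string) ([]Row, error) {
	var out []Row
	for _, r := range records {
		val := new(Row)
		if err := json.Unmarshal([]byte(r), val); err != nil {
			return nil, err
		}
		out = append(out, *val)
	}
	return out, nil
}

func main() {
	// Simplified, hand-written records: "row 1" has a value, "row 2" is null.
	records := []string{`{"foo":"bar"}`, `{"foo":null}`}

	reused, err := decodeReused(records)
	if err != nil {
		panic(err)
	}
	fresh, err := decodeFresh(records)
	if err != nil {
		panic(err)
	}

	fmt.Printf("reused target, row 2: %q\n", reused[1].Foo) // "bar" (stale)
	fmt.Printf("fresh target,  row 2: %q\n", fresh[1].Foo)  // "" (zero value)
}
```

In this sketch the only difference between the two helpers is where the target value is allocated, which is the same change being discussed for the avroio loop.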

Moving the `val :=` declaration inside the loop (just below `for ar.Scan() {` and just above `var i any`) mitigates the issue, though I acknowledge that doing so may incur a small performance penalty.

For reference, the Avro files exported by the BigQuery `EXPORT DATA OPTIONS` statement have an OCF header with nullable fields, like so:

`"fields":[{"name":"foo","type":["null","string"],"default":null}]`

### Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

### Issue Components

- [ ] Component: Python SDK
- [ ] Component: Java SDK
- [X] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [X] Component: Google Cloud Dataflow Runner
