Thank you for digging into this and figuring out how this bug was
introduced!

In the long term I think it would be preferable to avoid
TableRow altogether in order to do a schema-aware read of Avro data from
BQ. We can go directly from Avro GenericRecord to Beam Rows now [1]. This
would also be preferable for Arrow, where we could produce Row instances
that are references into an underlying Arrow RecordBatch (using
RowWithGetters), rather than having to materialize each row to make a
TableRow instance.
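
Roughly, for the GenericRecord path (an untested sketch; AvroUtils in
sdks/java/core already has the pieces):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.values.Row;

    class AvroToRow {
      // Derive the Beam schema once from the export's Avro schema, then
      // convert each record directly, with no TableRow detour.
      static Row toRow(GenericRecord record) {
        Schema beamSchema = AvroUtils.toBeamSchema(record.getSchema());
        return AvroUtils.toBeamRowStrict(record, beamSchema);
      }
    }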

For a short-term fix there are two options, both of which came up in
Reuven's comments on BEAM-9613:

> However if we expect to get Long, Double, etc. objects in the TableRow,
> then this mapping code needs to handle those objects. Handling them
> directly would be more efficient - converting to a String would simply be a
> stopgap "one-line" fix.


1. Handle them directly [in BigQueryUtils]; this is what you've done in
https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
2. Convert to a String [in BigQueryAvroUtils]

I don't have a strong preference but I think (2) is a cleaner, albeit less
performant, solution. It seems that BigQueryUtils expects all values in
TableRow instances to be String instances. Since BigQueryAvroUtils is just
a shim to convert GenericRecord to TableRow for use in BigQueryUtils, it
should comply with that interface, rather than making BigQueryUtils work
around the discrepancy.
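
A rough sketch of what (2) could look like (the helper name here is
hypothetical; the real change would live in BigQueryAvroUtils' per-field
conversion):

    import com.google.api.services.bigquery.model.TableFieldSchema;

    class StringifyShim {
      // Hypothetical helper: coerce the Avro-decoded primitive back to
      // the String form the JSON export produced (Reuven's "one-line"
      // stopgap).
      static Object convertPrimitive(TableFieldSchema fieldSchema, Object v) {
        if (v == null) {
          return null;
        }
        switch (fieldSchema.getType()) {
          case "BOOLEAN": // Avro hands us java.lang.Boolean
          case "FLOAT":   // Avro hands us java.lang.Double
            return v.toString(); // e.g. "true", "1.5"
          default:
            return v;
        }
      }
    }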

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
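
On the REPEATED point from your mail: if I'm reading the linked test
right, the two export formats hand BigQueryUtils differently-shaped
lists, roughly (values made up for illustration):

    import com.google.api.services.bigquery.model.TableRow;
    import java.util.Arrays;
    import java.util.Collections;

    class RepeatedShapes {
      // JSON export: each element of a REPEATED field appears to arrive
      // wrapped in a {"v": ...} map (the shape BQ_ARRAY_ROW seems to test).
      static TableRow jsonStyle() {
        return new TableRow()
            .set("ids",
                Arrays.asList(
                    Collections.singletonMap("v", "123"),
                    Collections.singletonMap("v", "124")));
      }

      // Avro export: BigQueryAvroUtils produces a plain list of values,
      // so code expecting the "v" wrapper breaks.
      static TableRow avroStyle() {
        return new TableRow().set("ids", Arrays.asList(123L, 124L));
      }
    }

Whether the "v"-wrapped shape stays supported is really the same question
you raised about continuing support for the JSON format.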

On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]>
wrote:

> I'm working on fixing BEAM-9613 because I encountered this issue as a
> result of using BigQueryIO.readTableRowsWithSchema().
>
>    1. BEAM-7526 provided support for Lists and Maps that came from the
>    JSON export format
>    2. BEAM-2879 switched the export format from JSON to Avro.  This
>    introduced BEAM-9613, since Avro no longer represents BQ BOOLEAN and
>    FLOAT as a Java String but rather as a Java Boolean and Double.
>    3. The switch from JSON to Avro also introduced an issue with fields
>    with BQ REPEATED mode, because fields of this mode are represented
>    differently in the Avro export than in the JSON export.
>
> I have a simple fix to handle BQ BOOLEAN and FLOAT (
> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb)
> however, I'm a bit uncomfortable with it because
>
>    1. This would mix the JSON and Avro export formats,
>    2. BEAM-8933, while still in progress, would introduce Arrow and risk
>    a regression,
>    3. I haven't made a fix for REPEATED mode yet, but tests that use
>    BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>    https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>    I don't know if I should change this because I don't know whether Beam
>    wishes to continue supporting the JSON format.
>
> I'd like to incorporate these changes in my project as soon as possible,
> but I also need guidance on next steps that would be in line with the
> general direction of the project.  I'm looking forward to any feedback.
> Thanks.
>
