Thank you for the feedback.  I walked back my initial approach in favour of
Brian's option (2) and implemented a fix for lists as well (
https://github.com/apache/beam/pull/14350).  I agree with the tradeoff
Brian pointed out, as it is consistent with the rest of that component.  If
performance ends up being a problem, BigQueryUtils could have a different
mode for TableRow and Avro.
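
For anyone following along, the gist of option (2) is roughly the following
(a simplified sketch, not the actual code in the PR; only the cases relevant
here are shown):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.avro.Schema;

    // Sketch of the option (2) idea: the Avro-to-TableRow conversion keeps
    // emitting String values (Boolean/Double -> toString()), and Avro arrays
    // become Java Lists of converted elements.
    class AvroToTableRowSketch {
      static Object convertValue(Schema fieldSchema, Object value) {
        switch (fieldSchema.getType()) {
          case ARRAY:
            return ((List<?>) value).stream()
                .map(v -> convertValue(fieldSchema.getElementType(), v))
                .collect(Collectors.toList());
          case BOOLEAN:  // BQ BOOLEAN arrives as java.lang.Boolean
          case DOUBLE:   // BQ FLOAT arrives as java.lang.Double
          default:
            // Stopgap: preserve the String contract BigQueryUtils expects.
            // Real code also needs UNION (nullable), RECORD, BYTES, logical
            // types, etc.
            return value.toString();
        }
      }
    }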

On Tue, Mar 23, 2021 at 1:47 PM Reuven Lax <[email protected]> wrote:

> Logically, all JSON values are strings. We have often put other objects in
> there, which I believe works simply because of the implicit .toString()
> method on those objects, but I'm not convinced this is really correct.
>
> On Tue, Mar 23, 2021 at 6:38 AM Brian Hulette <[email protected]> wrote:
>
>> Thank you for digging into this and figuring out how this bug was
>> introduced!
>>
>> In the long term I think it would be preferable to avoid TableRow
>> altogether in order to do a schema-aware read of Avro data from BQ. We
>> can go directly from Avro GenericRecord to Beam Rows now [1]. This would
>> also be preferable for Arrow, where we could produce Row instances that
>> are references into an underlying Arrow RecordBatch (using
>> RowWithGetters), rather than having to materialize each row to make a
>> TableRow instance.
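>>
>> A rough sketch of what that direct path could look like, using the
>> AvroUtils helpers from the core SDK (illustrative only; a real read would
>> still need to thread the schema through the source):
>>
>>     import org.apache.avro.generic.GenericRecord;
>>     import org.apache.beam.sdk.schemas.Schema;
>>     import org.apache.beam.sdk.schemas.utils.AvroUtils;
>>     import org.apache.beam.sdk.values.Row;
>>
>>     class AvroToRowSketch {
>>       // Convert one exported Avro record straight to a Beam Row, skipping
>>       // the GenericRecord -> TableRow -> Row detour.
>>       static Row toRow(GenericRecord record) {
>>         Schema beamSchema = AvroUtils.toBeamSchema(record.getSchema());
>>         return AvroUtils.toBeamRowStrict(record, beamSchema);
>>       }
>>     }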
>>
>> For a short-term fix there are two options, both of which came up in
>> Reuven's comments on BEAM-9613:
>>
>>> However if we expect to get Long, Double,etc. objects in the TableRow,
>>> then this mapping code needs to handle those objects. Handling them
>>> directly would be more efficient - converting to a String would simply be a
>>> stopgap "one-line" fix.
>>
>>
>> 1. Handle them directly [in BigQueryUtils]; this is what you've done in
>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb
>> 2. Convert to a String [in BigQueryAvroUtils]
>>
>> I don't have a strong preference, but I think (2) is a cleaner, albeit
>> less performant, solution. It seems that BigQueryUtils expects all values
>> in TableRow instances to be String instances. Since BigQueryAvroUtils is
>> just a shim to convert GenericRecord to TableRow for use in BigQueryUtils,
>> it should comply with that interface, rather than making BigQueryUtils work
>> around the discrepancy.
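>>
>> To make the tradeoff concrete, the two options look roughly like this on
>> the consuming side (an illustrative sketch, not the actual BigQueryUtils
>> code; the "flag" field name is made up):
>>
>>     import com.google.api.services.bigquery.model.TableRow;
>>
>>     class ShortTermFixSketch {
>>       // Option (2): BigQueryAvroUtils keeps emitting Strings, so the
>>       // TableRow consumer can keep parsing uniformly.
>>       static boolean readOption2(TableRow row) {
>>         return Boolean.parseBoolean((String) row.get("flag"));
>>       }
>>
>>       // Option (1): the consumer handles the typed objects coming off
>>       // the Avro path directly.
>>       static boolean readOption1(TableRow row) {
>>         Object raw = row.get("flag");
>>         return raw instanceof Boolean
>>             ? (Boolean) raw
>>             : Boolean.parseBoolean((String) raw);
>>       }
>>     }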
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/AvroRecordSchema.java
>>
>> On Mon, Mar 22, 2021 at 5:59 PM Matthew Ouyang <[email protected]>
>> wrote:
>>
>>> I'm working on fixing BEAM-9613 because I encountered this issue as a
>>> result of using BigQueryIO.readTableRowsWithSchema():
>>>
>>>    1. BEAM-7526 provided support for Lists and Maps that came from the
>>>    JSON export format.
>>>    2. BEAM-2879 switched the export format from JSON to Avro.  This
>>>    caused BEAM-9613, since Avro no longer treats BQ BOOLEAN and FLOAT as
>>>    a Java String but rather as Java Boolean and Double (see the short
>>>    illustration after this list).
>>>    3. The switch from JSON to Avro also introduced an issue with fields
>>>    in BQ REPEATED mode, because fields of this mode are represented
>>>    differently in the two export formats.
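>>>
>>> To illustrate item 2, the failure mode is roughly the following (a
>>> simplified sketch; the "flag" field name is made up):
>>>
>>>     import com.google.api.services.bigquery.model.TableRow;
>>>
>>>     public class Beam9613Illustration {
>>>       public static void main(String[] args) {
>>>         // JSON export path: BQ BOOLEAN / FLOAT values arrive in the
>>>         // TableRow as Strings.
>>>         TableRow jsonRow = new TableRow().set("flag", "true");
>>>         String ok = (String) jsonRow.get("flag");    // fine
>>>
>>>         // Avro export path: the same fields arrive as java.lang.Boolean
>>>         // / java.lang.Double, so code that assumes the String contract
>>>         // breaks.
>>>         TableRow avroRow = new TableRow().set("flag", true);
>>>         String boom = (String) avroRow.get("flag");  // ClassCastException
>>>       }
>>>     }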
>>>
>>> I have a simple fix to handle BQ BOOLEAN and FLOAT (
>>> https://github.com/mouyang/beam/commit/326b291ab333c719a9f54446c34611581ea696eb),
>>> but I'm a bit uncomfortable with it because:
>>>
>>>    1. It would mix the JSON and Avro export formats,
>>>    2. BEAM-8933, while still in progress, would introduce Arrow and risk
>>>    a regression,
>>>    3. I haven't made a fix for REPEATED mode yet, but tests that use
>>>    BigQueryUtilsTest.BQ_ARRAY_ROW would have to change (
>>> https://github.com/apache/beam/blob/e039ca28d6f806f30b87cae82e6af86694c171cd/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtilsTest.java#L306-L311).
>>>    I'm not sure whether I should change them, because I don't know
>>>    whether Beam wishes to continue supporting the JSON format.
>>>
>>> I'd like to incorporate these changes in my project as soon as possible,
>>> but I also need guidance on next steps that would be in line with the
>>> general direction of the project.  I'm looking forward to any feedback.
>>> Thanks.
>>>
>>
