Re: UTF-8 passthrough with beam Rows

Steve Niemitz Tue, 30 Nov 2021 10:23:00 -0800

> So we could potentially make sure some sidecar tracking encoded bytes
> always gets propagated in the higher-level APIs we provide.


I'm curious what that API would look like.  It seems like if users were
allowed to provide bytes, many of the API surfaces would need to be
special-cased to re-convert them to Strings on the fly.  On the other hand,
replacing the String coder with one that can handle both would let us
handle all of the logic in one place.

It is also more complicated than just storing the original bytes on the
side, because the strings can be embedded in more complicated collections,
for example a Map<String, List<String>>.
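
To make the "handle both" idea concrete, here is a minimal sketch of a
string wrapper in the spirit of Avro's Utf8.  The name LazyUtf8String and
all of its methods are hypothetical; nothing like this exists in Beam
today, and the sketch says nothing about how RowCoder would actually be
wired up.  It only shows the deferral: bytes read from a source are kept
as-is, handed back untouched on encode, and decoded (once) only if someone
asks for the String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical wrapper (not a Beam class): keeps UTF-8 bytes, decodes on demand. */
final class LazyUtf8String {
  private byte[] bytes;   // encoded form; null until first needed
  private String decoded; // decoded form; null until first needed

  private LazyUtf8String(byte[] bytes, String decoded) {
    this.bytes = bytes;
    this.decoded = decoded;
  }

  /** Wraps already-encoded bytes, e.g. straight from a source. No decoding happens here. */
  static LazyUtf8String fromBytes(byte[] utf8) {
    return new LazyUtf8String(utf8.clone(), null);
  }

  /** Wraps a user-supplied String. Encoding is deferred until the bytes are requested. */
  static LazyUtf8String fromString(String s) {
    return new LazyUtf8String(null, s);
  }

  // Internal accessor: encodes at most once and avoids a defensive copy.
  private byte[] encoded() {
    if (bytes == null) {
      bytes = decoded.getBytes(StandardCharsets.UTF_8);
    }
    return bytes;
  }

  /** Returns the encoded form. If built from bytes, this never decodes. */
  byte[] getBytes() {
    return encoded().clone();
  }

  /** True once the String form has been materialized. */
  boolean isDecoded() {
    return decoded != null;
  }

  /** Decodes on first call and memoizes the result. Not thread-safe in this sketch. */
  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(bytes, StandardCharsets.UTF_8);
    }
    return decoded;
  }

  // Equality on encoded bytes so the wrapper can be used as a map key.
  // Assumes well-formed UTF-8; malformed input could compare differently
  // from its decoded String form.
  @Override
  public boolean equals(Object o) {
    return o instanceof LazyUtf8String
        && Arrays.equals(encoded(), ((LazyUtf8String) o).encoded());
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(encoded());
  }
}
```

Because the deferral lives in the value type itself, the nested case above
would become a Map<LazyUtf8String, List<LazyUtf8String>> with no separate
bookkeeping, which is the main reason I'd lean this way over a sidecar.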

On Tue, Nov 30, 2021 at 12:49 PM Brian Hulette <[email protected]> wrote:

>
>
> On Tue, Nov 30, 2021 at 9:20 AM Reuven Lax <[email protected]> wrote:
>
>> We already do this sort of thing for many types (avro, pojo, etc.)
>>
>
> IIUC we just use RowWithGetters rather than eagerly decoding - we don't
> access the original encoded bytes when encoding the Row.
>
>
>>
>> On Tue, Nov 30, 2021 at 8:59 AM Brian Hulette <[email protected]>
>> wrote:
>>
>>> An alternative approach that would be backwards-compatible could be to
>>> keep track of the original encoded bytes in the Row object. Unlike Avro we
>>> have a lot of control over creating new Row objects, since we discourage
>>> users from manipulating Rows directly. So we could potentially make sure
>>> some sidecar tracking encoded bytes always gets propagated in the
>>> higher-level APIs we provide.
>>>
>>> On Tue, Nov 30, 2021 at 8:33 AM Steve Niemitz <[email protected]>
>>> wrote:
>>>
>>>> > Regardless, the Encode Row step would end up decoding the String and
>>>> > then re-encoding it, perhaps you're envisioning we could short-circuit
>>>> > this and access the encoded String?
>>>>
>>>> Yes, exactly.  The avro encoder/decoder does something very similar to
>>>> this for the same reason.
>>>>
>>>> On Tue, Nov 30, 2021 at 11:26 AM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> Sources should generally produce instances of RowWithGetters [1] which
>>>>> lazily accesses fields from some underlying object. This should at least
>>>>> avoid decoding a String for your first two steps as long as it's not
>>>>> accessed. I'm not sure if accesses are memoized though - we may be
>>>>> re-decoding if the String is accessed multiple times.
>>>>>
>>>>> Regardless, the Encode Row step would end up decoding the String and
>>>>> then re-encoding it, perhaps you're envisioning we could short-circuit
>>>>> this and access the encoded String?
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java
>>>>>
>>>>> On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> I'm intrigued - how do you imagine doing this in RowCoder?
>>>>>>
>>>>>> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> A common use case we're running into with beam rows is
>>>>>>> something like:
>>>>>>> - Read data from source X
>>>>>>> - Convert to Row
>>>>>>> - Encode row (generally for xlang)
>>>>>>>
>>>>>>> In cases like this, I've noticed that we spend a significant (30%+)
>>>>>>> amount of time just decoding and re-encoding strings.
>>>>>>>
>>>>>>> Avro has a nice solution to this with its Utf8 class [1] which
>>>>>>> defers decoding the string until actually needed.  I'm curious if
>>>>>>> there's been any thought around optimizing this in beam as well?  It
>>>>>>> doesn't seem like it'd be hard to support it in the RowCoder
>>>>>>> implementation right now.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
>>>>>>>
>>>>>>
