Re: UTF-8 passthrough with beam Rows

Steve Niemitz Tue, 30 Nov 2021 10:46:24 -0800

> Last time I ran profiles on our Nexmark benchmarks, this was the top item
> on the profile.


Glad it's not just me!

> String is effectively a logical type with "byte[]" as the input type and
> "String" as the base type.

I think this is conflating the serialization format with the runtime
representation.  You could really say any type is a logical type with a
byte[] backing here.  I do agree, though, that it'd be nice to solve this
more generally: strings are the most noticeable case, but as you said, it
could apply to anything.
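
To make the passthrough idea concrete, here is a minimal sketch of a
lazily decoded string holder in the spirit of Avro's Utf8 class.  LazyUtf8
and its methods are hypothetical names, not part of Beam or Avro, and a
real coder would also write a length prefix, which is left out here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder (not a Beam class): keeps encoded bytes, decodes on demand. */
final class LazyUtf8 implements CharSequence {
  private final byte[] bytes; // UTF-8 bytes exactly as read from the source
  private String decoded;     // memoized result; null until first access

  LazyUtf8(byte[] bytes) {
    this.bytes = bytes;
  }

  LazyUtf8(String s) {
    this.bytes = s.getBytes(StandardCharsets.UTF_8);
    this.decoded = s;
  }

  /** True once the bytes have been decoded into a String. */
  boolean isDecoded() {
    return decoded != null;
  }

  /** Passthrough: writes the stored bytes without ever building a String. */
  void encodeTo(OutputStream out) throws IOException {
    out.write(bytes);
  }

  /** Decodes at most once; later calls return the memoized String. */
  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(bytes, StandardCharsets.UTF_8);
    }
    return decoded;
  }

  @Override
  public int length() {
    return toString().length();
  }

  @Override
  public char charAt(int index) {
    return toString().charAt(index);
  }

  @Override
  public CharSequence subSequence(int start, int end) {
    return toString().subSequence(start, end);
  }
}
```

If a field is never read between decode and encode, the String is never
materialized.  One caveat of this sketch: bytes that are not valid UTF-8
would pass through unvalidated instead of failing at decode time.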

On Tue, Nov 30, 2021 at 1:32 PM Andrew Pilloud <[email protected]> wrote:

> Last time I ran profiles on our Nexmark benchmarks, this was the top item
> on the profile. This affects logical types, it would be nice to have a
> generic solution. String is effectively a logical type with "byte[]" as the
> input type and "String" as the base type.
>
> At one point we had a way to store logical types in either their Input or
> Base type in the row and only convert between the two formats when needed.
> It would be a big improvement to only perform these transforms as needed
> and to be able to pass through columns without conversion. The former API
> was 'Object getValue(...)' which would return either the base type or
> logical type. That approach was fragile (relied on Object) and broken by
> other performance work so it was removed:
> https://github.com/apache/beam/pull/11074
>
> Andrew
>
> On Tue, Nov 30, 2021 at 10:23 AM Steve Niemitz <[email protected]>
> wrote:
>
>> > So we could potentially make sure some sidecar tracking encoded bytes
>> > always gets propagated in the higher-level APIs we provide.
>>
>> I'm curious what that API would look like.  It seems like if users were
>> allowed to provide bytes many of the API surfaces would need to be special
>> cased to then re-convert them to Strings on-the-fly.  On the other hand,
>> replacing the String coder with one that could handle both allows us to
>> handle all of the logic in one place.
>>
>> It is also more complicated than just storing the original bytes on the
>> side because they can be embedded in more complicated collections, for
>> example, a Map<String, List<String>>
>>
>> On Tue, Nov 30, 2021 at 12:49 PM Brian Hulette <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Tue, Nov 30, 2021 at 9:20 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> We already do this sort of thing for many types (avro, pojo, etc.)
>>>>
>>>
>>> IIUC we just use RowWithGetters rather than eagerly decoding - we don't
>>> access the original encoded bytes when encoding the Row.
>>>
>>>
>>>>
>>>> On Tue, Nov 30, 2021 at 8:59 AM Brian Hulette <[email protected]>
>>>> wrote:
>>>>
>>>>> An alternative approach that would be backwards-compatible could be to
>>>>> keep track of the original encoded bytes in the Row object. Unlike Avro we
>>>>> have a lot of control over creating new Row objects, since we discourage
>>>>> users from manipulating Rows directly. So we could potentially make sure
>>>>> some sidecar tracking encoded bytes always gets propagated in the
>>>>> higher-level APIs we provide.
>>>>>
>>>>> On Tue, Nov 30, 2021 at 8:33 AM Steve Niemitz <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > Regardless, the Encode Row step would end up decoding the String
>>>>>> > and then re-encoding it, perhaps you're envisioning we could
>>>>>> > short-circuit this and access the encoded String?
>>>>>> Yes, exactly.  The avro encoder/decoder does something very similar
>>>>>> to this for the same reason.
>>>>>>
>>>>>> On Tue, Nov 30, 2021 at 11:26 AM Brian Hulette <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Sources should generally produce instances of RowWithGetters [1]
>>>>>>> which lazily accesses fields from some underlying object. This should at
>>>>>>> least avoid decoding a String for your first two steps as long as
>>>>>>> it's not accessed. I'm not sure if accesses are memoized though - we
>>>>>>> may be
>>>>>>> re-decoding if the String is accessed multiple times.
>>>>>>>
>>>>>>> Regardless, the Encode Row step would end up decoding the String and
>>>>>>> then re-encoding it, perhaps you're envisioning we could
>>>>>>> short-circuit this and access the encoded String?
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java
>>>>>>>
>>>>>>> On Tue, Nov 30, 2021 at 8:17 AM Reuven Lax <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm intrigued - how do you imagine doing this in RowCoder?
>>>>>>>>
>>>>>>>> On Tue, Nov 30, 2021 at 7:49 AM Steve Niemitz <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> A common use case we're running into with beam rows is
>>>>>>>>> something like:
>>>>>>>>> - Read data from source X
>>>>>>>>> - Convert to Row
>>>>>>>>> - Encode row (generally for xlang)
>>>>>>>>>
>>>>>>>>> In cases like this, I've noticed that we spend a significant
>>>>>>>>> (30%+) amount of time just decoding and re-encoding strings.
>>>>>>>>>
>>>>>>>>> Avro has a nice solution to this with its Utf8 class [1] which
>>>>>>>>> defers decoding the string until actually needed.  I'm curious if
>>>>>>>>> there's been any thought around optimizing this in beam as well?
>>>>>>>>> It doesn't seem like it'd be hard to support it in the RowCoder
>>>>>>>>> implementation right now.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html
>>>>>>>>>
>>>>>>>>
