There is an indirect connection to RowCoder because `MapCoder` isn't
deterministic, therefore, this doesn't hold:

>  - also each type (hence Row type) should have portable encoding(s) that
respect this equality so shuffling is consistent

I think it's a requirement only for rows we want to shuffle by.

> About these specific use cases, how useful is it to support Map<byte[],
?> and List<byte[]>?

Not sure about Map, but in BigQuery it's possible to define `ARRAY<BYTES>`
type. It can group by BYTES, but not by ARRAYS.


On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin <ke...@google.com> wrote:

> About these specific use cases, how useful is it to support Map<byte[], ?>
> and List<byte[]>? These seem pretty exotic (maybe they aren't) and I wonder
> whether it would make sense to just reject them until we have a solid
> design.
>
> And wouldn't the same problems arise even without RowCoder? Is the path in
> that case to implement a custom coder?
>
> Regards,
> Anton
>
>
> On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> I'll summarize my input to the discussion. It is rather high level. But
>> IMO:
>>
>>  - even though schemas are part of Beam Java today, I think they should
>> become part of portability when ready
>>  - so each type in a schema needs a language-independent &
>> encoding-independent notion of domain of values and equality (so obviously
>> equal bytes are equal)
>>  - embedding in any language (hence Row in Java) must have a schema
>> type-driven equality that matches this spec
>>  - also each type (hence Row type) should have portable encoding(s) that
>> respect this equality so shuffling is consistent
>>  - Row in Java should be able to decode these encodings to different
>> underlying representations and change its strategy over time
>>
>> Kenn
>>
>> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <g...@spotify.com> wrote:
>>
>>> With adding BYTES type, we broke equality.
>>> `RowCoder#consistentWithEquals` is always true, but this property doesn't
>>> hold for exotic types such as `Map<BYTES, ?>`, `List<BYTES>`. The root
>>> cause is `byte[]`, where `equals` is implemented as reference equality
>>> instead of structural.
>>>
>>> Before we jump into solution mode, let's state what we want to have:
>>> - *API* have stable API and be able to evolve efficient and use-case
>>> specific implementations without breaking it
>>> - *Correctness *we can't trade off correctness, a trivial
>>> implementation should work
>>> - *Performance *comparing equality is a fundamental operation, and we
>>> want to make it cheap
>>>
>>> 1. set `consistentWithEquals` if there is BYTES field
>>> Pros: almost no pros
>>> Cons: It would introduce a significant number of allocations when
>>> comparing rows, so we reject this option.
>>>
>>> 2. implement custom deep equals in `Row#equals`
>>> Pros: good performance, doesn't change API, `Row#equals` is correct
>>> Cons: doesn't work for `Map<byte[], ?>`, unless we roll own
>>> implementation
>>> Cons: it's possible to obtain `List<byte[]>` from `getValue()` that has
>>> broken equality, contains, etc, unless we roll own implementation
>>> Cons: non-trivial and requires ~200LOC to implement
>>>
>>> 3. wrapping byte[] into Java object with fixed equality (e.g.,
>>> StructuralByteArray)
>>> Pros: good performance and flexible to change how Java wrapper is
>>> implemented
>>> Pros: simple, doesn't require any specialized collections, no surprises,
>>> `Map<byte[], ?>` and `List<byte[]>` work.
>>> Cons: will change the return type of `Row#getValue`
>>>
>>> I want to suggest going with option #3. However, it isn't completely
>>> clear what wrapper we want to use, either it could be StructuralByteArray,
>>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4
>>> additional integer fields. StructuralByteArray doesn't have anything not
>>> necessary. One option would be adding `Row#getByteBuffer` that would be
>>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
>>> override it for better performance, but `getValue(i)` must return
>>> StructuralByteArray.
>>>
>>> References:
>>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
>>> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
>>> https://github.com/apache/beam/pull/6765
>>>
>>> Gleb
>>>
>>

-- 
Cheers,
Gleb

Reply via email to