I'll summarize my input to the discussion. It is rather high level. But IMO:

 - even though schemas are part of Beam Java today, I think they should
become part of portability when ready
 - so each type in a schema needs a language-independent &
encoding-independent notion of domain of values and equality (so obviously
equal bytes are equal)
 - embedding in any language (hence Row in Java) must have a schema
type-driven equality that matches this spec
 - also each type (hence Row type) should have portable encoding(s) that
respect this equality so shuffling is consistent
 - Row in Java should be able to decode these encodings to different
underlying representations and change its strategy over time

Kenn

On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <g...@spotify.com> wrote:

> With adding BYTES type, we broke equality. `RowCoder#consistentWithEquals`
> is always true, but this property doesn't hold for exotic types such as
> `Map<BYTES, ?>`, `List<BYTES>`. The root cause is `byte[]`, where `equals`
> is implemented as reference equality instead of structural.
>
> Before we jump into solution mode, let's state what we want to have:
> - *API* have stable API and be able to evolve efficient and use-case
> specific implementations without breaking it
> - *Correctness *we can't trade off correctness, a trivial implementation
> should work
> - *Performance *comparing equality is a fundamental operation, and we
> want to make it cheap
>
> 1. set `consistentWithEquals` if there is BYTES field
> Pros: almost no pros
> Cons: It would introduce a significant number of allocations when
> comparing rows, so we reject this option.
>
> 2. implement custom deep equals in `Row#equals`
> Pros: good performance, doesn't change API, `Row#equals` is correct
> Cons: doesn't work for `Map<byte[], ?>`, unless we roll own implementation
> Cons: it's possible to obtain `List<byte[]>` from `getValue()` that has
> broken equality, contains, etc, unless we roll own implementation
> Cons: non-trivial and requires ~200LOC to implement
>
> 3. wrapping byte[] into Java object with fixed equality (e.g.,
> StructuralByteArray)
> Pros: good performance and flexible to change how Java wrapper is
> implemented
> Pros: simple, doesn't require any specialized collections, no surprises,
> `Map<byte[], ?>` and `List<byte[]>` work.
> Cons: will change the return type of `Row#getValue`
>
> I want to suggest going with option #3. However, it isn't completely clear
> what wrapper we want to use, either it could be StructuralByteArray, or
> ByteBuffer. ByteBuffer is more standard. However, it comes with 4
> additional integer fields. StructuralByteArray doesn't have anything not
> necessary. One option would be adding `Row#getByteBuffer` that would be
> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can
> override it for better performance, but `getValue(i)` must return
> StructuralByteArray.
>
> References:
> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845
> - [BEAM-5646] Fix quality and hashcode for bytes in Row,
> https://github.com/apache/beam/pull/6765
>
> Gleb
>

Reply via email to