There is an indirect connection to RowCoder because `MapCoder` isn't deterministic, therefore, this doesn't hold:
> - also each type (hence Row type) should have portable encoding(s) that respect this equality so shuffling is consistent I think it's a requirement only for rows we want to shuffle by. > About these specific use cases, how useful is it to support Map<byte[], ?> and List<byte[]>? Not sure about Map, but in BigQuery it's possible to define `ARRAY<BYTES>` type. It can group by BYTES, but not by ARRAYS. On Mon, Oct 29, 2018 at 5:42 PM Anton Kedin <ke...@google.com> wrote: > About these specific use cases, how useful is it to support Map<byte[], ?> > and List<byte[]>? These seem pretty exotic (maybe they aren't) and I wonder > whether it would make sense to just reject them until we have a solid > design. > > And wouldn't the same problems arise even without RowCoder? Is the path in > that case to implement a custom coder? > > Regards, > Anton > > > On Mon, Oct 29, 2018 at 9:05 AM Kenneth Knowles <k...@apache.org> wrote: > >> I'll summarize my input to the discussion. It is rather high level. But >> IMO: >> >> - even though schemas are part of Beam Java today, I think they should >> become part of portability when ready >> - so each type in a schema needs a language-independent & >> encoding-independent notion of domain of values and equality (so obviously >> equal bytes are equal) >> - embedding in any language (hence Row in Java) must have a schema >> type-driven equality that matches this spec >> - also each type (hence Row type) should have portable encoding(s) that >> respect this equality so shuffling is consistent >> - Row in Java should be able to decode these encodings to different >> underlying representations and change its strategy over time >> >> Kenn >> >> On Mon, Oct 29, 2018 at 8:08 AM Gleb Kanterov <g...@spotify.com> wrote: >> >>> With adding BYTES type, we broke equality. >>> `RowCoder#consistentWithEquals` is always true, but this property doesn't >>> hold for exotic types such as `Map<BYTES, ?>`, `List<BYTES>`. The root >>> cause is `byte[]`, where `equals` is implemented as reference equality >>> instead of structural. >>> >>> Before we jump into solution mode, let's state what we want to have: >>> - *API* have stable API and be able to evolve efficient and use-case >>> specific implementations without breaking it >>> - *Correctness *we can't trade off correctness, a trivial >>> implementation should work >>> - *Performance *comparing equality is a fundamental operation, and we >>> want to make it cheap >>> >>> 1. set `consistentWithEquals` if there is BYTES field >>> Pros: almost no pros >>> Cons: It would introduce a significant number of allocations when >>> comparing rows, so we reject this option. >>> >>> 2. implement custom deep equals in `Row#equals` >>> Pros: good performance, doesn't change API, `Row#equals` is correct >>> Cons: doesn't work for `Map<byte[], ?>`, unless we roll own >>> implementation >>> Cons: it's possible to obtain `List<byte[]>` from `getValue()` that has >>> broken equality, contains, etc, unless we roll own implementation >>> Cons: non-trivial and requires ~200LOC to implement >>> >>> 3. wrapping byte[] into Java object with fixed equality (e.g., >>> StructuralByteArray) >>> Pros: good performance and flexible to change how Java wrapper is >>> implemented >>> Pros: simple, doesn't require any specialized collections, no surprises, >>> `Map<byte[], ?>` and `List<byte[]>` work. >>> Cons: will change the return type of `Row#getValue` >>> >>> I want to suggest going with option #3. However, it isn't completely >>> clear what wrapper we want to use, either it could be StructuralByteArray, >>> or ByteBuffer. ByteBuffer is more standard. However, it comes with 4 >>> additional integer fields. StructuralByteArray doesn't have anything not >>> necessary. One option would be adding `Row#getByteBuffer` that would be >>> `ByteBuffer.wrap(getValue(i).getValues())`, specialized implementation can >>> override it for better performance, but `getValue(i)` must return >>> StructuralByteArray. >>> >>> References: >>> - [BEAM-5866] Fix `Row#equals`, https://github.com/apache/beam/pull/6845 >>> - [BEAM-5646] Fix quality and hashcode for bytes in Row, >>> https://github.com/apache/beam/pull/6765 >>> >>> Gleb >>> >> -- Cheers, Gleb