To add some flavor, *All coders:* structuralValue(a).equals(structuralValue(b)) if and only if encode(a) == encode(b)
*"Consistent with equals" aka injective:* encode(a) == encode(b) implies a.equals(b) *Deterministic:* a.equals(b) implies structuralValue(a).equals(structuralValue(b)) (hence encode(a) == encode(b)) The structural value must always be a legitimate substitute for encoding to allow in-memory GBK to be faster than encoding. IMO we should deprecate and retire "consistent with equals" since overriding it to return `true` is no simpler than overriding structuralValue itself, and it has no purpose other than governing structuralValue. The two obvious choices - encoding or return directly - are trivial, and getting fancy is optional. The check Luke suggests would then just be a test that structuralValue is correct. The mutation detector should perhaps just use the structural value and let the coder itself decide whether or not it needs to encode. Also worth considering the dual perspective that highlights portability: To a portable runner, the elements are (with a couple exceptions) just bytes, and the coders are a way for the SDK to interpret them in order to do its computation. The implied spec that the mutation detector relies on is that serialize(deserialize(x)) == x for these bytes, so if the re-serialized bytes have changed, it assumes the object was mutated. In a sense, if an SDK implements "the identity function" yet returns different bytes, that is a broken identity function because the bytes *are* the element. It is a bit of a strict interpretation, and maybe not so useful when the elements are only really interpreted by a single SDK, as in the case of SerializableCoder. But I'm not sure what other spec is available. Kenn On Mon, Nov 27, 2017 at 4:37 PM, Mairbek Khadikov <[email protected]> wrote: > I'm open to renaming *consistentWithEquals*. > > If I understand the code correctly, when consistentWithEquals returns > true, org.apache.beam.sdk.util.MutationDetectors expects > *a.equals(deserialize(serialize(a))* which I think is reasonable for > SerializableCoder (assuming objects implement equals)*. *Right now, > *serialize(a).equals(serialize(deserialize(serialize(a)))* is expected > and that contradicts *"does not guarantee a deterministic encoding"*. > > On Mon, Nov 27, 2017 at 4:07 PM, Lukasz Cwik <[email protected]> wrote: > >> I think the idea is that SerializableCoder should be updated to expect >> that all values it encodes do implement equals() since this seems to be the >> much more common case then classes that don't implement a useful equals. It >> would be possible to add a useful check to DirectRunner that any value that >> says its consistent with equals actually obeys its contract. >> >> On Mon, Nov 27, 2017 at 4:03 PM, Eugene Kirpichov <[email protected]> >> wrote: >> >>> Not sure where you see the contradiction? consistentWithEquals says >>> "Whenever the encoded bytes of two values are equal, then the original >>> values are equal according to {@code Objects.equals()}." - which is clearly >>> false for Serializable's in general: it's possible that serialized form of >>> "a" and "b" is the same bytes, but !a.equals(b), e.g. if this class does >>> not implement equals() or if it uses reference equality. >>> >>> On Mon, Nov 27, 2017 at 3:55 PM Mairbek Khadikov <[email protected]> >>> wrote: >>> >>>> Hi all, >>>> >>>> Currently SerializableCoder#consistentWithEquals returns false, which >>>> contradicts it's own documentation "{@link SerializableCoder} does not >>>> guarantee a deterministic encoding, as Java serialization may produce >>>> different binary encodings for two equivalent objects". >>>> >>>> In practice, it leads to failures in org.apache.beam.sdk.util.Mutat >>>> ionDetectors$CodedValueMutationDetector. For example, some libraries >>>> cache hash values https://github.com/google/protobuf/pull/3939/files#di >>>> ff-24c442a23a43e041e4f50d324638fedbR142 >>>> >>>> I'm thinking about changing SerializableCoder#consistentWithEquals to >>>> return true, so that the structural value would be the object itself >>>> instead of it the serialized bytes. Thoughts? >>>> >>>> WIP branch https://github.com/mairbek/beam/tree/ser >>>> >>>> Sincerely, >>>> Mairbek >>>> >>>> -- >>>> >>>> Mairbek Khadikov | Software Engineer | [email protected] >>>> >>>> >> > > > -- > > Mairbek Khadikov | Software Engineer | [email protected] > >
