Kenn - I agree that consistentWithEquals() is redundant w.r.t. structuralValue(), and should be deprecated. I think our mutation detectors are already using structuralValue(), so the work here would be to simply mark the method deprecated, remove all remaining overrides in the SDK, and document that overriding the method is a no-op.
Also agree with Luke that we should document that SerializableCoder should be used only for objects that have a proper equals(). Because we need *some* structural value for mutation detection, and since this coder is non-deterministic, it can't provide any structural value other than the object itself. In this case, I suppose, the work would involve just documentation. Not sure if we need anything extra in the direct runner for verification: won't existing mutation detectors already fire false positives in case these properties are violated? On Tue, Nov 28, 2017 at 11:03 AM Lukasz Cwik <[email protected]> wrote: > I think that at least we should be clear in the documentation for > SerializableCoder and also make sure that the DirectRunner validates the > consistentWithEquals property. > > Optionally one of: > 1) Make a version of SerializableCoder that can be constructed where it > says it is consistentWithEquals and have users register each type with the > CoderRegistry. > 2) Document that users subclass SerializableCoder for all types which are > consistentWtihEquals and also register them with the CoderRegistry. > > > On Mon, Nov 27, 2017 at 5:39 PM, Kenneth Knowles <[email protected]> wrote: > >> What I said is not quite right - there are accidental collisions allowed. >> The "all coders" spec for structural value only requires that encode(a) == >> encode(b) implies sv(a).equals(sv(b)). The converse is not required. For >> example, the nondeterministic SetCoder can use the Set objects themselves >> as structural values, but their encoding may differ. So for determinism it >> is actually a.equals(b) implies encode(a) == encode(b) which in turn >> implies sv(a).equals(sv(b)). Either way, for deterministic coders they all >> coincide. >> >> On Mon, Nov 27, 2017 at 5:23 PM, Kenneth Knowles <[email protected]> wrote: >> >>> To add some flavor, >>> >>> *All coders:* structuralValue(a).equals(structuralValue(b)) if and only >>> if encode(a) == encode(b) >>> >>> *"Consistent with equals" aka injective:* encode(a) == encode(b) >>> implies a.equals(b) >>> >>> *Deterministic:* a.equals(b) implies >>> structuralValue(a).equals(structuralValue(b)) (hence encode(a) == encode(b)) >>> >>> The structural value must always be a legitimate substitute for encoding >>> to allow in-memory GBK to be faster than encoding. >>> >>> IMO we should deprecate and retire "consistent with equals" since >>> overriding it to return `true` is no simpler than overriding >>> structuralValue itself, and it has no purpose other than governing >>> structuralValue. The two obvious choices - encoding or return directly - >>> are trivial, and getting fancy is optional. The check Luke suggests would >>> then just be a test that structuralValue is correct. The mutation detector >>> should perhaps just use the structural value and let the coder itself >>> decide whether or not it needs to encode. >>> >>> Also worth considering the dual perspective that highlights portability: >>> To a portable runner, the elements are (with a couple exceptions) just >>> bytes, and the coders are a way for the SDK to interpret them in order to >>> do its computation. The implied spec that the mutation detector relies on >>> is that serialize(deserialize(x)) == x for these bytes, so if the >>> re-serialized bytes have changed, it assumes the object was mutated. In a >>> sense, if an SDK implements "the identity function" yet returns different >>> bytes, that is a broken identity function because the bytes *are* the >>> element. It is a bit of a strict interpretation, and maybe not so useful >>> when the elements are only really interpreted by a single SDK, as in the >>> case of SerializableCoder. But I'm not sure what other spec is available. >>> >>> Kenn >>> >>> >>> On Mon, Nov 27, 2017 at 4:37 PM, Mairbek Khadikov <[email protected]> >>> wrote: >>> >>>> I'm open to renaming *consistentWithEquals*. >>>> >>>> If I understand the code correctly, when consistentWithEquals returns >>>> true, org.apache.beam.sdk.util.MutationDetectors expects >>>> *a.equals(deserialize(serialize(a))* which I think is reasonable for >>>> SerializableCoder (assuming objects implement equals)*. *Right now, >>>> *serialize(a).equals(serialize(deserialize(serialize(a)))* is expected >>>> and that contradicts *"does not guarantee a deterministic encoding"*. >>>> >>>> On Mon, Nov 27, 2017 at 4:07 PM, Lukasz Cwik <[email protected]> wrote: >>>> >>>>> I think the idea is that SerializableCoder should be updated to expect >>>>> that all values it encodes do implement equals() since this seems to be >>>>> the >>>>> much more common case then classes that don't implement a useful equals. >>>>> It >>>>> would be possible to add a useful check to DirectRunner that any value >>>>> that >>>>> says its consistent with equals actually obeys its contract. >>>>> >>>>> On Mon, Nov 27, 2017 at 4:03 PM, Eugene Kirpichov < >>>>> [email protected]> wrote: >>>>> >>>>>> Not sure where you see the contradiction? consistentWithEquals says >>>>>> "Whenever the encoded bytes of two values are equal, then the original >>>>>> values are equal according to {@code Objects.equals()}." - which is >>>>>> clearly >>>>>> false for Serializable's in general: it's possible that serialized form >>>>>> of >>>>>> "a" and "b" is the same bytes, but !a.equals(b), e.g. if this class does >>>>>> not implement equals() or if it uses reference equality. >>>>>> >>>>>> On Mon, Nov 27, 2017 at 3:55 PM Mairbek Khadikov <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> Currently SerializableCoder#consistentWithEquals returns false, >>>>>>> which contradicts it's own documentation "{@link SerializableCoder} does >>>>>>> not guarantee a deterministic encoding, as Java serialization may >>>>>>> produce >>>>>>> different binary encodings for two equivalent objects". >>>>>>> >>>>>>> In practice, it leads to failures in >>>>>>> org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector. >>>>>>> For >>>>>>> example, some libraries cache hash values >>>>>>> https://github.com/google/protobuf/pull/3939/files#diff-24c442a23a43e041e4f50d324638fedbR142 >>>>>>> >>>>>>> I'm thinking about changing SerializableCoder#consistentWithEquals >>>>>>> to return true, so that the structural value would be the object itself >>>>>>> instead of it the serialized bytes. Thoughts? >>>>>>> >>>>>>> WIP branch https://github.com/mairbek/beam/tree/ser >>>>>>> >>>>>>> Sincerely, >>>>>>> Mairbek >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Mairbek Khadikov | Software Engineer | [email protected] >>>>>>> >>>>>>> >>>>> >>>> >>>> >>>> -- >>>> >>>> Mairbek Khadikov | Software Engineer | [email protected] >>>> >>>> >>> >> >
