https://issues.apache.org/jira/browse/BEAM-3279 has been filed for consistentWithEquals https://issues.apache.org/jira/browse/BEAM-3385 has been filed for SerializableCoder
On Tue, Nov 28, 2017 at 1:20 PM, Eugene Kirpichov <[email protected]> wrote: > Kenn - I agree that consistentWithEquals() is redundant w.r.t. > structuralValue(), and should be deprecated. I think our mutation detectors > are already using structuralValue(), so the work here would be to simply > mark the method deprecated, remove all remaining overrides in the SDK, and > document that overriding the method is a no-op. > > Also agree with Luke that we should document that SerializableCoder should > be used only for objects that have a proper equals(). Because we need > *some* structural value for mutation detection, and since this coder is > non-deterministic, it can't provide any structural value other than the > object itself. In this case, I suppose, the work would involve just > documentation. > > Not sure if we need anything extra in the direct runner for verification: > won't existing mutation detectors already fire false positives in case > these properties are violated? > > On Tue, Nov 28, 2017 at 11:03 AM Lukasz Cwik <[email protected]> wrote: > >> I think that at least we should be clear in the documentation for >> SerializableCoder and also make sure that the DirectRunner validates the >> consistentWithEquals property. >> >> Optionally one of: >> 1) Make a version of SerializableCoder that can be constructed where it >> says it is consistentWithEquals and have users register each type with the >> CoderRegistry. >> 2) Document that users subclass SerializableCoder for all types which are >> consistentWtihEquals and also register them with the CoderRegistry. >> >> >> On Mon, Nov 27, 2017 at 5:39 PM, Kenneth Knowles <[email protected]> wrote: >> >>> What I said is not quite right - there are accidental collisions >>> allowed. The "all coders" spec for structural value only requires that >>> encode(a) == encode(b) implies sv(a).equals(sv(b)). The converse is not >>> required. For example, the nondeterministic SetCoder can use the Set >>> objects themselves as structural values, but their encoding may differ. So >>> for determinism it is actually a.equals(b) implies encode(a) == encode(b) >>> which in turn implies sv(a).equals(sv(b)). Either way, for deterministic >>> coders they all coincide. >>> >>> On Mon, Nov 27, 2017 at 5:23 PM, Kenneth Knowles <[email protected]> wrote: >>> >>>> To add some flavor, >>>> >>>> *All coders:* structuralValue(a).equals(structuralValue(b)) if and >>>> only if encode(a) == encode(b) >>>> >>>> *"Consistent with equals" aka injective:* encode(a) == encode(b) >>>> implies a.equals(b) >>>> >>>> *Deterministic:* a.equals(b) implies >>>> structuralValue(a).equals(structuralValue(b)) >>>> (hence encode(a) == encode(b)) >>>> >>>> The structural value must always be a legitimate substitute for >>>> encoding to allow in-memory GBK to be faster than encoding. >>>> >>>> IMO we should deprecate and retire "consistent with equals" since >>>> overriding it to return `true` is no simpler than overriding >>>> structuralValue itself, and it has no purpose other than governing >>>> structuralValue. The two obvious choices - encoding or return directly - >>>> are trivial, and getting fancy is optional. The check Luke suggests would >>>> then just be a test that structuralValue is correct. The mutation detector >>>> should perhaps just use the structural value and let the coder itself >>>> decide whether or not it needs to encode. >>>> >>>> Also worth considering the dual perspective that highlights >>>> portability: To a portable runner, the elements are (with a couple >>>> exceptions) just bytes, and the coders are a way for the SDK to interpret >>>> them in order to do its computation. The implied spec that the mutation >>>> detector relies on is that serialize(deserialize(x)) == x for these bytes, >>>> so if the re-serialized bytes have changed, it assumes the object was >>>> mutated. In a sense, if an SDK implements "the identity function" yet >>>> returns different bytes, that is a broken identity function because the >>>> bytes *are* the element. It is a bit of a strict interpretation, and maybe >>>> not so useful when the elements are only really interpreted by a single >>>> SDK, as in the case of SerializableCoder. But I'm not sure what other spec >>>> is available. >>>> >>>> Kenn >>>> >>>> >>>> On Mon, Nov 27, 2017 at 4:37 PM, Mairbek Khadikov <[email protected]> >>>> wrote: >>>> >>>>> I'm open to renaming *consistentWithEquals*. >>>>> >>>>> If I understand the code correctly, when consistentWithEquals returns >>>>> true, org.apache.beam.sdk.util.MutationDetectors expects >>>>> *a.equals(deserialize(serialize(a))* which I think is reasonable for >>>>> SerializableCoder (assuming objects implement equals)*. *Right now, >>>>> *serialize(a).equals(serialize(deserialize(serialize(a)))* is >>>>> expected and that contradicts *"does not guarantee a deterministic >>>>> encoding"*. >>>>> >>>>> On Mon, Nov 27, 2017 at 4:07 PM, Lukasz Cwik <[email protected]> wrote: >>>>> >>>>>> I think the idea is that SerializableCoder should be updated to >>>>>> expect that all values it encodes do implement equals() since this seems >>>>>> to >>>>>> be the much more common case then classes that don't implement a useful >>>>>> equals. It would be possible to add a useful check to DirectRunner that >>>>>> any >>>>>> value that says its consistent with equals actually obeys its contract. >>>>>> >>>>>> On Mon, Nov 27, 2017 at 4:03 PM, Eugene Kirpichov < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Not sure where you see the contradiction? consistentWithEquals says >>>>>>> "Whenever the encoded bytes of two values are equal, then the original >>>>>>> values are equal according to {@code Objects.equals()}." - which is >>>>>>> clearly >>>>>>> false for Serializable's in general: it's possible that serialized form >>>>>>> of >>>>>>> "a" and "b" is the same bytes, but !a.equals(b), e.g. if this class does >>>>>>> not implement equals() or if it uses reference equality. >>>>>>> >>>>>>> On Mon, Nov 27, 2017 at 3:55 PM Mairbek Khadikov <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> Currently SerializableCoder#consistentWithEquals returns false, >>>>>>>> which contradicts it's own documentation "{@link SerializableCoder} >>>>>>>> does >>>>>>>> not guarantee a deterministic encoding, as Java serialization may >>>>>>>> produce >>>>>>>> different binary encodings for two equivalent objects". >>>>>>>> >>>>>>>> In practice, it leads to failures in org.apache.beam.sdk.util. >>>>>>>> MutationDetectors$CodedValueMutationDetector. For example, some >>>>>>>> libraries cache hash values https://github.com/ >>>>>>>> google/protobuf/pull/3939/files#diff-24c442a23a43e041e4f50d324638fe >>>>>>>> dbR142 >>>>>>>> >>>>>>>> I'm thinking about changing SerializableCoder#consistentWithEquals >>>>>>>> to return true, so that the structural value would be the object itself >>>>>>>> instead of it the serialized bytes. Thoughts? >>>>>>>> >>>>>>>> WIP branch https://github.com/mairbek/beam/tree/ser >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Mairbek >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> Mairbek Khadikov | Software Engineer | [email protected] >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Mairbek Khadikov | Software Engineer | [email protected] >>>>> >>>>> >>>> >>> >>
