> On June 16, 2014, 9:45 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/client/cli/jobs.py, lines 181-182
> > <https://reviews.apache.org/r/22457/diff/3/?file=609592#file609592line181>
> >
> >     I don't think it's enough to json-serialize a thrift task. This is 
> > bound to set/dict serialization sorting problem we have battled with in 
> > updater.py. Specifically this block: 
> > https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L200-L223
> >     
> >     Without sorting, the diff will only work 99% of the time. This has been 
> > demonstrated in a few real-life job updates.
> 
> Mark Chu-Carroll wrote:
>     This isn't a new thing in this updated diff system - this is the current 
> behavior.
>     
>     If you use the new json-tree-diff, you won't have this problem; but for 
> people who rely on the current diff
>     behavior, I'd rather not change that.
>
> 
> Maxim Khutornenko wrote:
>     | If you use the new json-tree-diff, you won't have this problem; 
>     
>     Perhaps I am missing something but how would the new approach ensure 
> equivalence of these two structs?
>     { 'set1': { ['k1':'v1', 'k2':'v2']} }
>     { 'set2': { ['k2':'v2', 'k1':'v1']} }
>     
>     It's "normal" to see element re-ordering like this during thrift struct 
> serialization.
> 
> Maxim Khutornenko wrote:
>     Here is a real TaskConfig json fragment and its reordered twin:
>     
>     1: u'constraints': [{u'name': u'value', u'constraint': {u'values': [u'1', 
> u'2']}}, {u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
>     2: u'constraints': [{u'constraint': {u'values': [u'2', u'1']}, u'name': 
> u'value'}, {u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
> 
> Mark Chu-Carroll wrote:
>     I don't think that the reordering that you show in your first example 
> isn't valid json. Key-value pairs belong in a dictionary, not a list.  In a 
> dictionary, the comparison routine does the right thing: it compares key by 
> key, without regard to ordering.
>     
>     In the second example: I'd argue that that's actually a bug in our 
> serializer: it's using an ordered list for an unordered constraint. Diff 
> can't tell that the list structure isn't really supposed to be ordered; the 
> serializer should be canonicalizing it.
>
> 
> Maxim Khutornenko wrote:
>     Not sure how else you could represent a python set in JSON without 
> completely changing the underlying data structure (e.g. emitting fake keys 
> for every element and converting set into a dictionary with sorted keys). 
> Agree, the TJSONProtocol could be a bit smarter when dealing with python sets 
> but the reality is that python sets are inherently "unhashable" structs. I 
> have seen cases when completely repr-equivalent python sets were not 
> comparing as equal.
> 
> Mark Chu-Carroll wrote:
>     All the serializer would need to do is just canonicalize when it converts 
> to a list. It's not an issue of being unhashable. But it should at least 
> bother to be *uniform*. If you're converting a set to a list, chose a 
> convention for ordering elements. Put 'em in lexicographic order, problem 
> solved.
>     
>     As it stands, that problem is unsolvable for this diff tool: there's no 
> way to tell that a given field was supposed to be a set, and should be 
> compared without ordering. 
>     
>     I'm not sure what you'd like me to do here, short of giving up and 
> throwing away the new diff tool. I don't see any way to solve this, short of 
> creating something that uses thrift introspection.
>
> 
> Maxim Khutornenko wrote:
>     We solved this problem in updater diff before (see the link above) and I 
> am sure it can be solved here. Converting to sorted tuples does not produce a 
> pretty diff but if we are targeting value-comparison diff output rather than 
> serialized string compare the problem shifts into pretty-fying the sorted 
> tuple output rather than handling diff mechanics.
> 
> Mark Chu-Carroll wrote:
>     Not sure what you mean by "converting to sorted tuples". Is there any 
> documentation of what the "sorted tuple" format looks like? Or do I just need 
> to reverse-engineer? (I hate doing that, because if I make any wrong 
> assumptions, things get ugly.)
>
> 
> Maxim Khutornenko wrote:
>     I have modified the _diff_configs() in updater.py to print out the output 
> and ran "./pants src/test/python/apache/aurora/client/api:updater -s -k 
> test_diff_unordered_configs":
>     
>     ((u'constraints', (((u'constraint', ((u'limit', ((u'limit', 10),)),)), 
> (u'name', u'limit')), ((u'constraint', ((u'values', (u'1', u'2')),)), 
> (u'name', u'value')))), (u'contactEmail', u'f...@bar.com'), (u'diskMb', 1), 
> (u'environment', u'test'), (u'executorConfig', ((u'data', u'test data'), 
> (u'name', u'test'))), (u'jobName', u'jimbob'), (u'maxTaskFailures', 1), 
> (u'metadata', (((u'key', u'k1'), (u'value', u'v1')), ((u'key', u'k2'), 
> (u'value', u'v2')))), (u'numCpus', 1.0), (u'owner', ((u'role', u'mesos'),)), 
> (u'priority', 0), (u'production', 1), (u'ramMb', 1), (u'requestedPorts', 
> (u'142', u'3424', u'45235')), (u'taskLinks', ((u'task1', u'link1'), 
> (u'task2', u'link2')))) 
>     
>     The idea here is to eliminate the ordering unpredictability by replacing 
> all thrift structs with tuples, which are sortable. It sure isn't pretty but 
> it works.
> 
> Mark Chu-Carroll wrote:
>     I understand the reason for doing it; what I still don't understand is 
> just what the format here *is*. Is there any documentation on it? I can make 
> a reasonable guess at just what the "tuplization" format is and what it 
> means, but it's just a reasonable guess, based on looking at sample outputs 
> and reverse engineering.
>     
>     I don't understand how this fixes anything. As far as I can tell, what it 
> does is just replace the sets, dicts, and structs with tuples. But that 
> doesn't change anything about consistency, as far as I can understand.
>     
>     Let me try to ask in a slightly different way, to get at what I'm not 
> understanding. How does this solve the problem? At some point, you're still 
> taking the set of values whose order is unpredictable, and stuffing them into 
> a tuple, which is ordered. What about this format guarantees that the order 
> in which the elements of a set are converted into a tuple will be consistent?
>     
>     In the example you gave before, you've got a constraints field, which is 
> a set of strings. In normal JSON serialization, that turns into:
>        {u'values': [u'2', u'1']}
>     
>     In the serialization process, the json serializer took a field which was 
> actually a set([u'1', u'2']), and serialized it, producing [u'2', u'1']. The 
> serialization into a list imposed an order on the set. The problem is that we 
> can't count on consistency in how that order is chosen.
>     
>     Now, in this new serialization format, we're doing something very similar 
> - except that instead of serializing set([u'1', u'2']) into [u'1', u'2'], it 
> serializes it as (u'1', u'2'). To serialize it into a tuple, the serialized 
> needed to chose an order for the elements of the set. If I understand you 
> correctly, you're saying that this tuplized format solves the 
> ordering-consistency problem, somehow. But I don't see how. What about tuples 
> guarantees order consistency where lists don't?
>     
>     
>     
>     
>
> 
> Maxim Khutornenko wrote:
>     The key point here is that tuples are sorted at every step during 
> construction and specifically this line [1] ensures the consistency of sets 
> on compare. So, the {u'values': [u'2', u'1']} becomes ((u'values', (u'1', 
> u'2')),)) and so on when unwinding the recursion. 
>     
>     [1] - 
> https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L202
>     
>     There is no documentation besides what is already there in the 
> _diff_configs(). The concept is quite simple, so I am not sure what else 
> needs to be documented there. Perhaps add an output example?
> 
> Mark Chu-Carroll wrote:
>     An output example would definitely help.
>     
>     Also, more on the rationale - the explanation in that code doesn't 
> explain why it makes to go so far - particularly in the case of the new diff 
> algorithm that we're discussing.
>     
>     For example: why convert lists to tuples? Lists have the same ordering 
> guarantees and comparison guarantees as tuples. So you're not increasing the 
> comparability, but you are making the resulting diffs harder to read?
>     
>     (In fact, what's going on in that _hashable and _diff_configs seems to 
> not just be over-aggressive, but incorrect. A list [b, a] in a constraint 
> expression can't be converted to [a, b]. As it stands, that will claim no 
> diffs in cases where there's a significant diff, because it erases ordering 
> even when it shouldn't.)
>     
>     Why convert dictionaries to tuples? Dictionaries don't have an ordering 
> constraint; key-by-key comparison is fine. If you're dumping the config out 
> to a file for comparison by the unix diff tool, that sort-of makes sense, 
> because you need the keys to be in the same order for a text diff. But why 
> convert to a tuple of tuples? Why not just do a list of tuples? Again, 
> comparison wise it's equivalent, but readability-wise, a list of pairs is 
> provides an extra bit of visual cue about the structure, which is helpful for 
> the person reading the diff.
>     
>
> 
> Maxim Khutornenko wrote:
>     The specific case we were trying to solve in updater diff did not require 
> list support (as TaskConfig does not have them). So your observation is 
> correct, lists should not have been there. The correct implementation should 
> have lists as a separate branch where the element order would be preserved.
>     
>     As for the dictionaries, they have to be flattened as they have the same 
> inherent problems as sets. I have observed exactly the same unpredictable 
> re-ordering behavior in both sets and dicts when converted to json.
>     
>     As for the "tuples of tuples" approach, it was chosen for simplicity. The 
> diff results are not normally shown in the output so readability was not 
> critical in updater.
> 
> Mark Chu-Carroll wrote:
>     Ok - so the dictionary flatting part doesn't apply in the new diff 
> routine, because like-key comparison doesn't depend on key ordering.
>     
>     But the set issues are a problem. Based on your experience here, will 
> set-flattening-and-ordering be enough to get correct results?
>

As long as dict keys are compared consistently and values are compared 
recursively, I think set flattening should be enough.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22457/#review45835
-----------------------------------------------------------


On June 18, 2014, 2:59 p.m., Mark Chu-Carroll wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22457/
> -----------------------------------------------------------
> 
> (Updated June 18, 2014, 2:59 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Brian Wickman.
> 
> 
> Bugs: aurora-520
>     https://issues.apache.org/jira/browse/aurora-520
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new diff method, which uses field-by-field comparison of JSON trees for 
> comparing running job configurations to potentially updated configs.
> 
> - Allow exclusion of semantically irrelevant fields.
> - Provide a clearer list of the differences between configs.
> - Provide a scripting-friendly alternative JSON syntax for diffs.
> 
> The old diff behavior is still available under the "--use-shell-diff" option.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/cli/BUILD 
> ebe681a0d1735b7cc695dc3b7a14c4292d87ae32 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> 4fa03a6c9919651551238b0dc211ed69a8dfe565 
>   src/main/python/apache/aurora/client/cli/json_tree_diff.py PRE-CREATION 
>   src/test/python/apache/aurora/client/cli/BUILD 
> 3c88ed7cf9f654bbbd80d1d44aa1dd1c8655e378 
>   src/test/python/apache/aurora/client/cli/test_diff.py 
> 38629b63c082cf81cb891dace2a70d9e8f418e18 
>   src/test/python/apache/aurora/client/cli/test_json_diff.py PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22457/diff/
> 
> 
> Testing
> -------
> 
> New unit tests of the JSON tree diff code, plus a bunch of new "job diff" 
> tests of the new functionality.
> All tests pass.
> 
> 
> Thanks,
> 
> Mark Chu-Carroll
> 
>

Reply via email to