scovich commented on PR #7961:
URL: https://github.com/apache/arrow-rs/pull/7961#issuecomment-3089972179

   > While reviewing this I was thinking maybe we should revisit equality
   > 
   > I think what we are doing is trying to make `Variant::eq` compare whether the Variants are _logically_ equal (that is, they represent the same data), rather than comparing whether the Variants are _physically_ equal (that is, they are represented with the same bytes)
    
   I agree that whatever we do should not be merely physical byte 
comparisons... but what does logical equality even mean? As in, if two variant 
objects compare logically equal, what can I do with that newfound knowledge? 
What goes wrong if I treat two variant objects as "equal" when they are not? 
etc.
   
   Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence (a rough code sketch follows this list):
   * **Primitive values** -- the variant spec defines a notion of [equivalence 
class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types)
 that would almost certainly consider logical equality as a "user expression":
     > User expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
   
     ... but int, float and double are all in different equivalence classes, so by a strict reading of the spec `1f32` is _not_ equal to `1f64` is _not_ equal to `1i8` -- even though they are _exactly_ the same value (???)
   * **Short string** -- The spec requires that
     > operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
   
     ... so we'd have to add that special case to any logical comparison
   * **Array** -- `is_large` and `field_offset_size_minus_one` header fields both change the physical layout of the bytes that follow, but do not impact logical equivalence. Further, one must (recursively) logically compare each element as well, because equivalent values can have physically different bytes.
   * **Object** -- Same considerations as Array, but also with 
`field_id_size_minus_one`. Additionally, field ids become _highly_ problematic: 
     * One could claim that two objects are the same if their field ids are the 
same. But that's only true if they both were encoded using the same metadata 
dictionary!
     * One could claim that two objects are _not_ the same if their field ids 
differ. But that wouldn't necessarily be true if they were encoded with 
different metadata dictionaries. It might not even be true when both were 
encoded with the same (unordered) metadata dictionary, because the same string 
could appear multiple times with different field ids.
     * One could do actual string comparisons, probing the objects' respective metadata dictionaries. But to what end? Even if they compare logically equal, it's not safe to treat them as equivalent (e.g. by copying a field's bytes from one variant object to another), because the field ids would need to be rewritten -- recursively -- or they become garbage.
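   
   To make these pitfalls concrete, here is a minimal, self-contained sketch of what a recursive logical comparison might look like. The types below are hypothetical stand-ins for illustration only, not the actual `parquet-variant` API, and the numeric rules implement just one possible reading of the spec's equivalence classes:
   
   ```rust
   /// Hypothetical simplified model of a decoded variant value. Objects are
   /// assumed to have already been resolved to (name, value) pairs sorted by
   /// name, so field ids never enter the comparison.
   enum Value {
       Int8(i8),
       Int64(i64),
       /// e.g. decimal16 with scale 2 and unscaled value 100 represents 1.00
       Decimal { scale: u8, unscaled: i128 },
       Float(f32),
       Double(f64),
       /// Short-string vs long-string encoding is deliberately not modeled:
       /// the spec says both must behave the same.
       String(String),
       List(Vec<Value>),
       Object(Vec<(String, Value)>),
   }
   
   fn logically_equal(a: &Value, b: &Value) -> bool {
       use Value::*;
       match (a, b) {
           // The "exact numeric" equivalence class: integers and decimals
           // compare by the value they represent, not by physical encoding.
           (Int8(x), Int8(y)) => x == y,
           (Int64(x), Int64(y)) => x == y,
           (Int8(x), Int64(y)) | (Int64(y), Int8(x)) => i64::from(*x) == *y,
           (Int8(x), Decimal { scale, unscaled })
           | (Decimal { scale, unscaled }, Int8(x)) => {
               // int8 1 == decimal with scale 2 and unscaled value 100
               i128::from(*x) * 10i128.pow(u32::from(*scale)) == *unscaled
           }
           (Decimal { scale: s1, unscaled: u1 }, Decimal { scale: s2, unscaled: u2 }) => {
               // Rescale the smaller-scale value up before comparing.
               // (Overflow handling elided for brevity.)
               let (lo, hi, small, big) =
                   if s1 <= s2 { (*u1, *u2, *s1, *s2) } else { (*u2, *u1, *s2, *s1) };
               lo * 10i128.pow(u32::from(big - small)) == hi
           }
           // A strict reading keeps float and double in *different* equivalence
           // classes, so Float(1.0) vs Double(1.0) falls through to `false`.
           (Float(x), Float(y)) => x == y,
           (Double(x), Double(y)) => x == y,
           (String(x), String(y)) => x == y,
           (List(xs), List(ys)) => {
               xs.len() == ys.len() && xs.iter().zip(ys).all(|(x, y)| logically_equal(x, y))
           }
           (Object(xs), Object(ys)) => {
               // Compare by field *name*, never by field id: ids are only
               // meaningful relative to each value's own metadata dictionary.
               xs.len() == ys.len()
                   && xs
                       .iter()
                       .zip(ys)
                       .all(|((nx, vx), (ny, vy))| nx == ny && logically_equal(vx, vy))
           }
           // Anything else -- including float vs double, and the numeric
           // cross-type cases this sketch elides -- compares unequal.
           _ => false,
       }
   }
   ```
   
   Even in this toy form, most of the policy questions above show up as individual `match` arms -- and note how much decoding has to happen before the comparison can even start.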
   
   That last part is my biggest worry -- if it's not safe to physically swap 
the bytes of two logically equivalent variant objects (because the field ids 
could go out of sync), what use is logical equivalence? I suspect this is why 
the [Delta 
spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#compatibility-with-other-delta-features)
 states that "variant is not a comparable data type" and does not support 
variant in any context that requires comparisons (partition columns, 
clustering, etc). I couldn't find any Databricks documentation specifying the 
behavior of variant comparisons.
   
   > If we are trying to make `Variant::eq` compare logically, I think we shouldn't be comparing `VariantMetadata` at all (as in remove this)
   
   I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partly physical and partly logical (see the sketch after this list):
   * the value of `sorted_strings` must match. Technically, it's possible an unsorted one could be equivalent to a sorted one, if its strings happened to be sorted... but that's a rare enough case that I don't think it's worth optimizing for.
   * `offset_size_minus_one` should _not_ influence the comparison
   * `dictionary_size` must match
   * the _unpacked_ offsets must all match (the two dictionaries might encode their offsets with different byte widths, so compare the decoded values)
   * the `bytes` must all match
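   
   In code form, that comparison might look something like the following. `DecodedMetadata` here is a hypothetical decoded representation invented for this sketch, not the actual `VariantMetadata` API:
   
   ```rust
   /// Hypothetical decoded form of a metadata dictionary.
   struct DecodedMetadata {
       sorted_strings: bool,
       /// Offsets unpacked to `usize`, so `offset_size_minus_one` cannot
       /// influence the comparison; `dictionary_size` is implied by their count.
       offsets: Vec<usize>,
       /// The raw bytes of the concatenated dictionary strings.
       bytes: Vec<u8>,
   }
   
   fn metadata_equal(a: &DecodedMetadata, b: &DecodedMetadata) -> bool {
       // `sorted_strings` must match (ignoring the rare unsorted-but-sorted case).
       a.sorted_strings == b.sorted_strings
           // Equal offset vectors imply equal `dictionary_size` AND that both
           // dictionaries slice their bytes identically, i.e. every string is
           // assigned the same field id in both.
           && a.offsets == b.offsets
           // The string bytes themselves must match.
           && a.bytes == b.bytes
   }
   ```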
   
   That way, two metadata dictionaries only compare equal if they contain the 
same strings _and_ they assign the same field ids to those strings. Such a 
logical comparison makes it safe to swap the bytes of one metadata dictionary 
with the bytes of another that compares logically equal, e.g. to improve 
parquet dictionary encoding of the field. But I'm not sure that would happen 
often enough to be worth optimizing for? Especially because (for unordered 
metadata at least) one would likely want the ability to replace a metadata 
dictionary with a different one that provides a superset of field names (with 
matching field ids in the common part).
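   
   That superset check is easy to express over the same hypothetical `DecodedMetadata` from the sketch above: the replacement is compatible if the original's strings form a prefix of its strings, so every shared string keeps its field id. (This only makes sense for unordered metadata; appending to a sorted dictionary would generally break its ordering.)
   
   ```rust
   /// String `i` of a dictionary covers `bytes[offsets[i]..offsets[i + 1]]`.
   fn string_at<'a>(m: &'a DecodedMetadata, i: usize) -> &'a [u8] {
       &m.bytes[m.offsets[i]..m.offsets[i + 1]]
   }
   
   /// True if `sup` can replace `sub`: every string in `sub` appears in `sup`
   /// with the same field id (i.e. `sub`'s strings are a prefix of `sup`'s).
   fn is_compatible_superset(sup: &DecodedMetadata, sub: &DecodedMetadata) -> bool {
       let sub_strings = sub.offsets.len().saturating_sub(1);
       sub.offsets.len() <= sup.offsets.len()
           && (0..sub_strings).all(|i| string_at(sub, i) == string_at(sup, i))
   }
   ```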

