scovich commented on PR #7961: URL: https://github.com/apache/arrow-rs/pull/7961#issuecomment-3089972179
> While reviewing this I was thinking maybe we should revisit equality
>
> I think what we are doing is trying to make `Variant::eq` compare whether the Variants are _logically_ equal (that is, represent the same data), rather than comparing whether the Variants are _physically_ equal (that is, represented with the same bytes)

I agree that whatever we do should not be merely physical byte comparison... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence:

* **Primitive values** -- the variant spec defines a notion of [equivalence class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types) that would almost certainly consider logical equality a "user expression":

  > User expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

  ... but int, float and double are all in different equivalence classes, so by a strict reading of the spec `1f32` is _not_ equal to `1f64` is _not_ equal to `1i8` -- even though they are _exactly_ the same value (???)
* **Short string** -- The spec requires that

  > operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.

  ... so we'd have to add that special case to any logical comparison (these first two pitfalls are illustrated in the sketch below).
* **Array** -- The `is_large` and `field_offset_size_minus_one` header fields both change the physical layout of the bytes that follow, but do not impact logical equivalence. Further, one must (recursively) logically compare each element as well, because equivalent values can have physically different bytes.
* **Object** -- Same considerations as Array, but also with `field_id_size_minus_one`. Additionally, field ids become _highly_ problematic:
  * One could claim that two objects are the same if their field ids are the same. But that's only true if they were both encoded using the same metadata dictionary!
  * One could claim that two objects are _not_ the same if their field ids differ. But that wouldn't necessarily be true if they were encoded with different metadata dictionaries. It might not even be true when both were encoded with the same (unordered) metadata dictionary, because the same string can appear multiple times with different field ids.
  * One could do actual string comparisons, probing the objects' respective metadata dictionaries. But to what end? Even if they compare logically equal, it's not safe to treat them as equivalent (e.g. by copying a field's bytes from one variant object to another), because the field ids would need to be rewritten -- recursively -- or they become garbage.

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

I suspect this is why the [Delta spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#compatibility-with-other-delta-features) states that "variant is not a comparable data type" and does not support variant in any context that requires comparisons (partition columns, clustering, etc). I couldn't find any Databricks documentation specifying the behavior of variant comparisons.
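To make those first two pitfalls concrete, here is a toy sketch of what a strict reading would imply. To be clear, `ToyVariant` and `logically_eq` are invented for illustration and are **not** the parquet-variant API; real values would first have to be decoded from their binary encoding:

```rust
// A toy model, NOT the parquet-variant API: just enough cases to show the
// equivalence-class and short-string pitfalls described above.
enum ToyVariant<'a> {
    Int8(i8),
    Decimal16 { scale: u8, unscaled: i128 },
    Float(f32),
    Double(f64),
    ShortString(&'a str), // short string optimization
    String(&'a str),      // long string encoding
}

fn logically_eq(a: &ToyVariant, b: &ToyVariant) -> bool {
    use ToyVariant::*;
    match (a, b) {
        // Same equivalence class: int8 1 equals decimal16(scale=2, unscaled=100).
        (Int8(x), Decimal16 { scale, unscaled })
        | (Decimal16 { scale, unscaled }, Int8(x)) => {
            10i128
                .checked_pow(u32::from(*scale))
                .and_then(|pow| i128::from(*x).checked_mul(pow))
                == Some(*unscaled)
        }
        (Int8(x), Int8(y)) => x == y,
        (Float(x), Float(y)) => x == y,
        (Double(x), Double(y)) => x == y,
        // Different equivalence classes never compare equal under a strict
        // reading, even when the values are numerically identical (???).
        (Int8(_) | Float(_) | Double(_), Int8(_) | Float(_) | Double(_)) => false,
        // The short string special case: encoding must not affect the result.
        (ShortString(x) | String(x), ShortString(y) | String(y)) => x == y,
        // All other combinations (decimal vs. decimal, string vs. number, the
        // recursive array/object cases, ...) omitted from this sketch.
        _ => false,
    }
}

fn main() {
    let one = ToyVariant::Int8(1);
    assert!(logically_eq(&one, &ToyVariant::Decimal16 { scale: 2, unscaled: 100 }));
    assert!(!logically_eq(&one, &ToyVariant::Float(1.0))); // different class!
    assert!(logically_eq(&ToyVariant::ShortString("hello"), &ToyVariant::String("hello")));
}
```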
> If we are trying to make `Variant::eq` compare logically, I think we shouldn't be comparing `VariantMetadata` at all (as in remove this)

I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partly physical and partly logical:

* the value of `sorted_strings` must match. Technically, it's possible for an unsorted dictionary to be equivalent to a sorted one, if its strings happened to be sorted... but that's a rare enough case I don't think it's worth optimizing for.
* `offset_size_minus_one` should _not_ influence the comparison
* `dictionary_size` must match
* the _unpacked_ offsets must all match (they may not be encoded using the same number of bytes)
* the `bytes` must all match

That way, two metadata dictionaries only compare equal if they contain the same strings _and_ they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).
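For what it's worth, a minimal sketch of such a part-physical / part-logical comparison, assuming a made-up, already-decoded view of a metadata dictionary (`ToyMetadata` is invented for illustration, not the actual `VariantMetadata` API):

```rust
// A toy model, NOT the parquet-variant `VariantMetadata` API: assume the
// header fields and offsets were already decoded from the underlying buffer.
struct ToyMetadata<'a> {
    sorted_strings: bool, // header flag
    #[allow(dead_code)] // kept to show it is deliberately NOT compared
    offset_size: u8, // decoded offset_size_minus_one + 1, in bytes
    offsets: Vec<u32>, // the *unpacked* offsets (dictionary_size + 1 entries)
    bytes: &'a [u8], // the concatenated dictionary string bytes
}

fn metadata_eq(a: &ToyMetadata, b: &ToyMetadata) -> bool {
    a.sorted_strings == b.sorted_strings
        // `offset_size` is ignored: it changes the physical encoding of the
        // offsets, but not the logical content.
        && a.offsets == b.offsets // equal lengths also imply equal dictionary_size
        && a.bytes == b.bytes
}

fn main() {
    // Same strings ("a", "bc") with the same field ids, but hypothetically
    // encoded with different offset widths: still compare equal.
    let a = ToyMetadata { sorted_strings: true, offset_size: 1, offsets: vec![0, 1, 3], bytes: b"abc" };
    let b = ToyMetadata { sorted_strings: true, offset_size: 4, offsets: vec![0, 1, 3], bytes: b"abc" };
    assert!(metadata_eq(&a, &b));
}
```

Two dictionaries pass this check only if they contain the same strings and assign them the same field ids, which is exactly the property that would make the byte-swap optimization above safe.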