scovich commented on PR #7961: URL: https://github.com/apache/arrow-rs/pull/7961#issuecomment-3089972179
> While reviewing this I was thinking maybe we should revisit equality
>
> I think what we are doing is trying to make `Variant::eq` compare whether the Variants are _logically_ equal (that is, represent the same data), rather than comparing whether the Variants are _physically_ equal (that is, represented with the same bytes)

I agree that whatever we do should not be merely physical byte comparison... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence:

* **Primitive values** -- the variant spec defines a notion of [equivalence class](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types) that would almost certainly consider logical equality a "user expression":

  > User expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

  ... but int, float and double are all in different equivalence classes, so by a strict reading of the spec `1f32` is _not_ equal to `1f64` is _not_ equal to `1i8` -- even though they are _exactly_ the same value (???)
* **Short string** -- The spec requires that

  > operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.

  ... so we'd have to add that special case to any logical comparison (these first two pitfalls are illustrated in the sketch below).
* **Array** -- The `is_large` and `field_offset_size_minus_one` header fields both change the physical layout of the bytes that follow, but do not impact logical equivalence. Further, one must (recursively) logically compare each element as well, because equivalent values can have physically different bytes.
* **Object** -- Same considerations as Array, but also with `field_id_size_minus_one`. Additionally, field ids become _highly_ problematic:
  * One could claim that two objects are the same if their field ids are the same. But that's only true if they were both encoded using the same metadata dictionary!
  * One could claim that two objects are _not_ the same if their field ids differ. But that wouldn't necessarily be true if they were encoded with different metadata dictionaries. It might not even be true when both were encoded with the same (unordered) metadata dictionary, because the same string can appear multiple times with different field ids.
  * One could do actual string comparisons, probing the objects' respective metadata dictionaries. But to what end? Even if they compare logically equal, it's not safe to treat them as equivalent (e.g. by copying a field's bytes from one variant object to another), because the field ids would need to be rewritten -- recursively -- or they become garbage.

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

I suspect this is why the [Delta spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#compatibility-with-other-delta-features) states that "variant is not a comparable data type" and does not support variant in any context that requires comparisons (partition columns, clustering, etc). I couldn't find any Databricks documentation specifying the behavior of variant comparisons.
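To make those first two pitfalls concrete, here is a toy sketch of what a strict reading would imply. To be clear, `ToyVariant` and `logically_eq` are invented for illustration and are **not** the parquet-variant API; real values would first have to be decoded from their binary encoding:

```rust
// A toy model, NOT the parquet-variant API: just enough cases to show the
// equivalence-class and short-string pitfalls described above.
enum ToyVariant<'a> {
    Int8(i8),
    Decimal16 { scale: u8, unscaled: i128 },
    Float(f32),
    Double(f64),
    ShortString(&'a str), // short string optimization
    String(&'a str),      // long string encoding
}

fn logically_eq(a: &ToyVariant, b: &ToyVariant) -> bool {
    use ToyVariant::*;
    match (a, b) {
        // Same equivalence class: int8 1 equals decimal16(scale=2, unscaled=100).
        (Int8(x), Decimal16 { scale, unscaled })
        | (Decimal16 { scale, unscaled }, Int8(x)) => {
            10i128
                .checked_pow(u32::from(*scale))
                .and_then(|pow| i128::from(*x).checked_mul(pow))
                == Some(*unscaled)
        }
        (Int8(x), Int8(y)) => x == y,
        (Float(x), Float(y)) => x == y,
        (Double(x), Double(y)) => x == y,
        // Different equivalence classes never compare equal under a strict
        // reading, even when the values are numerically identical (???).
        (Int8(_) | Float(_) | Double(_), Int8(_) | Float(_) | Double(_)) => false,
        // The short string special case: encoding must not affect the result.
        (ShortString(x) | String(x), ShortString(y) | String(y)) => x == y,
        // All other combinations (decimal vs. decimal, string vs. number, the
        // recursive array/object cases, ...) omitted from this sketch.
        _ => false,
    }
}

fn main() {
    let one = ToyVariant::Int8(1);
    assert!(logically_eq(&one, &ToyVariant::Decimal16 { scale: 2, unscaled: 100 }));
    assert!(!logically_eq(&one, &ToyVariant::Float(1.0))); // different class!
    assert!(logically_eq(&ToyVariant::ShortString("hello"), &ToyVariant::String("hello")));
}
```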
> If we are trying to make `Variant::eq` compare logically, I think we shouldn't be comparing `VariantMetadata` at all (as in remove this)

I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partly physical and partly logical:

* the value of `sorted_strings` must match. Technically, it's possible for an unsorted dictionary to be equivalent to a sorted one, if its strings happened to be sorted... but that's a rare enough case I don't think it's worth optimizing for.
* `offset_size_minus_one` should _not_ influence the comparison
* `dictionary_size` must match
* the _unpacked_ offsets must all match (they may not be encoded using the same number of bytes)
* the `bytes` must all match

That way, two metadata dictionaries only compare equal if they contain the same strings _and_ they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).
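For what it's worth, a minimal sketch of such a part-physical / part-logical comparison, assuming a made-up, already-decoded view of a metadata dictionary (`ToyMetadata` is invented for illustration, not the actual `VariantMetadata` API):

```rust
// A toy model, NOT the parquet-variant `VariantMetadata` API: assume the
// header fields and offsets were already decoded from the underlying buffer.
struct ToyMetadata<'a> {
    sorted_strings: bool, // header flag
    #[allow(dead_code)] // kept to show it is deliberately NOT compared
    offset_size: u8, // decoded offset_size_minus_one + 1, in bytes
    offsets: Vec<u32>, // the *unpacked* offsets (dictionary_size + 1 entries)
    bytes: &'a [u8], // the concatenated dictionary string bytes
}

fn metadata_eq(a: &ToyMetadata, b: &ToyMetadata) -> bool {
    a.sorted_strings == b.sorted_strings
        // `offset_size` is ignored: it changes the physical encoding of the
        // offsets, but not the logical content.
        && a.offsets == b.offsets // equal lengths also imply equal dictionary_size
        && a.bytes == b.bytes
}

fn main() {
    // Same strings ("a", "bc") with the same field ids, but hypothetically
    // encoded with different offset widths: still compare equal.
    let a = ToyMetadata { sorted_strings: true, offset_size: 1, offsets: vec![0, 1, 3], bytes: b"abc" };
    let b = ToyMetadata { sorted_strings: true, offset_size: 4, offsets: vec![0, 1, 3], bytes: b"abc" };
    assert!(metadata_eq(&a, &b));
}
```

Two dictionaries pass this check only if they contain the same strings and assign them the same field ids, which is exactly the property that would make the byte-swap optimization above safe.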