westonpace commented on code in PR #40696:
URL: https://github.com/apache/arrow/pull/40696#discussion_r1535611814
##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
# (but that is an infinite space). Similarly, we would have to declare a
# timestamp variation for all possible timezone strings.
-type_variations:
- - parent: i8
- name: u8
- description: an unsigned 8 bit integer
- functions: SEPARATE
- - parent: i16
- name: u16
- description: an unsigned 16 bit integer
- functions: SEPARATE
- - parent: i32
- name: u32
- description: an unsigned 32 bit integer
- functions: SEPARATE
- - parent: i64
- name: u64
- description: an unsigned 64 bit integer
- functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+# * There is no value in the decoded type that cannot be represented
+# in the encoded type and vice versa.
Review Comment:
I ended up reverting this suggestion and refining the wording a little. What
I am trying to explain here is the criterion for "encoding" vs. "user-defined
type". I guess I could be more mathematically formal:
Let T1 and T2 be two types. T2 is an encoding of T1 if
there exist functions ENCODE: T1 -> T2 and DECODE: T2 -> T1 such that:
 * For every value x in T1, the value DECODE(ENCODE(x)) is equal to x
 * For every value y in T2, the value ENCODE(DECODE(y)) is equal to y
This is not true for something like uint8/int8 because int8 cannot represent
the uint8 value 128, so there is no value-preserving function that encodes
it.
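
A minimal sketch of the criterion in plain Python, for illustration only. It
does not use the Arrow or Substrait APIs; `dictionary_encode`,
`dictionary_decode`, and `uint8_to_int8` are hypothetical helpers, and the
dictionary layout is a simplified stand-in for Arrow's. It only checks the
DECODE(ENCODE(x)) direction on logical values.

```python
INT8_MIN, INT8_MAX = -128, 127


def dictionary_encode(values):
    """Encode a list as (dictionary, indices), in order of first appearance."""
    dictionary, positions, indices = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Recover the original list of values from (dictionary, indices)."""
    return [dictionary[i] for i in indices]


def uint8_to_int8(x):
    """Value-preserving conversion; fails when int8 cannot hold x."""
    if not (INT8_MIN <= x <= INT8_MAX):
        raise OverflowError(f"{x} is not representable as int8")
    return x


def unrepresentable_uint8_values():
    """All uint8 values that have no equal value in int8."""
    return [x for x in range(256) if x > INT8_MAX]


# Dictionary encoding round-trips, so it behaves like an encoding.
column = ["a", "b", "a", "c"]
assert dictionary_decode(*dictionary_encode(column)) == column

# uint8 -> int8 does not: 128 is the first value with nowhere to go.
assert unrepresentable_uint8_values()[0] == 128
```

Under this reading, dictionary is an encoding of its value type, while
uint8 has to be modeled as something other than an encoding of int8.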