Re: [PR] GH-40695 [C++] Expand Substrait type support [arrow]

via GitHub Fri, 22 Mar 2024 06:34:41 -0700


westonpace commented on code in PR #40696:
URL: https://github.com/apache/arrow/pull/40696#discussion_r1535576284



##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
 # (but that is an infinite space). Similarly, we would have to declare a
 # timestamp variation for all possible timezone strings.
 
-type_variations:
-  - parent: i8
-    name: u8
-    description: an unsigned 8 bit integer
-    functions: SEPARATE
-  - parent: i16
-    name: u16
-    description: an unsigned 16 bit integer
-    functions: SEPARATE
-  - parent: i32
-    name: u32
-    description: an unsigned 32 bit integer
-    functions: SEPARATE
-  - parent: i64
-    name: u64
-    description: an unsigned 64 bit integer
-    functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+#  *  There is no value in the decoded type that cannot be represented
+#     in the encoded type and vice versa.
+#  *  Functions have the same meaning when applied to the encoded type
+# 
+# These types will never have a Substrait equivalent.  In the Substrait point
+# of view these are execution details.
+
+# The following types are encodings:
+
+# binary_view
+# list_view
+# dictionary
+# ree
 
-  - parent: i16
-    name: fp16
-    description: a 16 bit floating point number
-    functions: SEPARATE
+# Arrow-cpp's Substrait serde does not yet handle parameterized UDFs.  This means
+# the following types are not yet supported but may be supported in the future.
+# We define them below in case other implementations support them in the meantime.
 
+# decimal256
+# large_list
+# fixed_size_list
+# duration
+# time32 - not technically a parameterized type, but unsupported for similar reasons
+
+# Other types are not encodings, but are not first-class in Substrait.  These
+# types are often similar to existing Substrait types but define a different range
+# of values.  For example, unsigned integer types are very similar to their integer
+# counterparts, but have a different range of values.  These types are defined here
+# as extension types.
+#

Review Comment:
   I think I explain this above (I have updated the wording slightly)?
   
   ```
   # Certain Arrow data types are, from Substrait's point of view, encodings.
   # These include dictionary, the view types (e.g. binary view, list view),
   # and REE.
   #
   # These types are not logically distinct from the type they are encoding.
   # Specifically, the types meet the following criteria:
   #  *  There is no value in the decoded type that cannot be represented
   #     as a value in the encoded type and vice versa.
   #  *  Functions have the same meaning when applied to the encoded type
   #
   # Note: if two types have a different range (e.g. string and large_string) then
   # they do not satisfy the above criteria and are not encodings.
   #
   # These types will never have a Substrait equivalent.  In the Substrait point
   # of view these are execution details.
   ```
   
   So `large_string` and `string` are different types because
   `concat(<string-with-2B-characters>, 'x')` produces a different output for
   each: an `error` given `string` and a valid value given `large_string`.
   However, there are no possible inputs that could lead to a different
   function output between `string` and `string_view`.
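
To make the two criteria concrete, here is a minimal sketch in plain Python. It does not use Arrow or Substrait. The helper names (`concat_result_length`, `dictionary_encode`, `dictionary_decode`, `upper_encoded`) are hypothetical and exist only for illustration. The one assumption taken from Arrow is that `string` addresses its data with 32-bit offsets and `large_string` with 64-bit offsets. The sketch models the `concat` overflow by length alone, so no 2 GB buffer is allocated.

```python
INT32_MAX = 2**31 - 1  # largest offset for a 32-bit-offset `string`
INT64_MAX = 2**63 - 1  # largest offset for a 64-bit-offset `large_string`


def concat_result_length(left_len, right_len, max_offset):
    """Model concat() by length only (hypothetical helper).

    Returns the byte length of the result, or raises if that length cannot
    be addressed with the given maximum offset.
    """
    total = left_len + right_len
    if total > max_offset:
        raise OverflowError(f"{total} bytes exceeds max offset {max_offset}")
    return total


def dictionary_encode(values):
    """Toy dictionary encoding: unique values plus one index per element."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Inverse of dictionary_encode."""
    return [dictionary[i] for i in indices]


def upper_encoded(dictionary, indices):
    """Apply upper() to the encoded form: only the dictionary is touched."""
    return [value.upper() for value in dictionary], indices


if __name__ == "__main__":
    # Different range: the same call fails for one type and succeeds for the other.
    try:
        concat_result_length(INT32_MAX, 1, INT32_MAX)
    except OverflowError as exc:
        print("string:", exc)
    print("large_string:", concat_result_length(INT32_MAX, 1, INT64_MAX))

    # Encoding: every value round-trips, and the function result is the same
    # whether it is applied before or after decoding.
    values = ["a", "b", "a"]
    dictionary, indices = dictionary_encode(values)
    assert dictionary_decode(dictionary, indices) == values
    assert dictionary_decode(*upper_encoded(dictionary, indices)) == [
        v.upper() for v in values
    ]
```

The first half shows why a pair of types with different ranges fails the criteria, and the second half shows a type that meets both of them. Real view and REE layouts differ in detail, but the same round-trip argument is what the comment relies on.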



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
