westonpace commented on code in PR #40696:
URL: https://github.com/apache/arrow/pull/40696#discussion_r1535576284
##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
 # (but that is an infinite space). Similarly, we would have to declare a
 # timestamp variation for all possible timezone strings.
 
-type_variations:
-  - parent: i8
-    name: u8
-    description: an unsigned 8 bit integer
-    functions: SEPARATE
-  - parent: i16
-    name: u16
-    description: an unsigned 16 bit integer
-    functions: SEPARATE
-  - parent: i32
-    name: u32
-    description: an unsigned 32 bit integer
-    functions: SEPARATE
-  - parent: i64
-    name: u64
-    description: an unsigned 64 bit integer
-    functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+# * There is no value in the decoded type that cannot be represented
+#   in the encoded type and vice versa.
+# * Functions have the same meaning when applied to the encoded type
+#
+# These types will never have a Substrait equivalent. In the Substrait point
+# of view these are execution details.
+
+# The following types are encodings:
+
+# binary_view
+# list_view
+# dictionary
+# ree
 
-  - parent: i16
-    name: fp16
-    description: a 16 bit floating point number
-    functions: SEPARATE
+# Arrow-cpp's Substrait serde does not yet handle parameterized UDFs. This means
+# the following types are not yet supported but may be supported in the future.
+# We define them below in case other implementations support them in the meantime.
+# decimal256
+# large_list
+# fixed_size_list
+# duration
+# time32 - not technically a parameterized type, but unsupported for similar reasons
+
+# Other types are not encodings, but are not first-class in Substrait. These
+# types are often similar to existing Substrait types but define a different range
+# of values. For example, unsigned integer types are very similar to their integer
+# counterparts, but have a different range of values. These types are defined here
+# as extension types.
+#

Review Comment:
I think I explain this above (I have updated the wording slightly)?

```
# Certain Arrow data types are, from Substrait's point of view, encodings.
# These include dictionary, the view types (e.g. binary view, list view),
# and REE.
#
# These types are not logically distinct from the type they are encoding.
# Specifically, the types meet the following criteria:
# * There is no value in the decoded type that cannot be represented
#   as a value in the encoded type and vice versa.
# * Functions have the same meaning when applied to the encoded type
#
# Note: if two types have a different range (e.g. string and large_string) then
# they do not satisfy the above criteria and are not encodings.
#
# These types will never have a Substrait equivalent. In the Substrait point
# of view these are execution details.
```

So `large_string` and `string` are different types because `concat(<string-with-2B-characters>, 'x')` will have a different output for `string` and `large_string` (it will output an `error` given `string` and a valid value given `large_string`). However, there are no possible inputs that could lead to a different function output between `string` and `string_view`.
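The `concat` argument above can be sketched as a toy model. This is only an illustration, not Arrow or Substrait code: the names `MAX_VALUE_BYTES`, `concat_result`, and `is_encoding_of` are hypothetical, and each type is reduced to an assumed maximum byte length for a single value (32-bit offsets for `string`, 64-bit offsets for `large_string`, a 32-bit length field for `string_view`). It also only compares the types on a handful of sample inputs, so a `True` result is evidence rather than proof.

```python
# Toy model: each type is reduced to the maximum byte length of one value.
# These limits are simplifying assumptions made for illustration only.
MAX_VALUE_BYTES = {
    "string": 2**31 - 1,        # assumed: 32-bit offsets
    "large_string": 2**63 - 1,  # assumed: 64-bit offsets
    "string_view": 2**31 - 1,   # assumed: 32-bit length field per view
}


def concat_result(type_name, left_len, right_len):
    """Return the output length of concat, or "error" if it is not representable."""
    total = left_len + right_len
    return total if total <= MAX_VALUE_BYTES[type_name] else "error"


def is_encoding_of(candidate, base, sample_inputs):
    """True if concat gives the same result for both types on every sample input."""
    return all(
        concat_result(candidate, a, b) == concat_result(base, a, b)
        for a, b in sample_inputs
    )


# (left length, right length) pairs, including the boundary case from the comment.
samples = [(0, 0), (5, 1), (2**31 - 2, 1), (2**31 - 1, 1)]

print(is_encoding_of("string_view", "string", samples))   # True
print(is_encoding_of("large_string", "string", samples))  # False
```

In this model the last sample is the one that separates `string` from `large_string`: the concatenated length no longer fits in `string`, so the two types disagree on the function's output, while `string_view` agrees with `string` on every sample.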
