bkietz commented on code in PR #40696:
URL: https://github.com/apache/arrow/pull/40696#discussion_r1533780643
##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
# (but that is an infinite space). Similarly, we would have to declare a
# timestamp variation for all possible timezone strings.
-type_variations:
- - parent: i8
- name: u8
- description: an unsigned 8 bit integer
- functions: SEPARATE
- - parent: i16
- name: u16
- description: an unsigned 16 bit integer
- functions: SEPARATE
- - parent: i32
- name: u32
- description: an unsigned 32 bit integer
- functions: SEPARATE
- - parent: i64
- name: u64
- description: an unsigned 64 bit integer
- functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+# * There is no value in the decoded type that cannot be represented
+# in the encoded type and vice versa.
Review Comment:
Nit:
```suggestion
# * There is no column of the decoded type that cannot be represented
# as a column the encoded type and vice versa.
```
##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
# (but that is an infinite space). Similarly, we would have to declare a
# timestamp variation for all possible timezone strings.
-type_variations:
- - parent: i8
- name: u8
- description: an unsigned 8 bit integer
- functions: SEPARATE
- - parent: i16
- name: u16
- description: an unsigned 16 bit integer
- functions: SEPARATE
- - parent: i32
- name: u32
- description: an unsigned 32 bit integer
- functions: SEPARATE
- - parent: i64
- name: u64
- description: an unsigned 64 bit integer
- functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+# * There is no value in the decoded type that cannot be represented
+# in the encoded type and vice versa.
+# * Functions have the same meaning when applied to the encoded type
+#
+# These types will never have a Substrait equivalent. In the Substrait point
+# of view these are execution details.
+
+# The following types are encodings:
+
+# binary_view
+# list_view
+# dictionary
+# ree
- - parent: i16
- name: fp16
- description: a 16 bit floating point number
- functions: SEPARATE
+# Arrow-cpp's Substrait serde does not yet handle parameterized UDFs. This
means
Review Comment:
```suggestion
# Arrow-cpp's Substrait serde does not yet handle parameterized UDTs. This
means
```
##########
format/substrait/extension_types.yaml:
##########
@@ -80,3 +99,74 @@ types:
months: i32
days: i32
nanos: i64
+ # All signed integer literals are encoded as user defined literals with
+ # a google.protobuf.UInt64Value message.
+ - name: i8
+ structure: {}
+ - name: i16
+ structure: {}
+ - name: i32
+ structure: {}
+ - name: i64
+ structure: {}
+ # fp16 literals are encoded as user defined literals with
+ # a google.protobuf.UInt32Value message where the lower 16 bits are
+ # the fp16 value.
+ - name: fp16
+ structure: {}
+ # 64-bit integers are big. Even though date64 stores ms and not days it
+ # can still represent about 50x more dates than date32. Since it has a
+ # different range of values, it is an extension type.
+ #
+ # date64 literals are encoded as user defined literals with
+ # a google.protobuf.UInt64Value message.
+ - name: date_millis
+ structure: {}
+ # We cannot generate `time` today for reasons similar to parameterized
+ # UDTs.
+ - name: time_seconds
+ structure: {}
+ - name: time_millis
+ structure: {}
+ - name: time_nanos
+ # Large string literals are encoded using a
+ # google.protobuf.StringValue message.
+ structure: {}
Review Comment:
```suggestion
structure: {}
# Large string literals are encoded using a
# google.protobuf.StringValue message.
```
##########
format/substrait/extension_types.yaml:
##########
@@ -80,3 +99,74 @@ types:
months: i32
days: i32
nanos: i64
+ # All signed integer literals are encoded as user defined literals with
+ # a google.protobuf.UInt64Value message.
+ - name: i8
+ structure: {}
+ - name: i16
+ structure: {}
+ - name: i32
+ structure: {}
+ - name: i64
+ structure: {}
+ # fp16 literals are encoded as user defined literals with
+ # a google.protobuf.UInt32Value message where the lower 16 bits are
+ # the fp16 value.
+ - name: fp16
+ structure: {}
+ # 64-bit integers are big. Even though date64 stores ms and not days it
+ # can still represent about 50x more dates than date32. Since it has a
+ # different range of values, it is an extension type.
+ #
+ # date64 literals are encoded as user defined literals with
+ # a google.protobuf.UInt64Value message.
+ - name: date_millis
+ structure: {}
+ # We cannot generate `time` today for reasons similar to parameterized
+ # UDTs.
+ - name: time_seconds
+ structure: {}
+ - name: time_millis
+ structure: {}
+ - name: time_nanos
+ # Large string literals are encoded using a
+ # google.protobuf.StringValue message.
+ structure: {}
+ - name: large_string
+ structure: {}
+ # Large binary literals are encoded using a
+ # google.protobuf.BytesValue message.
+ - name: large_binary
+ structure: {}
+ # We cannot generate these today because they are parameterized UDTs and
+ # substrait-cpp does not yet support parameterized UDTs.
+ - name: decimal256
+ structure: {}
+ parameters:
+ - name: precision
+ type: integer
+ min: 0
+ max: 76
+ - name: scale
+ type: integer
+ min: 0
+ max: 76
+ - name: large_list
+ structure: {}
+ parameters:
+ - name: value_type
+ type: dataType
+ - name: fixed size list
Review Comment:
```suggestion
- name: fixed_size_list
```
##########
python/pyarrow/tests/test_substrait.py:
##########
@@ -944,6 +944,31 @@ def test_serializing_expressions(expr):
assert "test_expr" in returned.expressions
+def test_arrow_specific_types():
+ schema = pa.schema(
+ [
+ pa.field("time_nanos", pa.time64("ns")),
+ pa.field("date_millis", pa.date64()),
+ pa.field("large_string", pa.large_string()),
+ pa.field("large_binary", pa.large_binary()),
Review Comment:
```suggestion
pa.field("large_binary", pa.large_binary()),
pa.field("string_view", pa.string_view()),
pa.field("binary_view", pa.binary_view()),
```
##########
cpp/src/arrow/engine/substrait/extension_set.cc:
##########
@@ -1043,6 +1049,10 @@ struct DefaultExtensionIdRegistry :
ExtensionIdRegistryImpl {
TypeName{uint32(), "u32"},
Review Comment:
Would you mind fixing the comment above?
`substrait/format/extension_types.yaml -> format/substrait/extension_types.yaml`
##########
python/pyarrow/tests/test_substrait.py:
##########
@@ -944,6 +944,31 @@ def test_serializing_expressions(expr):
assert "test_expr" in returned.expressions
+def test_arrow_specific_types():
+ schema = pa.schema(
+ [
+ pa.field("time_nanos", pa.time64("ns")),
+ pa.field("date_millis", pa.date64()),
+ pa.field("large_string", pa.large_string()),
+ pa.field("large_binary", pa.large_binary()),
+ ]
+ )
+
+ def check_round_trip(expr):
+ buf = pa.substrait.serialize_expressions([expr], ["test_expr"], schema)
+ returned = pa.substrait.deserialize_expressions(buf)
+ assert schema == returned.schema
+
+ check_round_trip(pc.field("large_string") == "test_string")
+ check_round_trip(pc.field("large_binary") == "test_string")
+ # Arrow-cpp supports round tripping these types but pyarrow doesn't support
+ # constructing literals of these types
+ #
+ # So best we can do is verify field references work
+ check_round_trip(pc.field("time_nanos"))
Review Comment:
I think using `pyarrow.scalar` would work
```suggestion
check_round_trip(pc.field("time_nanos") == pa.scalar(3,
type=pa.time64("ns")))
```
This could be made generic with
```python
fields = {
"large_string": (pa.large_string(), "test_string"),
# ...
}
schema = pa.schema([pa.field(name, typ) for name, (typ, _) in
fields.items()])
for name, (typ, val) in fields:
check_round_trip(pc.field(name) == pa.scalar(val, type=typ))
```
##########
cpp/src/arrow/engine/substrait/type_internal.cc:
##########
@@ -379,10 +455,13 @@ struct DataTypeToProtoImpl {
return NotImplemented(t);
}
+ // TODO(weston) support parameterized UDT
Review Comment:
```suggestion
// TODO(GH-$PLZ) support parameterized UDT
```
##########
cpp/src/arrow/engine/substrait/extension_set.cc:
##########
@@ -314,6 +314,12 @@ Result<uint32_t> ExtensionSet::EncodeType(const DataType&
type) {
return Status::KeyError("type ", type.ToString(), " not found in the
registry");
}
+Result<uint32_t> ExtensionSet::EncodeTypeId(Id type_id) {
+ RETURN_NOT_OK(this->AddUri(type_id));
+ auto it_success = types_map_.emplace(type_id,
static_cast<uint32_t>(types_map_.size()));
+ return it_success.first->second;
Review Comment:
```suggestion
auto [it, success] = types_map_.emplace(type_id,
static_cast<uint32_t>(types_map_.size()));
return it->second;
```
##########
cpp/src/arrow/engine/substrait/extension_set.cc:
##########
@@ -1043,6 +1049,10 @@ struct DefaultExtensionIdRegistry :
ExtensionIdRegistryImpl {
TypeName{uint32(), "u32"},
TypeName{uint64(), "u64"},
TypeName{float16(), "fp16"},
+ TypeName{large_utf8(), "large_string"},
+ TypeName{large_binary(), "large_binary"},
+ TypeName{date64(), "date_millis"},
+ TypeName{time64(TimeUnit::NANO), "time_nanos"},
Review Comment:
It is not obvious to me that we can't add time32:
```suggestion
TypeName{time32(TimeUnit::SECONDS), "time_seconds"},
```
##########
cpp/src/arrow/engine/substrait/expression_internal.cc:
##########
@@ -421,6 +427,65 @@ Result<compute::Expression> FromProto(const
substrait::Expression& expr,
expr.DebugString());
}
+namespace {
+struct UserDefinedLiteralToArrow {
+ Status Visit(const DataType& type) {
+ return Status::NotImplemented("User defined literals of type ", type);
+ }
+ Status Visit(const IntegerType& type) {
+ google::protobuf::UInt64Value value;
+ if (!user_defined_->value().UnpackTo(&value)) {
+ return Status::Invalid("Failed to unpack user defined integer literal to
uint64");
+ }
+ ARROW_ASSIGN_OR_RAISE(scalar_, MakeScalar(type.GetSharedPtr(),
value.value()));
+ return Status::OK();
+ }
+ Status Visit(const Date64Type& type) {
+ google::protobuf::UInt64Value value;
Review Comment:
It seems that this ought to be signed
##########
cpp/thirdparty/versions.txt:
##########
@@ -103,8 +103,8 @@ ARROW_RE2_BUILD_VERSION=2022-06-01
ARROW_RE2_BUILD_SHA256_CHECKSUM=f89c61410a072e5cbcf8c27e3a778da7d6fd2f2b5b1445cd4f4508bee946ab0f
ARROW_SNAPPY_BUILD_VERSION=1.1.10
ARROW_SNAPPY_BUILD_SHA256_CHECKSUM=49d831bffcc5f3d01482340fe5af59852ca2fe76c3e05df0e67203ebbe0f1d90
-ARROW_SUBSTRAIT_BUILD_VERSION=v0.27.0
-ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=4ed375f69d972a57fdc5ec406c17003a111831d8640d3f1733eccd4b3ff45628
+ARROW_SUBSTRAIT_BUILD_VERSION=v0.44.0
Review Comment:
Are there any extra crossbow jobs we should trigger to check this version
bump?
##########
format/substrait/extension_types.yaml:
##########
@@ -42,29 +42,48 @@
# (but that is an infinite space). Similarly, we would have to declare a
# timestamp variation for all possible timezone strings.
-type_variations:
- - parent: i8
- name: u8
- description: an unsigned 8 bit integer
- functions: SEPARATE
- - parent: i16
- name: u16
- description: an unsigned 16 bit integer
- functions: SEPARATE
- - parent: i32
- name: u32
- description: an unsigned 32 bit integer
- functions: SEPARATE
- - parent: i64
- name: u64
- description: an unsigned 64 bit integer
- functions: SEPARATE
+# Certain Arrow data types are, from Substrait's point of view, encodings.
+# These include dictionary, the view types (e.g. binary view, list view),
+# and REE.
+#
+# These types are not logically distinct from the type they are encoding.
+# Specifically:
+# * There is no value in the decoded type that cannot be represented
+# in the encoded type and vice versa.
+# * Functions have the same meaning when applied to the encoded type
+#
+# These types will never have a Substrait equivalent. In the Substrait point
+# of view these are execution details.
+
+# The following types are encodings:
+
+# binary_view
+# list_view
+# dictionary
+# ree
- - parent: i16
- name: fp16
- description: a 16 bit floating point number
- functions: SEPARATE
+# Arrow-cpp's Substrait serde does not yet handle parameterized UDFs. This
means
+# the following types are not yet supported but may be supported in the future.
+# We define them below in case other implementations support them in the
meantime.
+# decimal256
+# large_list
+# fixed_size_list
+# duration
+# time32 - not technically a parameterized type, but unsupported for similar
reasons
+
+# Other types are not encodings, but are not first-class in Substrait. These
+# types are often similar to existing Substrait types but define a different
range
+# of values. For example, unsigned integer types are very similar to their
integer
+# counterparts, but have a different range of values. These types are defined
here
+# as extension types.
+#
Review Comment:
Could you explain why large_binary should be an extension type but
binary_view should only be an encoding? I think it'd provide a useful guide for
future authors who need to pick where to put a type
##########
cpp/src/arrow/engine/substrait/extension_set.h:
##########
@@ -408,6 +412,13 @@ class ARROW_ENGINE_EXPORT ExtensionSet {
/// \return An anchor that can be used to refer to the type within a plan
Result<uint32_t> EncodeType(const DataType& type);
+ /// \brief Lookup the anchor for a given type alias
+ ///
+ /// Similar to \see EncodeType but this is used for cases where the data
type is either
+ /// parameterized or custom in some way (e.g. we use this for
Time64::Nanos). We need
+ /// to use the Id directly since we can't have registered the type with the
registry.
+ Result<uint32_t> EncodeTypeId(Id type_id);
Review Comment:
AFAICT this isn't used anywhere. If let's wait to add it until it's necessary
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]