OwenSanzas opened a new issue, #50231:
URL: https://github.com/apache/arrow/issues/50231
## Summary
A 2-byte Substrait protobuf (`12 00`) fed to the public
`arrow::engine::DeserializePlan` API aborts the process. A
`SimpleExtensionDeclaration` whose `mapping_type` oneof is left unset falls
through to `default: Unreachable()` in `GetExtensionSetFromMessage`, which
is an
unconditional `std::abort()`. The same root cause is reachable through
`arrow::engine::DeserializeExpressions` (via the `ExtendedExpression`
entry). Any
service that deserializes an untrusted Substrait plan/expression can be
crashed
(denial of service) by a trivially malformed input.
Tested at pinned commit `16fe34250a2ef261790b9cc414fdf0831669cf9f`
(25.0.0-SNAPSHOT).
## Root Cause
`GetExtensionSetFromMessage<T>` switches on the
`SimpleExtensionDeclaration.mapping_type_case()` oneof selector to pick which
extension kind the declaration carries. It handles the three defined arms
(`kExtensionTypeVariation`, `kExtensionType`, `kExtensionFunction`) and then
falls into a `default:` arm that calls `arrow::Unreachable()`. But the oneof
also
has the *unset* state `MAPPING_TYPE_NOT_SET` (the protobuf default when a
producer
simply omits the field), which is attacker-reachable and lands on exactly
that
`default:` arm. `Unreachable()` is a hard `std::abort()` — the `DCHECK` is a
no-op under `NDEBUG` (release builds), so the only effect on untrusted input
is a
process abort instead of a non-OK `Status`.
Vulnerable code (`cpp/src/arrow/engine/substrait/util_internal.h:65-86`):
```cpp
std::unordered_map<uint32_t, Id> type_ids, function_ids;
for (const auto& ext : message.extensions()) {
switch (ext.mapping_type_case()) {
case
substrait::extensions::SimpleExtensionDeclaration::kExtensionTypeVariation: {
return Status::NotImplemented("Type Variations are not yet
implemented");
}
case
substrait::extensions::SimpleExtensionDeclaration::kExtensionType: {
const auto& type = ext.extension_type();
std::string_view uri = uris[type.extension_uri_reference()];
type_ids[type.type_anchor()] = Id{uri, type.name()};
break;
}
case
substrait::extensions::SimpleExtensionDeclaration::kExtensionFunction: {
const auto& fn = ext.extension_function();
std::string_view uri = uris[fn.extension_uri_reference()];
function_ids[fn.function_anchor()] = Id{uri, fn.name()};
break;
}
default:
Unreachable(); // <-- MAPPING_TYPE_NOT_SET reaches here ->
std::abort()
}
}
```
`Unreachable()` is an unconditional abort
(`cpp/src/arrow/util/unreachable.cc:26-29`):
```cpp
[[noreturn]] void Unreachable(const char* message) {
DCHECK(false) << message; // no-op under NDEBUG
std::abort();
}
```
Two public entries reach the same site:
- `DeserializePlan` (`serde.cc:217`) -> `GetExtensionSetFromPlan`
(`plan_internal.cc:53`) -> `GetExtensionSetFromMessage<substrait::Plan>`.
- `DeserializeExpressions` (`serde.cc:251`) -> `FromProto`
(`extended_expression_internal.cc:163`) ->
`GetExtensionSetFromExtendedExpression`
(`extended_expression_internal.cc:38`) ->
`GetExtensionSetFromMessage<substrait::ExtendedExpression>`.
## PoC
2 bytes — a `Plan` with one empty `SimpleExtensionDeclaration` (field 2,
length-delimited, zero length). The empty declaration leaves `mapping_type`
unset, hitting `default: Unreachable()`.
```
12 00
```
```python
# generate_poc.py — re-create the crash input from bytes
poc = bytes([0x12, 0x00])
open("poc.bin", "wb").write(poc)
```
Crash input size: 2 bytes (`poc/poc.bin`, md5
`5a68de997d60afa9083b17fe00f7cdf2`).
The same root cause is also reachable via `DeserializeExpressions` with an
`ExtendedExpression` carrying an empty `SimpleExtensionDeclaration`.
## Reproduction
Build Arrow C++ from source with `-DARROW_SUBSTRAIT=ON` and
AddressSanitizer, then deserialize the
attached 2-byte Substrait plan through the public API:
```cpp
#include <arrow/engine/substrait/serde.h>
// buf = {0x12, 0x00};
auto result = arrow::engine::DeserializePlan(*buf); // std::abort() here
```
A `SimpleExtensionDeclaration` with an unset `mapping_type` oneof (the
protobuf default) falls through
`switch (ext.mapping_type_case()) { ... default: Unreachable(); }`, and
`arrow::Unreachable()` calls
`std::abort()` (silent SIGABRT, exit 134):
```
GetExtensionSetFromMessage<...> util_internal.h
DeserializePlan serde.cc
```
The `default: Unreachable()` arm is still present in current `master`
(`cpp/src/arrow/engine/substrait/util_internal.h:82`). PoC: 2 bytes `12 00`.
## Suggested Fix
Replace the `default: Unreachable()` arm with a returned non-OK `Status` so
the
unset/unknown oneof is rejected gracefully instead of aborting:
```diff
--- a/cpp/src/arrow/engine/substrait/util_internal.h
+++ b/cpp/src/arrow/engine/substrait/util_internal.h
@@
- default:
- Unreachable();
+ default:
+ return Status::Invalid(
+ "Invalid Substrait SimpleExtensionDeclaration: mapping_type is
unset "
+ "or unknown (case ",
+ static_cast<int>(ext.mapping_type_case()), ")");
}
```
`GetExtensionSetFromMessage` already returns `Result<ExtensionSet>`, so the
`Status::Invalid` propagates cleanly back through `DeserializePlan` /
`DeserializeExpressions` as a non-OK result.
## PoC bytes (self-contained)
The trigger input is **2 bytes** (`poc/poc.bin`).
Recreate it exactly with:
```bash
base64 -d > poc.bin <<'B64'
EgA=
B64
```
Hex: `1200`
## Credit
Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra,
Guido Vranken).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]