OwenSanzas opened a new issue, #50231:
URL: https://github.com/apache/arrow/issues/50231

   
   ## Summary
   
   A 2-byte Substrait protobuf (`12 00`) fed to the public
   `arrow::engine::DeserializePlan` API aborts the process. A
   `SimpleExtensionDeclaration` whose `mapping_type` oneof is left unset falls
   through to `default: Unreachable()` in `GetExtensionSetFromMessage`, which 
is an
   unconditional `std::abort()`. The same root cause is reachable through
   `arrow::engine::DeserializeExpressions` (via the `ExtendedExpression` 
entry). Any
   service that deserializes an untrusted Substrait plan/expression can be 
crashed
   (denial of service) by a trivially malformed input.
   
   Tested at pinned commit `16fe34250a2ef261790b9cc414fdf0831669cf9f`
   (25.0.0-SNAPSHOT).
   
   ## Root Cause
   
   `GetExtensionSetFromMessage<T>` switches on the
   `SimpleExtensionDeclaration.mapping_type_case()` oneof selector to pick which
   extension kind the declaration carries. It handles the three defined arms
   (`kExtensionTypeVariation`, `kExtensionType`, `kExtensionFunction`) and then
   falls into a `default:` arm that calls `arrow::Unreachable()`. But the oneof 
also
   has the *unset* state `MAPPING_TYPE_NOT_SET` (the protobuf default when a 
producer
   simply omits the field), which is attacker-reachable and lands on exactly 
that
   `default:` arm. `Unreachable()` is a hard `std::abort()` — the `DCHECK` is a
   no-op under `NDEBUG` (release builds), so the only effect on untrusted input 
is a
   process abort instead of a non-OK `Status`.
   
   Vulnerable code (`cpp/src/arrow/engine/substrait/util_internal.h:65-86`):
   
   ```cpp
     std::unordered_map<uint32_t, Id> type_ids, function_ids;
     for (const auto& ext : message.extensions()) {
       switch (ext.mapping_type_case()) {
         case 
substrait::extensions::SimpleExtensionDeclaration::kExtensionTypeVariation: {
           return Status::NotImplemented("Type Variations are not yet 
implemented");
         }
         case 
substrait::extensions::SimpleExtensionDeclaration::kExtensionType: {
           const auto& type = ext.extension_type();
           std::string_view uri = uris[type.extension_uri_reference()];
           type_ids[type.type_anchor()] = Id{uri, type.name()};
           break;
         }
         case 
substrait::extensions::SimpleExtensionDeclaration::kExtensionFunction: {
           const auto& fn = ext.extension_function();
           std::string_view uri = uris[fn.extension_uri_reference()];
           function_ids[fn.function_anchor()] = Id{uri, fn.name()};
           break;
         }
         default:
           Unreachable();   // <-- MAPPING_TYPE_NOT_SET reaches here -> 
std::abort()
       }
     }
   ```
   
   `Unreachable()` is an unconditional abort 
(`cpp/src/arrow/util/unreachable.cc:26-29`):
   
   ```cpp
   [[noreturn]] void Unreachable(const char* message) {
     DCHECK(false) << message;   // no-op under NDEBUG
     std::abort();
   }
   ```
   
   Two public entries reach the same site:
   
   - `DeserializePlan` (`serde.cc:217`) -> `GetExtensionSetFromPlan`
     (`plan_internal.cc:53`) -> `GetExtensionSetFromMessage<substrait::Plan>`.
   - `DeserializeExpressions` (`serde.cc:251`) -> `FromProto`
     (`extended_expression_internal.cc:163`) -> 
`GetExtensionSetFromExtendedExpression`
     (`extended_expression_internal.cc:38`) ->
     `GetExtensionSetFromMessage<substrait::ExtendedExpression>`.
   
   ## PoC
   
   2 bytes — a `Plan` with one empty `SimpleExtensionDeclaration` (field 2,
   length-delimited, zero length). The empty declaration leaves `mapping_type`
   unset, hitting `default: Unreachable()`.
   
   ```
   12 00
   ```
   
   ```python
   # generate_poc.py — re-create the crash input from bytes
   poc = bytes([0x12, 0x00])
   open("poc.bin", "wb").write(poc)
   ```
   
   Crash input size: 2 bytes (`poc/poc.bin`, md5 
`5a68de997d60afa9083b17fe00f7cdf2`).
   The same root cause is also reachable via `DeserializeExpressions` with an
   `ExtendedExpression` carrying an empty `SimpleExtensionDeclaration`.
   
   ## Reproduction
   
   Build Arrow C++ from source with `-DARROW_SUBSTRAIT=ON` and 
AddressSanitizer, then deserialize the
   attached 2-byte Substrait plan through the public API:
   
   ```cpp
   #include <arrow/engine/substrait/serde.h>
   // buf = {0x12, 0x00};
   auto result = arrow::engine::DeserializePlan(*buf);   // std::abort() here
   ```
   
   A `SimpleExtensionDeclaration` with an unset `mapping_type` oneof (the 
protobuf default) falls through
   `switch (ext.mapping_type_case()) { ... default: Unreachable(); }`, and 
`arrow::Unreachable()` calls
   `std::abort()` (silent SIGABRT, exit 134):
   
   ```
   GetExtensionSetFromMessage<...>  util_internal.h
   DeserializePlan                  serde.cc
   ```
   
   The `default: Unreachable()` arm is still present in current `master`
   (`cpp/src/arrow/engine/substrait/util_internal.h:82`). PoC: 2 bytes `12 00`.
   
   ## Suggested Fix
   
   Replace the `default: Unreachable()` arm with a returned non-OK `Status` so 
the
   unset/unknown oneof is rejected gracefully instead of aborting:
   
   ```diff
   --- a/cpp/src/arrow/engine/substrait/util_internal.h
   +++ b/cpp/src/arrow/engine/substrait/util_internal.h
   @@
   -      default:
   -        Unreachable();
   +      default:
   +        return Status::Invalid(
   +            "Invalid Substrait SimpleExtensionDeclaration: mapping_type is 
unset "
   +            "or unknown (case ",
   +            static_cast<int>(ext.mapping_type_case()), ")");
        }
   ```
   
   `GetExtensionSetFromMessage` already returns `Result<ExtensionSet>`, so the
   `Status::Invalid` propagates cleanly back through `DeserializePlan` /
   `DeserializeExpressions` as a non-OK result.
   
   ## PoC bytes (self-contained)
   
   The trigger input is **2 bytes** (`poc/poc.bin`).
   Recreate it exactly with:
   
   ```bash
   base64 -d > poc.bin <<'B64'
   EgA=
   B64
   ```
   
   Hex: `1200`
   
   ## Credit
   
   Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, 
Guido Vranken).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to