suaron opened a new issue, #49956:
URL: https://github.com/apache/arrow/issues/49956

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   ## Summary
   
   `garrow_data_type_new_raw` in arrow-glib applies an unchecked `static_pointer_cast` from any `arrow::ExtensionType` to `garrow::GExtensionType`. When the underlying type is an ExtensionType subclass not defined by arrow-glib (e.g. `arrow::extension::OpaqueType`), the cast compiles, but at runtime `garrow_data_type()` reads unrelated memory and returns a value that is not a valid pointer. The subsequent `g_object_ref` segfaults.
   
   The most common real-world trigger: **PostgreSQL `NUMERIC` columns read via 
the ADBC PostgreSQL driver from any arrow-glib language binding** (red-arrow / 
Ruby, R-arrow, C# arrow-glib, etc.).
   
   ## Related Issues
   
   - [`apache/arrow-adbc#1515`](https://github.com/apache/arrow-adbc/issues/1515): closed in ADBC 0.10.0; introduced the `arrow.opaque` ExtensionType wrapper for NUMERIC in the PG driver. This is what triggers the GLib crash.
   - [`apache/arrow-adbc#1513`](https://github.com/apache/arrow-adbc/issues/1513): open; optional NUMERIC → double conversion. Unrelated to the crash but in the same code path.
   - [`red-data-tools/activerecord-adbc-adapter#7`](https://github.com/red-data-tools/activerecord-adbc-adapter/issues/7): downstream Ruby gem blocked on this issue for Postgres NUMERIC support.
   
   ## What happened?
   
   A trivial `SELECT 12.34::NUMERIC(10,2)` through the ADBC PostgreSQL driver 
via red-arrow segfaults at `Schema#fields`.
   
   ```
   gobject-introspection-4.3.6/loader.rb:722:
     [BUG] Segmentation fault at 0x00636972656d756e
   ruby 3.4.9 (2026-03-11 revision 76cca827ab) +PRISM [arm64-darwin25]
   
   -- Machine register context --
     x0:  0x00636972656d756e   <-- ASCII "numeric\0" little-endian
     x19: 0x00636972656d756e
     x22: 0x00636972656d756e
   
   -- C level backtrace --
   libgobject-2.0.dylib   g_object_ref +0x1c
   libarrow-glib.dylib    garrow_data_type_new_raw +0x12c
   libarrow-glib.dylib    garrow_field_new_raw
   libarrow-glib.dylib    garrow_schema_get_fields
   libffi.dylib           ffi_call_SYSV
   ```
   
   `0x00636972656d756e` decoded little-endian = bytes `6e 75 6d 65 72 69 63 00` 
= ASCII `"numeric\0"`. The bytes of the extension type name are being 
dereferenced as a `GObject *`.
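
As a quick sanity check of that decoding, here is a minimal C++ sketch. The helper name `decode_little_endian_ascii` is made up for illustration and is not part of Arrow or arrow-glib. It reads a 64-bit value as eight little-endian bytes and collects the characters before the first NUL:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not an Arrow API): interpret a 64-bit value as eight
// little-endian bytes and return the characters before the first NUL byte.
inline std::string decode_little_endian_ascii(std::uint64_t value) {
  std::string text;
  for (int i = 0; i < 8; ++i) {
    // Byte i of a little-endian value is bits [8*i, 8*i + 7].
    char byte = static_cast<char>((value >> (8 * i)) & 0xff);
    if (byte == '\0') {
      break;
    }
    text.push_back(byte);
  }
  return text;
}
```

Passing the faulting address `0x00636972656d756eULL` to this helper yields `numeric`, which matches the `type_name` of the opaque extension type returned by the driver.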
   
   ## What did you expect to happen?
   
   Reading `Schema#fields` on a table containing a NUMERIC column should return 
a list of fields, with the NUMERIC field's data type wrapped in either:
   
   - A generic `GArrowExtensionDataType` (acceptable, no crash), or
   - A dedicated `GArrowOpaqueDataType` exposing `extension_name`, 
`storage_type`, `vendor_name`, `type_name` (ideal).
   
   Either way: **no segfault**.
   
   ## Minimal Reproducible Example
   
   ### Ruby (red-arrow + red-adbc) — reproduces the crash
   
   ```ruby
   require "adbc"
   
   db = ADBC::Database.open(driver: "adbc_driver_postgresql",
                            uri: "postgresql:///postgres")
   conn = db.connect
   conn.open_statement do |stmt|
     stmt.sql_query = "SELECT 12.34::NUMERIC(10,2) AS amount"
     reader, _ = stmt.execute
     table = reader.read_all
     table.schema.fields    # <-- segfault here
   end
   ```
   
   No table required; an inline NUMERIC literal is sufficient.
   
   ### Python (verifies what the driver returns, no crash)
   
   ```python
   import adbc_driver_postgresql.dbapi as adbc
   import pyarrow as pa
   
   con = adbc.connect("postgresql:///postgres")
   cur = con.cursor()
   cur.execute("SELECT 12.34::NUMERIC(10,2) AS amount")
   table = cur.fetch_arrow_table()
   print(table.schema.field(0).type)
   # extension<arrow.opaque[storage_type=string,
   #                       type_name=numeric,
   #                       vendor_name=PostgreSQL]>
   print(table.schema.field(0).metadata)
   # {b'ADBC:postgresql:typname': b'numeric'}
   print(table.to_pylist())
   # [{'amount': '12.34'}]
   ```
   
   The driver returns NUMERIC as `arrow::extension::OpaqueType` over 
`StringDataType` storage. Type id = 31 (`arrow::Type::EXTENSION`).
   
   ### Real-World Impact
   
   Any arrow-glib language binding (red-arrow, R-arrow, C#, etc.) reading 
PostgreSQL data via ADBC crashes on the first NUMERIC column. The crash happens 
inside read-only schema inspection — there is no way for downstream code to 
defend against it.
   
   Concrete impact: blocks **all** Postgres NUMERIC support in 
`red-data-tools/activerecord-adbc-adapter` (Rails ↔ ADBC integration). Any 
Rails app with a `:decimal` column on Postgres cannot use this adapter.
   
   The same bug applies to **any** canonical Arrow extension type constructed 
in C++:
   
   | Extension type | Defined in | Crashes via arrow-glib? |
   |---|---|---|
   | `arrow::extension::OpaqueType` | `arrow/extension/opaque.h` | yes |
   | `arrow::extension::UuidType` | `arrow/extension/uuid.h` | yes (untested but same code path) |
   | `arrow::extension::FixedShapeTensorType` | `arrow/extension/fixed_shape_tensor.h` | yes (same) |
   | `arrow::extension::Bool8Type`, JSON, etc. | various | yes (same) |
   | Glib-registered custom extension via `garrow_extension_data_type_registry_register` | user code | no (the only correctly handled case today) |
   
   ## The Semantic Issue
   
   `c_glib/arrow-glib/basic-data-type.cpp` (excerpt from `apache-arrow-24.0.0`; the same code is on `main`):
   
   ```cpp
   GArrowDataType *
   garrow_data_type_new_raw(std::shared_ptr<arrow::DataType> *arrow_data_type)
   {
     GType type;
     GArrowDataType *data_type;
   
     switch ((*arrow_data_type)->id()) {
     /* ... non-extension cases ... */
     case arrow::Type::type::EXTENSION:
       {
         auto g_extension_data_type =
           std::static_pointer_cast<garrow::GExtensionType>(*arrow_data_type);
         if (g_extension_data_type) {
           auto garrow_data_type = g_extension_data_type->garrow_data_type();
           g_object_ref(garrow_data_type);     /* <-- segfault here */
           return GARROW_DATA_TYPE(garrow_data_type);
         }
       }
       type = GARROW_TYPE_EXTENSION_DATA_TYPE;
       break;
     /* ... */
     }
     data_type =
       GARROW_DATA_TYPE(g_object_new(type, "data-type", arrow_data_type, NULL));
     return data_type;
   }
   ```
   
   ### Why it crashes
   
   1. `arrow::Type::EXTENSION` is the type id for **any** subclass of 
`arrow::ExtensionType`, not just `garrow::GExtensionType`.
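   2. `std::static_pointer_cast` performs **no runtime type check**. It converts the `shared_ptr` regardless of the dynamic type, so the `if (g_extension_data_type)` guard is true for every non-null extension type and the generic fallback below it is never reached.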
   3. `garrow::GExtensionType` (defined in `basic-data-type.cpp`) layout: 
extends `arrow::ExtensionType` with one extra member `GArrowDataType 
*garrow_data_type_`.
   4. `arrow::extension::OpaqueType` (`arrow/extension/opaque.h`) layout: 
extends `arrow::ExtensionType` with `std::string type_name_` and `std::string 
vendor_name_`.
   5. When the dynamic type is `OpaqueType` but we cast to `GExtensionType` and 
call `garrow_data_type()`, we read bytes from the **first member of 
OpaqueType** (`type_name_`), not `garrow_data_type_`.
   6. `std::string` small-string optimization (SSO) stores short strings inline in the string object itself. `"numeric"` is 7 chars + NUL, so it fits. In this build, the first 8 bytes of the `std::string` object hold the ASCII bytes `n u m e r i c \0`, which matches the register dump.
   7. Those bytes are read as a `GArrowDataType *` pointer = address 
`0x00636972656d756e`.
   8. `g_object_ref` dereferences → segfault.
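
The cast behavior in steps 1 to 5 can be illustrated without Arrow at all. The sketch below uses hypothetical stand-in types: `ExtensionBase`, `GlibExtension`, `OpaqueLike`, and the helper `checked_wrapper` are all invented here and only mimic the shape of the real classes. It deliberately never executes the mismatched `static_pointer_cast`, since that is undefined behavior. It only shows that the compiler accepts the cast, and that `dynamic_pointer_cast` reports the mismatch as a null pointer:

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins; none of these are Arrow or arrow-glib types.
struct ExtensionBase {
  virtual ~ExtensionBase() = default;  // polymorphic, so dynamic_cast works
};

// Plays the role of garrow::GExtensionType: one extra pointer member.
struct GlibExtension : ExtensionBase {
  explicit GlibExtension(void *wrapper) : wrapper_(wrapper) {}
  void *wrapper() const { return wrapper_; }

private:
  void *wrapper_;
};

// Plays the role of arrow::extension::OpaqueType: string members instead.
struct OpaqueLike : ExtensionBase {
  std::string type_name_ = "numeric";
  std::string vendor_name_ = "PostgreSQL";
};

// The static cast type-checks for any ExtensionBase pointer. decltype is an
// unevaluated context, so no cast is actually performed here.
static_assert(
  std::is_same<decltype(std::static_pointer_cast<GlibExtension>(
                 std::declval<std::shared_ptr<ExtensionBase>>())),
               std::shared_ptr<GlibExtension>>::value,
  "static_pointer_cast compiles regardless of the dynamic type");

// Returns the wrapper only when the dynamic type really is GlibExtension;
// any other subclass yields nullptr instead of a garbage pointer.
inline void *checked_wrapper(const std::shared_ptr<ExtensionBase> &type) {
  auto glib = std::dynamic_pointer_cast<GlibExtension>(type);
  return glib ? glib->wrapper() : nullptr;
}
```

With these stand-ins, `checked_wrapper` returns the stored pointer for a `GlibExtension` and `nullptr` for an `OpaqueLike`, which is the distinction the `if` guard in `garrow_data_type_new_raw` is meant to make.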
   
   ### Why this hasn't been reported earlier
   
   The bug requires the intersection of:
   
   1. arrow-glib language bindings (Ruby/R/C#/etc., **not** pyarrow), AND
   2. data containing a canonical Arrow ExtensionType, AND
   3. a code path that reads the schema (`Schema#fields`, `Field#data_type`).
   
   Most ADBC-PG users today use Python, where pyarrow handles ExtensionType natively and never invokes arrow-glib. The bug surfaces precisely when red-arrow / R-arrow / etc. consume ADBC-PG output with NUMERIC columns. The ADBC PG driver has wrapped NUMERIC in `arrow.opaque` only since 0.10.0 (2024), which is relatively recent.
   
   ## Current Workaround
   
   For ADBC-PG users:
   
   1. Cast NUMERIC to a non-extension type in SQL: `SELECT col::DOUBLE 
PRECISION FROM ...` or `SELECT col::TEXT FROM ...`. Avoids the OpaqueType 
wrapper.
   2. Skip schema inspection on tables that may contain ExtensionType.
   3. Switch backend (DuckDB ADBC works correctly — returns native 
`Decimal128DataType`, no ExtensionType).
   
   None of these are acceptable defaults for a generic ORM adapter.
   
   ## Proposed Solution
   
   **Option 1 (Preferred):** Replace `static_pointer_cast` with 
`dynamic_pointer_cast` in the EXTENSION switch arm.
   
   ```cpp
   case arrow::Type::type::EXTENSION:
     {
       auto g_extension_data_type =
         std::dynamic_pointer_cast<garrow::GExtensionType>(*arrow_data_type);
       if (g_extension_data_type) {
         auto garrow_data_type = g_extension_data_type->garrow_data_type();
         g_object_ref(garrow_data_type);
         return GARROW_DATA_TYPE(garrow_data_type);
       }
     }
     type = GARROW_TYPE_EXTENSION_DATA_TYPE;
     break;
   ```
   
   Effect:
   
   - GLib-registered extension types: identical behavior (the dynamic cast succeeds for `garrow::GExtensionType` instances).
   - Non-GLib ExtensionTypes (Opaque/Uuid/FixedShapeTensor/etc.): no crash; the code falls through to `GARROW_TYPE_EXTENSION_DATA_TYPE` (generic wrapper). The caller can still read `storage_type` and field metadata.
   
   Cost: one RTTI lookup per Field at Schema construction. Negligible. RTTI is already enabled in the Arrow C++ build.
   
   **Option 2 (Follow-up, not required for crash fix):** Add dedicated glib 
wrappers for canonical Arrow extension types.
   
   - `GArrowOpaqueDataType` exposing `extension_name`, `storage_type`, 
`type_name`, `vendor_name`.
   - `GArrowUuidDataType`, etc.
   
   This would let arrow-glib bindings consume canonical extensions with the 
same ergonomics as pyarrow.
   
   **Option 3 (Belt-and-braces):** Both. Option 1 to stop the crash, Option 2 
to expose canonical extensions properly.
   
   ## Environment
   
   - **OS:** macOS 15.3 (arm64, Darwin 25.3.0)
   - **arrow-glib:** 24.0.0 (Homebrew bottle)
   - **apache-arrow:** 24.0.0 (Homebrew, dependency of arrow-glib)
   - **glib:** 2.88.1
   - **adbc-driver-postgresql:** 1.11.0 (pip wheel)
   - **PostgreSQL:** 17.9 (Postgres.app)
   - **red-arrow:** 24.0.0
   - **red-adbc:** 1.11.0
   - **gobject-introspection:** 4.3.6
   - **Ruby:** 3.4.9
   
   Bug also reproduces against `apache/arrow` `main` branch — the offending 
code path is unchanged between 24.0.0 and current main (verified 2026-05-09).
   
   ### Additional Context
   
   - arrow-glib source 24.0.0 (offending function): https://github.com/apache/arrow/blob/apache-arrow-24.0.0/c_glib/arrow-glib/basic-data-type.cpp
   - arrow-glib source main: https://github.com/apache/arrow/blob/main/c_glib/arrow-glib/basic-data-type.cpp
   - Arrow Opaque canonical extension type spec: https://arrow.apache.org/docs/format/CanonicalExtensions.html#opaque
   - `arrow/extension/opaque.h`: https://github.com/apache/arrow/blob/main/cpp/src/arrow/extension/opaque.h
   
   ## Acknowledgment
   
   The narrow trigger surface (arrow-glib bindings × ADBC-PG NUMERIC × schema 
inspection) explains why this slipped through — the dominant ADBC user 
(pyarrow) never hits the arrow-glib path. But the underlying bug is general: 
arrow-glib's EXTENSION case mishandles **every** ExtensionType not derived from 
`garrow::GExtensionType`, including the official canonical extensions that 
Arrow C++ ships and that ADBC drivers are increasingly using to encode 
source-system type info. The fix is one token (`static` → `dynamic`); the 
follow-up of adding wrappers for canonical extensions can land separately.
   
   
   ### Component(s)
   
   GLib

