Dandandan opened a new issue, #22187:
URL: https://github.com/apache/datafusion/issues/22187
### Describe the bug
Setting any `datafusion.runtime.*` runtime variable to a string value that
ends in a non-ASCII (multi-byte UTF-8) character causes a panic inside
`core::str` instead of a planner error. Found via libFuzzer.
### To Reproduce
```rust
use datafusion::prelude::SessionContext;
#[tokio::main]
async fn main() {
let ctx = SessionContext::new();
let _ = ctx
    .sql("SET datafusion.runtime.max_temp_directory_size = 'é'")
    .await;
}
```
Panic:
```
thread 'main' panicked at .../core/src/str/mod.rs:846:21:
end byte index 1 is not a char boundary; it is inside 'é' (bytes 0..2 of
string)
```
All five reachable runtime config keys panic the same way; each routes to one
of the three call sites listed under *Root cause*. A single invalid trailing
byte such as `0xCD` is enough to trigger it (the fuzz harness's
`String::from_utf8_lossy` turns it into the 3-byte `U+FFFD`), but a real
non-ASCII char like `é`, `Δ`, or `字` works just as well.
| SET key | minimal repro value |
|---|---|
| `memory_limit` | `'é'` |
| `max_temp_directory_size` | `'é'` |
| `metadata_cache_limit` | `'é'` |
| `list_files_cache_limit` | `'é'` |
| `list_files_cache_ttl` | `'1mé'` |
### Expected behavior
A `plan_err!` (or similar `Err`) explaining the unsupported unit / malformed
value. The public SQL API should never panic on user-supplied SQL.
### Root cause
[`datafusion/core/src/execution/context/mod.rs`](https://github.com/apache/datafusion/blob/main/datafusion/core/src/execution/context/mod.rs)
— three call sites of `split_at(s.len() - 1)` that assume the last byte is a
complete char:
```rust
// line 1259 parse_memory_limit (deprecated but pub, still has the bug)
// line 1302 parse_capacity_limit
// line 1336 parse_duration
let (number, unit) = limit.split_at(limit.len() - 1);
```
`str::split_at` requires the index to fall on a UTF-8 char boundary,
otherwise it panics. For any string ending in a non-ASCII char, `limit.len() -
1` is in the middle of a multi-byte sequence.
`parse_capacity_limit` is reached from `set_runtime_variable` for
`memory_limit`, `max_temp_directory_size`, `metadata_cache_limit`,
`list_files_cache_limit`. `parse_duration` is reached for
`list_files_cache_ttl`.
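The boundary condition is easy to check in isolation. This standalone sketch (plain std, no DataFusion) shows that for a value ending in a multi-byte char, `len() - 1` is not a char boundary, which is exactly the index the three call sites pass to `split_at`:

```rust
fn main() {
    // 'é' encodes as two bytes (0xC3 0xA9), so len() == 2 and
    // len() - 1 == 1 lands inside the multi-byte sequence.
    let limit = "é";
    assert!(!limit.is_char_boundary(limit.len() - 1));
    // limit.split_at(limit.len() - 1) would panic here, as in the report.

    // For an all-ASCII value every byte index is a char boundary,
    // which is why the existing code works on inputs like "100M".
    let ascii = "100M";
    assert!(ascii.is_char_boundary(ascii.len() - 1));
    println!("boundary checks passed");
}
```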
### Fix
Either:
- Reject non-ASCII values up front, e.g. `if !limit.is_ascii() { return
plan_err!(...) }` before the `split_at`, **or**
- Split on the last char boundary: `limit.char_indices().next_back().map(|(i, _)|
limit.split_at(i)).ok_or_else(|| ...)`, **or**
- Inspect the unit via `chars().last()` and slice safely.
The unit set is `{K, M, G, m, s}` — all ASCII — so rejecting non-ASCII
values up front is sufficient.
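A hypothetical sketch of the second option (function name and error type are illustrative, not the actual DataFusion patch; the real fix would return a `DataFusionError` via `plan_err!` rather than `String`):

```rust
// Split a value like "100M" into ("100", "M") without ever indexing
// mid-char. `char_indices().next_back()` yields the byte index of the
// *last char*, which is always a valid split point.
fn split_number_and_unit(limit: &str) -> Result<(&str, &str), String> {
    let (idx, unit_char) = limit
        .char_indices()
        .next_back()
        .ok_or_else(|| "empty capacity value".to_string())?;
    let (number, unit) = limit.split_at(idx); // cannot panic: idx is a boundary
    if !unit_char.is_ascii_alphabetic() {
        // The supported units are all ASCII, so anything else is malformed.
        return Err(format!("unsupported unit '{unit_char}'"));
    }
    Ok((number, unit))
}

fn main() {
    assert_eq!(split_number_and_unit("100M"), Ok(("100", "M")));
    // Non-ASCII trailing chars now yield an Err instead of a panic.
    assert!(split_number_and_unit("é").is_err());
    assert!(split_number_and_unit("").is_err());
    println!("safe split works");
}
```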
### Additional context
Discovered by a `cargo fuzz` target seeded with SQL from
`datafusion/sqllogictest/test_files/`. The fuzzer reached the panic during its
initial (`INITED`) pass over the seeded corpus, after a random mutation flipped
the trailing byte of a SET value to a UTF-8 continuation byte;
`String::from_utf8_lossy` in the fuzz harness replaced the now-invalid sequence
with `U+FFFD`, a 3-byte char, which is still enough to reach the bug.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]