rootvector2 opened a new pull request, #50013:
URL: https://github.com/apache/arrow/pull/50013
### What
When reading a Parquet file, `ApplicationVersion::ApplicationVersion(const
std::string& created_by)` parses the writer-supplied `created_by` metadata
string into three integer fields (`version.major`, `version.minor`,
`version.patch`). The current implementation in `cpp/src/parquet/metadata.cc`
does this with `std::atoi`:
```cpp
application_version_.version.major = atoi(version_major_string.c_str()); //
line 1444
application_version_.version.minor = atoi(version_minor_string.c_str()); //
line 1469
application_version_.version.patch = atoi(version_patch_string.c_str()); //
line 1487
```
The substrings here are already filtered to be digit-only, but their length
is not bounded. The `created_by` field is fully attacker-controlled in any
untrusted Parquet file, so an input like
```
parquet-mr version 99999999999999999999.0.0 (build abcd)
```
reaches `atoi` with a value far outside the range of `int`. Per the C++
standard, `std::atoi` has **undefined behavior** when the result cannot be
represented as an `int` (`[c.strtoint]`). In practice the value is unspecified,
and the negative or otherwise garbage value that comes out then drives
`VersionLt` / `HasCorrectStatistics`, controlling whether the reader chooses to
trust the column statistics block of the file.
### How to observe
A small reproducer in the same C++17 mode as Arrow:
```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
int main() {
std::string s = "99999999999999999999";
std::printf("atoi -> %d\n", std::atoi(s.c_str())); // UB; observed -1
here
}
```
UBSan flags the overflow inside the underlying `strtol`. The existing tests
in `metadata_test.cc` only cover small inputs, which is why this has not been
caught.
### The fix
Replace the three `atoi` call sites with a small
`ParseUnsignedVersionComponent` helper inside the anonymous namespace in
`metadata.cc`. The helper uses `std::strtoul` (which is defined to set `errno =
ERANGE` and return `ULONG_MAX` on overflow rather than invoking UB) and
saturates to `std::numeric_limits<int>::max()` when the parsed value would not
fit in `int`. The fix is local to the callee – callers (the constructor of
`ApplicationVersion`, called from `FileMetaData`'s thrift deserializer) are
unchanged.
A new `ApplicationVersion.VersionComponentOverflow` test exercises:
* very long all-digit components (saturate to `INT_MAX`),
* exactly `INT_MAX` (representable),
* `INT_MAX + 1` (saturates).
### Build/test evidence
Built `parquet_objlib` and `parquet-internals-test` locally; all 18
`ApplicationVersion.*` tests pass, including the new overflow case:
```
[----------] 18 tests from ApplicationVersion (0 ms total)
[ PASSED ] 18 tests.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]