rootvector2 opened a new pull request, #50013:
URL: https://github.com/apache/arrow/pull/50013

   ### What
   
   When reading a Parquet file, `ApplicationVersion::ApplicationVersion(const 
std::string& created_by)` parses the writer-supplied `created_by` metadata 
string into three integer fields (`version.major`, `version.minor`, 
`version.patch`). The current implementation in `cpp/src/parquet/metadata.cc` 
does this with `std::atoi`:
   
   ```cpp
   application_version_.version.major = atoi(version_major_string.c_str());  // 
line 1444
   application_version_.version.minor = atoi(version_minor_string.c_str());  // 
line 1469
   application_version_.version.patch = atoi(version_patch_string.c_str());  // 
line 1487
   ```
   
   The substrings here are already filtered to be digit-only, but their length 
is not bounded. The `created_by` field is fully attacker-controlled in any 
untrusted Parquet file, so an input like
   
   ```
   parquet-mr version 99999999999999999999.0.0 (build abcd)
   ```
   
   reaches `atoi` with a value far outside the range of `int`. Per the C++ 
standard, `std::atoi` has **undefined behavior** when the result cannot be 
represented as an `int` (`[c.strtoint]`). In practice the value is unspecified, 
and the negative or otherwise garbage value that comes out then drives 
`VersionLt` / `HasCorrectStatistics`, controlling whether the reader chooses to 
trust the column statistics block of the file.
   
   ### How to observe
   
   A small reproducer in the same C++17 mode as Arrow:
   
   ```cpp
   #include <cstdio>
   #include <cstdlib>
   #include <string>
   int main() {
       std::string s = "99999999999999999999";
       std::printf("atoi -> %d\n", std::atoi(s.c_str())); // UB; observed -1 
here
   }
   ```
   
   UBSan flags the overflow inside the underlying `strtol`. The existing tests 
in `metadata_test.cc` only cover small inputs, which is why this has not been 
caught.
   
   ### The fix
   
   Replace the three `atoi` call sites with a small 
`ParseUnsignedVersionComponent` helper inside the anonymous namespace in 
`metadata.cc`. The helper uses `std::strtoul` (which is defined to set `errno = 
ERANGE` and return `ULONG_MAX` on overflow rather than invoking UB) and 
saturates to `std::numeric_limits<int>::max()` when the parsed value would not 
fit in `int`. The fix is local to the callee – callers (the constructor of 
`ApplicationVersion`, called from `FileMetaData`'s thrift deserializer) are 
unchanged.
   
   A new `ApplicationVersion.VersionComponentOverflow` test exercises:
   
   * very long all-digit components (saturate to `INT_MAX`),
   * exactly `INT_MAX` (representable),
   * `INT_MAX + 1` (saturates).
   
   ### Build/test evidence
   
   Built `parquet_objlib` and `parquet-internals-test` locally; all 18 
`ApplicationVersion.*` tests pass, including the new overflow case:
   
   ```
   [----------] 18 tests from ApplicationVersion (0 ms total)
   [  PASSED  ] 18 tests.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to