jmestwa-coder commented on code in PR #50025:
URL: https://github.com/apache/arrow/pull/50025#discussion_r3422000122


##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
 
 std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
                             const std::shared_ptr<const LogicalType>& 
logical_type) {
+  // Statistics blobs come from the file's Thrift-encoded metadata and may have
+  // arbitrary length under a malicious writer. Reject any value that is 
shorter
+  // than what the physical (and, for FLOAT16, logical) type requires so the
+  // memcpy/byte loads below cannot run past the buffer.
+  size_t required = 0;
+  switch (parquet_type) {
+    case Type::BOOLEAN:
+      required = sizeof(bool);
+      break;
+    case Type::INT32:
+    case Type::FLOAT:
+      required = 4;
+      break;
+    case Type::INT64:
+    case Type::DOUBLE:
+      required = 8;
+      break;
+    case Type::INT96:
+      required = 3 * sizeof(int32_t);
+      break;
+    case Type::BYTE_ARRAY:
+    case Type::FIXED_LEN_BYTE_ARRAY:
+      if (logical_type != nullptr && logical_type->is_float16()) {
+        required = 2;
+      }
+      break;
+    case Type::UNDEFINED:
+    default:
+      break;
+  }
+  if (val.size() < required) {

Review Comment:
   switched to clamping each fixed-width read to val.size() rather than the 
required-length table, so every over-reading path is covered without touching 
the public signature, and there's no throw since truncated stats are valid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to