wgtmac commented on code in PR #50025:
URL: https://github.com/apache/arrow/pull/50025#discussion_r3294219441
##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
const std::shared_ptr<const LogicalType>&
logical_type) {
+ // Statistics blobs come from the file's Thrift-encoded metadata and may have
+ // arbitrary length under a malicious writer. Reject any value that is
shorter
+ // than what the physical (and, for FLOAT16, logical) type requires so the
+ // memcpy/byte loads below cannot run past the buffer.
+ size_t required = 0;
+ switch (parquet_type) {
+ case Type::BOOLEAN:
+ required = sizeof(bool);
+ break;
+ case Type::INT32:
+ case Type::FLOAT:
+ required = 4;
+ break;
+ case Type::INT64:
+ case Type::DOUBLE:
+ required = 8;
+ break;
+ case Type::INT96:
+ required = 3 * sizeof(int32_t);
+ break;
+ case Type::BYTE_ARRAY:
+ case Type::FIXED_LEN_BYTE_ARRAY:
+ if (logical_type != nullptr && logical_type->is_float16()) {
+ required = 2;
+ }
+ break;
+ case Type::UNDEFINED:
+ default:
+ break;
+ }
+ if (val.size() < required) {
Review Comment:
I'm a little hesitant to complicate `FormatStatValue` by introducing this
check. `FormatStatValue` is usually called in the printer when we want to print
some information of a parquet file for debugging purpuse.
##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
const std::shared_ptr<const LogicalType>&
logical_type) {
+ // Statistics blobs come from the file's Thrift-encoded metadata and may have
+ // arbitrary length under a malicious writer. Reject any value that is
shorter
+ // than what the physical (and, for FLOAT16, logical) type requires so the
+ // memcpy/byte loads below cannot run past the buffer.
+ size_t required = 0;
+ switch (parquet_type) {
+ case Type::BOOLEAN:
+ required = sizeof(bool);
+ break;
+ case Type::INT32:
+ case Type::FLOAT:
+ required = 4;
+ break;
+ case Type::INT64:
+ case Type::DOUBLE:
+ required = 8;
+ break;
+ case Type::INT96:
+ required = 3 * sizeof(int32_t);
+ break;
+ case Type::BYTE_ARRAY:
Review Comment:
It is better to put `BYTE_ARRAY` and `UNDEFINED` together since they are
skipped. In case you don't know, `BYTE_ARRAY` cannot have `float16` logical
type.
##########
cpp/src/parquet/types.cc:
##########
@@ -184,6 +184,43 @@ std::string FormatFloat16Value(::std::string_view val) {
std::string FormatStatValue(Type::type parquet_type, ::std::string_view val,
const std::shared_ptr<const LogicalType>&
logical_type) {
+ // Statistics blobs come from the file's Thrift-encoded metadata and may have
+ // arbitrary length under a malicious writer. Reject any value that is
shorter
+ // than what the physical (and, for FLOAT16, logical) type requires so the
+ // memcpy/byte loads below cannot run past the buffer.
+ size_t required = 0;
+ switch (parquet_type) {
+ case Type::BOOLEAN:
+ required = sizeof(bool);
+ break;
+ case Type::INT32:
+ case Type::FLOAT:
+ required = 4;
+ break;
+ case Type::INT64:
+ case Type::DOUBLE:
+ required = 8;
+ break;
+ case Type::INT96:
+ required = 3 * sizeof(int32_t);
+ break;
+ case Type::BYTE_ARRAY:
+ case Type::FIXED_LEN_BYTE_ARRAY:
+ if (logical_type != nullptr && logical_type->is_float16()) {
Review Comment:
Why not dealing with the length of `FIXED_LEN_BYTE_ARRAY`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]