dmitry-chirkov-dremio commented on issue #49614:
URL: https://github.com/apache/arrow/issues/49614#issuecomment-4148638537

   **Risk analysis of changing base64_decode behavior**
   
   With one exception, every caller of `arrow::util::base64_decode` in the C++ 
codebase consumes machine-generated base64 previously produced by 
`base64_encode`. These are pure round-trip scenarios that stricter validation 
would not affect:
   
   - **parquet/arrow/schema.cc** - decodes an Arrow schema serialized into 
Parquet metadata via base64_encode
   - **parquet/encryption/key_toolkit_internal.cc** - decodes encrypted key 
material produced by base64_encode
   - **parquet/encryption/file_key_unwrapper.cc** - decodes a KEK ID produced 
by base64_encode
   - **arrow/flight/flight_test.cc** - decodes base64 credentials in a test
   
   The only caller that can receive arbitrary (potentially malformed) input is 
Gandiva's `gdv_fn_base64_decode_utf8`, where this issue was discovered. It 
exposes `base64_decode` as a user-facing SQL function (`unbase64`). Today, if 
a user passes malformed base64 to this function, they silently get partial 
results, which is arguably a correctness bug, not a feature.
   
   Changing `base64_decode` to return an error on invalid characters (or 
changing the signature to `Result<std::string>`) would require updating all 5 
call sites, but the actual behavioral change is limited to Gandiva's SQL 
function since the other callers never pass invalid input.
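
To make the proposal concrete, here is a minimal sketch of a strict decoder that rejects malformed input instead of returning partial output. It is not Arrow's implementation: `StrictBase64Decode` is a hypothetical name, and `std::optional` stands in for `Result<std::string>` so the example compiles without Arrow. It assumes the standard alphabet with mandatory `=` padding, and it does not check that unused trailing bits are zero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical strict decoder (illustration only, not Arrow's code).
// Returns std::nullopt on an invalid character, a length that is not a
// multiple of 4, or misplaced padding.
std::optional<std::string> StrictBase64Decode(std::string_view input) {
  // Lookup table: 6-bit value for each alphabet character, -1 otherwise.
  static const std::array<int8_t, 256> kTable = [] {
    std::array<int8_t, 256> t{};
    t.fill(-1);
    constexpr std::string_view kAlphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (std::size_t i = 0; i < kAlphabet.size(); ++i) {
      t[static_cast<unsigned char>(kAlphabet[i])] = static_cast<int8_t>(i);
    }
    return t;
  }();

  if (input.size() % 4 != 0) return std::nullopt;

  std::string out;
  out.reserve(input.size() / 4 * 3);
  for (std::size_t i = 0; i < input.size(); i += 4) {
    const bool last = (i + 4 == input.size());
    int pad = 0;
    uint32_t chunk = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const char c = input[i + j];
      if (c == '=') {
        // Padding is legal only in the last two slots of the final group.
        if (!last || j < 2) return std::nullopt;
        ++pad;
        chunk <<= 6;
        continue;
      }
      // A data character after padding (e.g. "AA=A") is malformed.
      if (pad > 0) return std::nullopt;
      const int8_t v = kTable[static_cast<unsigned char>(c)];
      if (v < 0) return std::nullopt;  // character outside the alphabet
      chunk = (chunk << 6) | static_cast<uint32_t>(v);
    }
    out.push_back(static_cast<char>((chunk >> 16) & 0xFF));
    if (pad < 2) out.push_back(static_cast<char>((chunk >> 8) & 0xFF));
    if (pad < 1) out.push_back(static_cast<char>(chunk & 0xFF));
  }
  return out;
}
```

With a decoder of this shape, the round-trip callers would only need to unwrap a value that is always present for their inputs, while the Gandiva function could turn the empty case into a SQL error.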


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.