pitrou commented on code in PR #48870: URL: https://github.com/apache/arrow/pull/48870#discussion_r2732134892
##########
docs/source/format/Security.rst:
##########
@@ -0,0 +1,251 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_security:
+
+***********************
+Security Considerations
+***********************
+
+This document describes potential security concerns with the various Arrow
+specifications in contexts where data cannot be fully trusted.
+
+
+Who should read this
+====================
+
+This document targets two categories of readers:
+
+1. *implementors* of Arrow libraries: that is, libraries that provide APIs
+   abstraction away from the details of the Arrow formats and protocols; such
+   libraries include the official Arrow implementations documented on
+   https://arrow.apache.org, but not only.
+
+2. *users* of Arrow: that is, developers of third-party libraries or applications
+   that use some of the Arrow formats or protocols by calling into Arrow libraries
+   as defined above.
+
+
+Columnar Format
+===============
+
+Invalid data
+------------
+
+The Arrow :ref:`columnar format <_format_columnar>` is an efficient binary
+representation with a focus on performance and efficiency. While the format
+does not store raw pointers, the contents of Arrow buffers are often
+combined and converted to pointers into the process' address space.
+Invalid Arrow data may therefore cause invalid memory accesses
+(potentially crashing the process) or access to non-Arrow data
+(potentially allowing an attacker to exfiltrate confidential information).
+
+For instance, to read a value from a Binary array, you need to 1) read the
+values' offsets from array buffer #2, and 2) read the range of bytes
+delimited by these offsets in array buffer #3. If the offsets are invalid
+(deliberately or not), then step 2) can access memory outside of the buffers'
+range.

Review Comment:
   Why not, but this is just a simple example. I'll try to avoid the mixup of buffer numbers and step numbers, though.
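
For illustration only (this is not part of the PR), the Binary array example in the quoted paragraph can be sketched in Python. The function names `offsets_are_valid` and `read_binary_value` are hypothetical, and the buffers are modeled as a plain list of integers and a `bytes` object rather than real Arrow buffers. Python itself would not read out-of-bounds memory here; the sketch only models the check that a native implementation would need before turning offsets into pointers.

```python
def offsets_are_valid(offsets, data_size):
    """Return True if every offset is non-negative, non-decreasing,
    and no larger than the size of the data buffer."""
    previous = 0
    for offset in offsets:
        if offset < previous or offset > data_size:
            return False
        previous = offset
    return True


def read_binary_value(offsets, data, i):
    """Read value i by first looking up its offsets, then slicing the data.

    Raises instead of reading outside the data buffer.
    """
    if not 0 <= i < len(offsets) - 1:
        raise IndexError("value index out of range")
    # First: read the pair of offsets delimiting value i.
    start, end = offsets[i], offsets[i + 1]
    # Check the pair before using it to address the data buffer.
    if not 0 <= start <= end <= len(data):
        raise ValueError("invalid offsets for value %d" % i)
    # Second: read the delimited byte range from the data buffer.
    return data[start:end]


data = b"helloworld"
print(offsets_are_valid([0, 5, 10], len(data)))
print(offsets_are_valid([0, 50, 10], len(data)))
print(read_binary_value([0, 5, 10], data, 1))
```

Validating the whole offsets buffer once up front, versus checking each pair on access as `read_binary_value` does, is a design choice; the sketch shows both without suggesting which one an Arrow implementation uses.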
