This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 4e22f4d9c0 GH-49274: [Doc][C++] Document security model for Arrow C++ (#49489)
4e22f4d9c0 is described below
commit 4e22f4d9c07a07da67b5271881635bd668263077
Author: Antoine Pitrou <[email protected]>
AuthorDate: Mon Mar 30 11:38:54 2026 +0200
GH-49274: [Doc][C++] Document security model for Arrow C++ (#49489)
### Rationale for this change
Now that we have a general security model for the Arrow specs, add a
security model for Arrow C++ specifically, meant to describe appropriate API
usage.
### Are these changes tested?
N/A.
### Are there any user-facing changes?
No.
* GitHub Issue: #49274
Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
---
cpp/src/arrow/array/data.h | 2 +
docs/source/cpp/api/array.rst | 15 +++-
docs/source/cpp/api/builder.rst | 4 ++
docs/source/cpp/api/memory.rst | 12 ++++
docs/source/cpp/conventions.rst | 7 +-
docs/source/cpp/csv.rst | 2 +
docs/source/cpp/ipc.rst | 2 +
docs/source/cpp/parquet.rst | 2 +
docs/source/cpp/security.rst | 152 ++++++++++++++++++++++++++++++++++++++++
docs/source/cpp/user_guide.rst | 1 +
docs/source/format/Security.rst | 11 ++-
11 files changed, 204 insertions(+), 6 deletions(-)
diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
index 52d303f5c6..92308e8a01 100644
--- a/cpp/src/arrow/array/data.h
+++ b/cpp/src/arrow/array/data.h
@@ -481,6 +481,7 @@ struct ARROW_EXPORT ArrayData {
std::shared_ptr<ArrayStatistics> statistics;
};
+/// \class BufferSpan
/// \brief A non-owning Buffer reference
struct ARROW_EXPORT BufferSpan {
// It is the user of this class's responsibility to ensure that
@@ -501,6 +502,7 @@ struct ARROW_EXPORT BufferSpan {
}
};
+/// \class ArraySpan
/// \brief EXPERIMENTAL: A non-owning array data container
///
/// Unlike ArrayData, this class doesn't own its referenced data type nor data buffers.
diff --git a/docs/source/cpp/api/array.rst b/docs/source/cpp/api/array.rst
index 91aa5da673..7c13575b41 100644
--- a/docs/source/cpp/api/array.rst
+++ b/docs/source/cpp/api/array.rst
@@ -92,8 +92,8 @@ Extension arrays
.. doxygenclass:: arrow::ExtensionArray
:members:
-Run-End Encoded Array
----------------------
+Run-end encoded
+---------------
.. doxygenclass:: arrow::RunEndEncodedArray
:members:
@@ -116,6 +116,17 @@ Chunked Arrays
:project: arrow_cpp
:members:
+Non-owning data class
+=====================
+
+.. warning::
+ As this class doesn't keep alive the objects and data it points to, their
+ lifetime must be ensured separately. We recommend using :class:`arrow::ArrayData`
+ instead.
+
+.. doxygenclass:: arrow::ArraySpan
+ :members:
+
Utilities
=========
diff --git a/docs/source/cpp/api/builder.rst b/docs/source/cpp/api/builder.rst
index 1342ba2655..d6532b03a2 100644
--- a/docs/source/cpp/api/builder.rst
+++ b/docs/source/cpp/api/builder.rst
@@ -15,10 +15,14 @@
.. specific language governing permissions and limitations
.. under the License.
+.. _cpp-api-array-builders:
+
==============
Array Builders
==============
+.. seealso:: :ref:`cpp-api-buffer-builders` for direct construction of array buffers
+
.. doxygenclass:: arrow::ArrayBuilder
:members:
diff --git a/docs/source/cpp/api/memory.rst b/docs/source/cpp/api/memory.rst
index 9d12e4bdf0..49d30f683f 100644
--- a/docs/source/cpp/api/memory.rst
+++ b/docs/source/cpp/api/memory.rst
@@ -55,6 +55,16 @@ Buffers
.. doxygenclass:: arrow::ResizableBuffer
:members:
+Non-owning Buffer
+-----------------
+
+.. warning::
+ This class is exposed solely as a building block for :class:`arrow::ArraySpan`.
+ For any other purpose, please use :class:`arrow::Buffer`.
+
+.. doxygenclass:: arrow::BufferSpan
+ :members:
+
Memory Pools
------------
@@ -91,6 +101,8 @@ Slicing
.. doxygengroup:: buffer-slicing-functions
:content-only:
+.. _cpp-api-buffer-builders:
+
Buffer Builders
---------------
diff --git a/docs/source/cpp/conventions.rst b/docs/source/cpp/conventions.rst
index 8ea625c0b8..e82fd3ecb1 100644
--- a/docs/source/cpp/conventions.rst
+++ b/docs/source/cpp/conventions.rst
@@ -20,6 +20,8 @@
.. cpp:namespace:: arrow
+.. _cpp-conventions:
+
Conventions
===========
@@ -43,6 +45,10 @@ Safe pointers
Arrow objects are usually passed and stored using safe pointers -- most of
the time :class:`std::shared_ptr` but sometimes also :class:`std::unique_ptr`.
+Non-owning alternatives exist for the rare situations where the overhead of
+a safe pointer is considered unacceptable: :class:`ArraySpan` and :class:`BufferSpan`.
+Their usage in third-party code is not recommended.
+
Immutability
------------
@@ -104,4 +110,3 @@ For example::
.. seealso::
:doc:`API reference for error reporting <api/support>`
-
diff --git a/docs/source/cpp/csv.rst b/docs/source/cpp/csv.rst
index bcb17bdc58..74ee0bb4fb 100644
--- a/docs/source/cpp/csv.rst
+++ b/docs/source/cpp/csv.rst
@@ -30,6 +30,8 @@ to create Arrow Tables or a stream of Arrow RecordBatches.
.. seealso::
:ref:`CSV reader/writer API reference <cpp-api-csv>`.
+.. _cpp-csv-reading:
+
Reading CSV files
=================
diff --git a/docs/source/cpp/ipc.rst b/docs/source/cpp/ipc.rst
index ce4175bca0..14ae060e5e 100644
--- a/docs/source/cpp/ipc.rst
+++ b/docs/source/cpp/ipc.rst
@@ -33,6 +33,8 @@ lower level input/output, handled through the :doc:`IO interfaces <io>`.
For reading, there is also an event-driven API that enables feeding
arbitrary data into the IPC decoding layer asynchronously.
+.. _cpp-ipc-reading:
+
Reading IPC streams and files
=============================
diff --git a/docs/source/cpp/parquet.rst b/docs/source/cpp/parquet.rst
index 8c55ec5d53..045f7f80f6 100644
--- a/docs/source/cpp/parquet.rst
+++ b/docs/source/cpp/parquet.rst
@@ -32,6 +32,8 @@ is a space-efficient columnar storage format for complex data. The Parquet
C++ implementation is part of the Apache Arrow project and benefits
from tight integration with the Arrow C++ classes and facilities.
+.. _cpp-parquet-reading:
+
Reading Parquet files
=====================
diff --git a/docs/source/cpp/security.rst b/docs/source/cpp/security.rst
new file mode 100644
index 0000000000..ee35f7b296
--- /dev/null
+++ b/docs/source/cpp/security.rst
@@ -0,0 +1,152 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+
+.. _cpp-security:
+
+=======================
+Security Considerations
+=======================
+
+.. important::
+ This document describes the security model for using the Arrow C++ APIs.
+ For better understanding of this document, we recommend that you first read
+ the :ref:`overall security model <format_security>` for the Arrow project.
+
+API parameter validity
+======================
+
+Many Arrow C++ APIs report errors using the :class:`arrow::Status` and
+:class:`arrow::Result` types. Such APIs can be assumed to detect common errors
+in the provided arguments. However, there are also often implicit pre-conditions
+that have to be upheld; these can usually be deduced from the semantics of an
+API as described by its documentation.
+
+.. seealso:: Arrow C++ :ref:`cpp-conventions`
+
+Pointer validity
+----------------
+
+Pointers are always assumed to be valid and point to memory of the size required
+by the API. In particular, it is *forbidden to pass a null pointer* except where
+the API documentation explicitly says otherwise.
+
+Type restrictions
+-----------------
+
+Some APIs are specified to operate on specific Arrow data types and may not
+verify that their arguments conform to the expected data types. Passing the
+wrong kind of data as input may lead to undefined behavior.
+
+.. _cpp-valid-data:
+
+Data validity
+-------------
+
+Arrow data, for example passed as :class:`arrow::Array` or :class:`arrow::Table`,
+is always assumed to be :ref:`valid <format-invalid-data>`. If your program may
+encounter invalid data, it must explicitly check its validity by calling one of
+the following validation APIs.
+
+Structural validity
+'''''''''''''''''''
+
+The ``Validate`` methods exposed on various Arrow C++ classes perform relatively
+inexpensive validity checks that the data is structurally valid. This implies
+checking the number of buffers, child arrays, and other similar conditions.
+
+* :func:`arrow::Array::Validate`
+* :func:`arrow::RecordBatch::Validate`
+* :func:`arrow::ChunkedArray::Validate`
+* :func:`arrow::Table::Validate`
+* :func:`arrow::Scalar::Validate`
+
+These checks typically are constant-time against the number of rows in the data,
+but linear in the number of descendant fields. They can be good enough to detect
+potential bugs in your own code. However, they are not enough to detect all classes of
+invalid data, and they won't protect against all kinds of malicious payloads.
+
+Full validity
+'''''''''''''
+
+The ``ValidateFull`` methods exposed by the same classes perform the same validity
+checks as the ``Validate`` methods, but they also check the data extensively for
+any non-conformance to the Arrow spec. In particular, they check all the offsets
+of variable-length data types, which is of fundamental importance when ingesting
+untrusted data from sources such as the IPC format (otherwise the variable-length
+offsets could point outside of the corresponding data buffer). They also check
+for invalid values, such as invalid UTF-8 strings or decimal values out of range
+for the advertised precision.
+
+* :func:`arrow::Array::ValidateFull`
+* :func:`arrow::RecordBatch::ValidateFull`
+* :func:`arrow::ChunkedArray::ValidateFull`
+* :func:`arrow::Table::ValidateFull`
+* :func:`arrow::Scalar::ValidateFull`
+
+"Safe" and "unsafe" APIs
+------------------------
+
+Some APIs are exposed in both "safe" and "unsafe" variants. The naming convention
+for such pairs varies: sometimes the former has a ``Safe`` suffix (for example
+``SliceSafe`` vs. ``Slice``), sometimes the latter has an ``Unsafe`` prefix or
+suffix (for example ``Append`` vs. ``UnsafeAppend``).
+
+In all cases, the "unsafe" API is intended as a more efficient API that
+eschews some of the checks that the "safe" API performs. It is then up to the
+caller to ensure that the preconditions are met, otherwise undefined behavior
+may ensue.
+
+The API documentation usually spells out the differences between "safe" and "unsafe"
+variants, but these typically fall into two categories:
+
+* structural checks, such as passing the right Arrow data type or numbers of buffers;
+* allocation size checks, such as having preallocated enough data for the given input
+ arguments (this is typical of the :ref:`array builders <cpp-api-array-builders>`
+ and :ref:`buffer builders <cpp-api-buffer-builders>`).
+
+Ingesting untrusted data
+========================
+
+As an exception to the above (see :ref:`cpp-valid-data`), some APIs support ingesting
+untrusted, potentially malicious data. These are:
+
+* the :ref:`IPC reader <cpp-ipc-reading>` APIs
+* the :ref:`Parquet reader <cpp-parquet-reading>` APIs
+* the :ref:`CSV reader <cpp-csv-reading>` APIs
+
+IPC and Parquet readers
+-----------------------
+
+You must not assume that these will always return valid Arrow data. The reason
+for not validating data automatically is that validation can be expensive but
+unnecessary when reading from trusted data sources.
+
+Instead, when using these APIs with potentially invalid data (such as data coming
+from an untrusted source), you **must** follow these steps:
+
+1. Check any error returned by the API, as with any other API
+2. If the API returned successfully, validate the returned Arrow data in full
+ (see "Full validity" above)
+
+CSV reader
+----------
+
+With the default :class:`conversion options <arrow::csv::ConvertOptions>`,
+the CSV reader will either return valid Arrow data or error out. Some options,
+however, allow relaxing the corresponding checks in favor of performance.
diff --git a/docs/source/cpp/user_guide.rst b/docs/source/cpp/user_guide.rst
index 094859f9c5..722e9e50af 100644
--- a/docs/source/cpp/user_guide.rst
+++ b/docs/source/cpp/user_guide.rst
@@ -39,6 +39,7 @@ User Guide
json
dataset
flight
+ security
gdb
threading
opentelemetry
diff --git a/docs/source/format/Security.rst b/docs/source/format/Security.rst
index e14f07143c..8e630ea9a5 100644
--- a/docs/source/format/Security.rst
+++ b/docs/source/format/Security.rst
@@ -26,10 +26,15 @@ data from untrusted sources. It focuses specifically on data passed in a
standardized serialized form (such as a IPC stream), as opposed to an
implementation-specific native representation (such as ``arrow::Array`` in C++).
-.. note::
+.. important::
Implementation-specific concerns, such as bad API usage, are out of scope
for this document. Please refer to the implementation's own documentation.
+.. seealso::
+
+ Arrow C++ :ref:`cpp-security`
+ Security model for Arrow C++ APIs
+
Who should read this
====================
@@ -49,6 +54,8 @@ You should read this document if you belong to either of these two categories:
Columnar Format
===============
+.. _format-invalid-data:
+
Invalid data
------------
@@ -89,8 +96,6 @@ explicitly validates any Arrow data it receives under serialized form
from untrusted sources. Many Arrow implementations provide explicit APIs to
perform such validation.
-.. TODO: link to some validation APIs for the main implementations here?
-
Advice for implementors
'''''''''''''''''''''''