This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 4e22f4d9c0 GH-49274: [Doc][C++] Document security model for Arrow C++ (#49489)
4e22f4d9c0 is described below

commit 4e22f4d9c07a07da67b5271881635bd668263077
Author: Antoine Pitrou <[email protected]>
AuthorDate: Mon Mar 30 11:38:54 2026 +0200

    GH-49274: [Doc][C++] Document security model for Arrow C++ (#49489)
    
    ### Rationale for this change
    
    Now that we have a general security model for the Arrow specs, add a security model for Arrow C++ specifically, meant to describe appropriate API usage.
    
    ### Are these changes tested?
    
    N/A.
    
    ### Are there any user-facing changes?
    
    No.
    
    * GitHub Issue: #49274
    
    Authored-by: Antoine Pitrou <[email protected]>
    Signed-off-by: Antoine Pitrou <[email protected]>
---
 cpp/src/arrow/array/data.h      |   2 +
 docs/source/cpp/api/array.rst   |  15 +++-
 docs/source/cpp/api/builder.rst |   4 ++
 docs/source/cpp/api/memory.rst  |  12 ++++
 docs/source/cpp/conventions.rst |   7 +-
 docs/source/cpp/csv.rst         |   2 +
 docs/source/cpp/ipc.rst         |   2 +
 docs/source/cpp/parquet.rst     |   2 +
 docs/source/cpp/security.rst    | 152 ++++++++++++++++++++++++++++++++++++++++
 docs/source/cpp/user_guide.rst  |   1 +
 docs/source/format/Security.rst |  11 ++-
 11 files changed, 204 insertions(+), 6 deletions(-)

diff --git a/cpp/src/arrow/array/data.h b/cpp/src/arrow/array/data.h
index 52d303f5c6..92308e8a01 100644
--- a/cpp/src/arrow/array/data.h
+++ b/cpp/src/arrow/array/data.h
@@ -481,6 +481,7 @@ struct ARROW_EXPORT ArrayData {
   std::shared_ptr<ArrayStatistics> statistics;
 };
 
+/// \class BufferSpan
 /// \brief A non-owning Buffer reference
 struct ARROW_EXPORT BufferSpan {
   // It is the user of this class's responsibility to ensure that
@@ -501,6 +502,7 @@ struct ARROW_EXPORT BufferSpan {
   }
 };
 
+/// \class ArraySpan
 /// \brief EXPERIMENTAL: A non-owning array data container
 ///
 /// Unlike ArrayData, this class doesn't own its referenced data type nor data buffers.
diff --git a/docs/source/cpp/api/array.rst b/docs/source/cpp/api/array.rst
index 91aa5da673..7c13575b41 100644
--- a/docs/source/cpp/api/array.rst
+++ b/docs/source/cpp/api/array.rst
@@ -92,8 +92,8 @@ Extension arrays
 .. doxygenclass:: arrow::ExtensionArray
    :members:
 
-Run-End Encoded Array
----------------------
+Run-end encoded
+---------------
 
 .. doxygenclass:: arrow::RunEndEncodedArray
    :members:
@@ -116,6 +116,17 @@ Chunked Arrays
    :project: arrow_cpp
    :members:
 
+Non-owning data class
+=====================
+
+.. warning::
+   As this class doesn't keep alive the objects and data it points to, their
+   lifetime must be ensured separately. We recommend using :class:`arrow::ArrayData`
+   instead.
+
+.. doxygenclass:: arrow::ArraySpan
+   :members:
+
 Utilities
 =========
 
diff --git a/docs/source/cpp/api/builder.rst b/docs/source/cpp/api/builder.rst
index 1342ba2655..d6532b03a2 100644
--- a/docs/source/cpp/api/builder.rst
+++ b/docs/source/cpp/api/builder.rst
@@ -15,10 +15,14 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
+.. _cpp-api-array-builders:
+
 ==============
 Array Builders
 ==============
 
+.. seealso:: :ref:`cpp-api-buffer-builders` for direct construction of array buffers
+
 .. doxygenclass:: arrow::ArrayBuilder
    :members:
 
diff --git a/docs/source/cpp/api/memory.rst b/docs/source/cpp/api/memory.rst
index 9d12e4bdf0..49d30f683f 100644
--- a/docs/source/cpp/api/memory.rst
+++ b/docs/source/cpp/api/memory.rst
@@ -55,6 +55,16 @@ Buffers
 .. doxygenclass:: arrow::ResizableBuffer
    :members:
 
+Non-owning Buffer
+-----------------
+
+.. warning::
+   This class is exposed solely as a building block for :class:`arrow::ArraySpan`.
+   For any other purpose, please use :class:`arrow::Buffer`.
+
+.. doxygenclass:: arrow::BufferSpan
+   :members:
+
 Memory Pools
 ------------
 
@@ -91,6 +101,8 @@ Slicing
 .. doxygengroup:: buffer-slicing-functions
    :content-only:
 
+.. _cpp-api-buffer-builders:
+
 Buffer Builders
 ---------------
 
diff --git a/docs/source/cpp/conventions.rst b/docs/source/cpp/conventions.rst
index 8ea625c0b8..e82fd3ecb1 100644
--- a/docs/source/cpp/conventions.rst
+++ b/docs/source/cpp/conventions.rst
@@ -20,6 +20,8 @@
 
 .. cpp:namespace:: arrow
 
+.. _cpp-conventions:
+
 Conventions
 ===========
 
@@ -43,6 +45,10 @@ Safe pointers
 Arrow objects are usually passed and stored using safe pointers -- most of
 the time :class:`std::shared_ptr` but sometimes also :class:`std::unique_ptr`.
 
+Non-owning alternatives exist for the rare situations where the overhead of
+a safe pointer is considered unacceptable: :class:`ArraySpan` and :class:`BufferSpan`.
+Their usage in third-party code is not recommended.
+
 Immutability
 ------------
 
@@ -104,4 +110,3 @@ For example::
 
 .. seealso::
    :doc:`API reference for error reporting <api/support>`
-
diff --git a/docs/source/cpp/csv.rst b/docs/source/cpp/csv.rst
index bcb17bdc58..74ee0bb4fb 100644
--- a/docs/source/cpp/csv.rst
+++ b/docs/source/cpp/csv.rst
@@ -30,6 +30,8 @@ to create Arrow Tables or a stream of Arrow RecordBatches.
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
+.. _cpp-csv-reading:
+
 Reading CSV files
 =================
 
diff --git a/docs/source/cpp/ipc.rst b/docs/source/cpp/ipc.rst
index ce4175bca0..14ae060e5e 100644
--- a/docs/source/cpp/ipc.rst
+++ b/docs/source/cpp/ipc.rst
@@ -33,6 +33,8 @@ lower level input/output, handled through the :doc:`IO interfaces <io>`.
 For reading, there is also an event-driven API that enables feeding
 arbitrary data into the IPC decoding layer asynchronously.
 
+.. _cpp-ipc-reading:
+
 Reading IPC streams and files
 =============================
 
diff --git a/docs/source/cpp/parquet.rst b/docs/source/cpp/parquet.rst
index 8c55ec5d53..045f7f80f6 100644
--- a/docs/source/cpp/parquet.rst
+++ b/docs/source/cpp/parquet.rst
@@ -32,6 +32,8 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+.. _cpp-parquet-reading:
+
 Reading Parquet files
 =====================
 
diff --git a/docs/source/cpp/security.rst b/docs/source/cpp/security.rst
new file mode 100644
index 0000000000..ee35f7b296
--- /dev/null
+++ b/docs/source/cpp/security.rst
@@ -0,0 +1,152 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+
+.. _cpp-security:
+
+=======================
+Security Considerations
+=======================
+
+.. important::
+   This document describes the security model for using the Arrow C++ APIs.
+   For better understanding of this document, we recommend that you first read
+   the :ref:`overall security model <format_security>` for the Arrow project.
+
+API parameter validity
+======================
+
+Many Arrow C++ APIs report errors using the :class:`arrow::Status` and
+:class:`arrow::Result` types. Such APIs can be assumed to detect common errors
+in the provided arguments. However, there are also often implicit pre-conditions
+that have to be upheld; these can usually be deduced from the semantics of an
+API as described by its documentation.
+
+.. seealso:: Arrow C++ :ref:`cpp-conventions`
+
+Pointer validity
+----------------
+
+Pointers are always assumed to be valid and point to memory of the size required
+by the API. In particular, it is *forbidden to pass a null pointer* except where
+the API documentation explicitly says otherwise.
+
+Type restrictions
+-----------------
+
+Some APIs are specified to operate on specific Arrow data types and may not
+verify that their arguments conform to the expected data types. Passing the
+wrong kind of data as input may lead to undefined behavior.
+
+.. _cpp-valid-data:
+
+Data validity
+-------------
+
+Arrow data, for example passed as :class:`arrow::Array` or :class:`arrow::Table`,
+is always assumed to be :ref:`valid <format-invalid-data>`. If your program may
+encounter invalid data, it must explicitly check its validity by calling one of
+the following validation APIs.
+
+Structural validity
+'''''''''''''''''''
+
+The ``Validate`` methods exposed on various Arrow C++ classes perform relatively
+inexpensive validity checks that the data is structurally valid. This implies
+checking the number of buffers, child arrays, and other similar conditions.
+
+* :func:`arrow::Array::Validate`
+* :func:`arrow::RecordBatch::Validate`
+* :func:`arrow::ChunkedArray::Validate`
+* :func:`arrow::Table::Validate`
+* :func:`arrow::Scalar::Validate`
+
+These checks typically are constant-time against the number of rows in the data,
+but linear in the number of descendant fields. They can be good enough to detect
+potential bugs in your own code. However, they are not enough to detect all classes of
+invalid data, and they won't protect against all kinds of malicious payloads.
+
+Full validity
+'''''''''''''
+
+The ``ValidateFull`` methods exposed by the same classes perform the same validity
+checks as the ``Validate`` methods, but they also check the data extensively for
+any non-conformance to the Arrow spec. In particular, they check all the offsets
+of variable-length data types, which is of fundamental importance when ingesting
+untrusted data from sources such as the IPC format (otherwise the variable-length
+offsets could point outside of the corresponding data buffer). They also check
+for invalid values, such as invalid UTF-8 strings or decimal values out of range
+for the advertised precision.
+
+* :func:`arrow::Array::ValidateFull`
+* :func:`arrow::RecordBatch::ValidateFull`
+* :func:`arrow::ChunkedArray::ValidateFull`
+* :func:`arrow::Table::ValidateFull`
+* :func:`arrow::Scalar::ValidateFull`
+
+"Safe" and "unsafe" APIs
+------------------------
+
+Some APIs are exposed in both "safe" and "unsafe" variants. The naming convention
+for such pairs varies: sometimes the former has a ``Safe`` suffix (for example
+``SliceSafe`` vs. ``Slice``), sometimes the latter has an ``Unsafe`` prefix or
+suffix (for example ``Append`` vs. ``UnsafeAppend``).
+
+In all cases, the "unsafe" API is intended as a more efficient API that
+eschews some of the checks that the "safe" API performs. It is then up to the
+caller to ensure that the preconditions are met, otherwise undefined behavior
+may ensue.
+
+The API documentation usually spells out the differences between "safe" and "unsafe"
+variants, but these typically fall into two categories:
+
+* structural checks, such as passing the right Arrow data type or numbers of buffers;
+* allocation size checks, such as having preallocated enough data for the given input
+  arguments (this is typical of the :ref:`array builders <cpp-api-array-builders>`
+  and :ref:`buffer builders <cpp-api-buffer-builders>`).
+
+Ingesting untrusted data
+========================
+
+As an exception to the above (see :ref:`cpp-valid-data`), some APIs support ingesting
+untrusted, potentially malicious data. These are:
+
+* the :ref:`IPC reader <cpp-ipc-reading>` APIs
+* the :ref:`Parquet reader <cpp-parquet-reading>` APIs
+* the :ref:`CSV reader <cpp-csv-reading>` APIs
+
+IPC and Parquet readers
+-----------------------
+
+You must not assume that these will always return valid Arrow data. The reason
+for not validating data automatically is that validation can be expensive but
+unnecessary when reading from trusted data sources.
+
+Instead, when using these APIs with potentially invalid data (such as data coming
+from an untrusted source), you **must** follow these steps:
+
+1. Check any error returned by the API, as with any other API
+2. If the API returned successfully, validate the returned Arrow data in full
+   (see "Full validity" above)
+
+CSV reader
+----------
+
+With the default :class:`conversion options <arrow::csv::ConvertOptions>`,
+the CSV reader will either return valid Arrow data or error out. Some options,
+however, allow relaxing the corresponding checks in favor of performance.
diff --git a/docs/source/cpp/user_guide.rst b/docs/source/cpp/user_guide.rst
index 094859f9c5..722e9e50af 100644
--- a/docs/source/cpp/user_guide.rst
+++ b/docs/source/cpp/user_guide.rst
@@ -39,6 +39,7 @@ User Guide
    json
    dataset
    flight
+   security
    gdb
    threading
    opentelemetry
diff --git a/docs/source/format/Security.rst b/docs/source/format/Security.rst
index e14f07143c..8e630ea9a5 100644
--- a/docs/source/format/Security.rst
+++ b/docs/source/format/Security.rst
@@ -26,10 +26,15 @@ data from untrusted sources. It focuses specifically on data passed in a
 standardized serialized form (such as a IPC stream), as opposed to an
 implementation-specific native representation (such as ``arrow::Array`` in C++).
 
-.. note::
+.. important::
    Implementation-specific concerns, such as bad API usage, are out of scope
    for this document. Please refer to the implementation's own documentation.
 
+.. seealso::
+
+   Arrow C++ :ref:`cpp-security`
+      Security model for Arrow C++ APIs
+
 
 Who should read this
 ====================
@@ -49,6 +54,8 @@ You should read this document if you belong to either of these two categories:
 Columnar Format
 ===============
 
+.. _format-invalid-data:
+
 Invalid data
 ------------
 
@@ -89,8 +96,6 @@ explicitly validates any Arrow data it receives under serialized form
 from untrusted sources. Many Arrow implementations provide explicit APIs to
 perform such validation.
 
-.. TODO: link to some validation APIs for the main implementations here?
-
 Advice for implementors
 '''''''''''''''''''''''
 
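
Illustration (not part of the commit): the "Full validity" section of the new security.rst says that ``ValidateFull`` checks the offsets of variable-length data types, since otherwise they could point outside of the data buffer. The sketch below shows the kind of check this implies, under stated assumptions. ``OffsetsAreValid`` is a hypothetical standalone helper written for this note; it is not an Arrow API and does not reproduce Arrow's implementation. It assumes 32-bit offsets and at least one offset entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow API): returns true when `offsets`
// delimits in-bounds slices of a data buffer holding `data_size` bytes.
// Assumption for this sketch: n values are described by n + 1 offsets,
// so an empty offsets vector is rejected.
bool OffsetsAreValid(const std::vector<int32_t>& offsets, int64_t data_size) {
  // The first offset must not be negative.
  if (offsets.empty() || offsets.front() < 0) return false;
  // Offsets must never decrease, otherwise a value would have negative length.
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) return false;
  }
  // The last offset must not point past the end of the data buffer.
  return static_cast<int64_t>(offsets.back()) <= data_size;
}
```

With a 5-byte data buffer, the offsets ``{0, 3, 5}`` pass this check, while ``{0, 3, 9}`` (past the end of the buffer) and ``{0, 4, 2}`` (decreasing) fail. The cost is linear in the number of values, which fits the commit's point that full validation is more expensive than the structural ``Validate`` checks.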

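Illustration (not part of the commit): the '"Safe" and "unsafe" APIs' section of the new security.rst describes pairs such as ``Append`` vs. ``UnsafeAppend``, where the unsafe variant skips a check and leaves the precondition to the caller. The toy class below sketches that contract for the allocation-size case. ``ToyInt32Builder`` is hypothetical and written for this note; it borrows the method names from the document but is not an Arrow class, and the growth policy is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy builder (not an Arrow class) contrasting a checked
// append with an unchecked one.
class ToyInt32Builder {
 public:
  // Makes room for at least `additional` more values.
  void Reserve(std::size_t additional) {
    const std::size_t needed = length_ + additional;
    if (needed > storage_.size()) {
      // Grow geometrically so repeated Append() calls stay cheap.
      storage_.resize(std::max(needed, storage_.size() * 2));
    }
  }

  // "Safe" variant: ensures capacity itself before writing.
  void Append(int32_t value) {
    Reserve(1);
    UnsafeAppend(value);
  }

  // "Unsafe" variant: the caller must have reserved room beforehand.
  // Calling it without enough capacity writes out of bounds, which is
  // undefined behavior.
  void UnsafeAppend(int32_t value) { storage_[length_++] = value; }

  std::size_t length() const { return length_; }
  int32_t Value(std::size_t i) const { return storage_[i]; }

 private:
  std::vector<int32_t> storage_;  // allocated capacity
  std::size_t length_ = 0;        // number of values written so far
};
```

A caller that knows the number of values up front would call ``Reserve(n)`` once and then ``UnsafeAppend`` n times, paying for the capacity check once instead of on every append. A caller that does not know the count should stay with ``Append``.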