This is an automated email from the ASF dual-hosted git repository.

lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new fd536176a7 GH-43453: [Format] Add Opaque canonical extension type (#43457)
fd536176a7 is described below

commit fd536176a7b19c73b63bd29acf536fbbb2d8083e
Author: David Li <[email protected]>
AuthorDate: Fri Aug 2 09:40:27 2024 +0900

    GH-43453: [Format] Add Opaque canonical extension type (#43457)
    
    ### Rationale for this change
    
    Add the newly ratified extension type.
    
    ### What changes are included in this PR?
    
    The type specification only.
    
    ### Are these changes tested?
    
    N/A
    
    ### Are there any user-facing changes?
    
    No.
    * GitHub Issue: #43453
    
    Lead-authored-by: David Li <[email protected]>
    Co-authored-by: Sutou Kouhei <[email protected]>
    Signed-off-by: David Li <[email protected]>
---
 docs/source/format/CanonicalExtensions.rst | 110 +++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index c258f889dc..1d86fcf23c 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -283,6 +283,116 @@ UUID
    A specific UUID version is not required or guaranteed. This extension represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
 
+Opaque
+======
+
+Opaque represents a type that an Arrow-based system received from an external
+(often non-Arrow) system, but that it cannot interpret.  In this case, it can
+pass on Opaque to its clients to at least show that a field exists and
+preserve metadata about the type from the other system.
+
+Extension parameters:
+
+* Extension name: ``arrow.opaque``.
+
+* The storage type of this extension is any type.  If there is no underlying
+  data, the storage type should be Null.
+
+* Extension type parameters:
+
+  * **type_name** = the name of the unknown type in the external system.
+  * **vendor_name** = the name of the external system.
+
+* Description of the serialization:
+
+  A valid JSON object containing the parameters as fields.  In the future,
+  additional fields may be added, but all fields current and future are never
+  required to interpret the array.
+
+  Developers **should not** attempt to enable public semantic interoperability
+  of Opaque by canonicalizing specific values of these parameters.
+
+Rationale
+---------
+
+Interfacing with non-Arrow systems requires a way to handle data that doesn't
+have an equivalent Arrow type.  In this case, use the Opaque type, which
+explicitly represents an unsupported field.  Other solutions are inadequate:
+
+* Raising an error means even one unsupported field makes all operations
+  impossible, even if (for instance) the user is just trying to view a schema.
+* Dropping unsupported columns misleads the user as to the actual schema.
+* An extension type may not exist for the unsupported type.
+* Generating an extension type on the fly would falsely imply support.
+
+Applications **should not** make conventions around vendor_name and type_name.
+These parameters are meant for human end users to understand what type wasn't
+supported.  Applications may try to interpret these fields, but must be
+prepared for breakage (e.g., when the type becomes supported with a custom
+extension type later on).  Similarly, **Opaque is not a generic container for
+file formats**.  Considerations such as MIME types are irrelevant.  In both of
+these cases, create a custom extension type instead.
+
+Examples:
+
+* A Flight SQL service that supports connecting external databases may
+  encounter columns with unsupported types in external tables.  In this case,
+  it can use the Opaque[Null] type to at least report that a column exists
+  with a particular name and type name.  This lets clients know that a column
+  exists, but is not supported.  Null is used as the storage type here because
+  only schemas are involved.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "varray", "vendor_name": "Oracle"}
+
+* The ADBC PostgreSQL driver gets results as a series of length-prefixed byte
+  fields.  But the driver will not always know how to parse the bytes, as
+  there may be extensions (e.g. PostGIS).  It can use Opaque[Binary] to still
+  return those bytes to the application, which may be able to parse the data
+  itself.  Opaque differentiates the column from an actual binary column and
+  makes it clear that the value is directly from PostgreSQL.  (A custom
+  extension type is preferred, but there will always be extensions that the
+  driver does not know about.)
+
+  An example of the extension metadata would be::
+
+    {"type_name": "geometry", "vendor_name": "PostGIS"}
+
+* The ADBC PostgreSQL driver may also know how to parse the bytes, but not
+  know the intended semantics.  For example, `composite types
+  <https://www.postgresql.org/docs/current/rowtypes.html>`_ can add new
+  semantics to existing types, somewhat like Arrow extension types.  The
+  driver would be able to parse the underlying bytes in this case, but would
+  still use the Opaque type.
+
+  Consider the example in the PostgreSQL documentation of a ``complex`` type.
+  Mapping the type to a plain Arrow ``struct`` type would lose meaning, just
+  as it would be undesirable for an Arrow system to handle all extension types
+  by dropping the extension metadata.  Instead, the driver can use
+  Opaque[Struct] to pass on the composite type info.  (It would be wrong to
+  try to map this to an Arrow-defined complex type: the driver does not know
+  the proper semantics of a user-defined type, which cannot and should not be
+  hardcoded into the driver in the first place.)
+
+  An example of the extension metadata would be::
+
+    {"type_name": "database_name.schema_name.complex", "vendor_name": "PostgreSQL"}
+
+* The JDBC adapter in the Arrow Java libraries converts JDBC result sets into
+  Arrow arrays, and can get Arrow schemas from result sets.  JDBC, however,
+  allows drivers to return `arbitrary Java objects
+  <https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#OTHER>`_.
+
+  The driver can use Opaque[Null] as a placeholder during schema conversion,
+  only erroring if the application tries to fetch the actual data.  That way,
+  clients can at least introspect result schemas to decide whether they can
+  proceed to fetch the data, or only query certain columns.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "OTHER", "vendor_name": "JDBC driver name"}
+
 =========================
 Community Extension Types
 =========================
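
The serialization described in the diff above (a JSON object carrying
``type_name`` and ``vendor_name``) can be illustrated with a minimal sketch.
This is not part of the commit.  The helper names below are hypothetical, and
the lenient handling of missing or extra fields is only one reading of the
text.  The ``ARROW:extension:name`` and ``ARROW:extension:metadata`` keys are
the field metadata keys that the Arrow columnar format uses for extension
types; everything else is an assumption made for illustration.

```python
import json

# Extension name as given in the specification text above.
EXTENSION_NAME = "arrow.opaque"


def serialize_opaque_metadata(type_name, vendor_name):
    """Hypothetical helper: build the JSON serialization for Opaque."""
    return json.dumps({"type_name": type_name, "vendor_name": vendor_name})


def parse_opaque_metadata(serialized):
    """Hypothetical helper: read the parameters back.

    Unknown fields are ignored and missing fields become None (an
    assumption of this sketch), since the text says additional fields
    may be added later and are not required to interpret the array.
    """
    obj = json.loads(serialized)
    if not isinstance(obj, dict):
        raise ValueError("Opaque metadata must be a JSON object")
    return obj.get("type_name"), obj.get("vendor_name")


def opaque_field_metadata(type_name, vendor_name):
    """Hypothetical helper: field-level metadata for an Opaque field."""
    return {
        "ARROW:extension:name": EXTENSION_NAME,
        "ARROW:extension:metadata": serialize_opaque_metadata(
            type_name, vendor_name
        ),
    }


if __name__ == "__main__":
    # Mirrors the PostGIS example from the specification text.
    meta = opaque_field_metadata("geometry", "PostGIS")
    print(meta["ARROW:extension:metadata"])
    print(parse_opaque_metadata(meta["ARROW:extension:metadata"]))
```

Consistent with the rationale in the diff, the parsed values are meant for
display to human users, not as keys for dispatching on particular vendors or
type names.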
