Re: [PR] WIP: [Format] Add Other canonical extension type [arrow]

via GitHub Thu, 13 Jun 2024 05:09:10 -0700


jorisvandenbossche commented on code in PR #41823:
URL: https://github.com/apache/arrow/pull/41823#discussion_r1638091587



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,132 @@ UUID
    A specific UUID version is not required or guaranteed. This extension represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
 
+Unknown
+=======
+
+Unknown represents a type or array that an Arrow-based system received from an
+external (often non-Arrow) system, which it cannot interpret itself or did not
+have support for in advance.  In this case, it can pass on Unknown to its own
+clients to communicate that a field exists, but that it cannot interpret the
+field or data.
+
+Extension parameters:
+
+* Extension name: ``arrow.unknown``.
+
+* The storage type of this extension is any type.  If there is no underlying
+  data, the storage type should be Null.  If there is data, the storage type
+  should preferably be binary or fixed-size binary, but may be any type.
+
+* Extension type parameters:
+
+  * **type_name** = the name of the unknown type in the external system.
+  * **vendor_name** = the name of the external system.
+
+* Description of the serialization:
+
+  A valid JSON object containing the parameters as fields.  In the future,
+  additional fields may be added, but all fields current and future are never
+  required to interpret the array.
+
+Examples:
+
+* Consider a Flight SQL service that supports connecting external databases.
+  Its clients may request the names and types of columns of tables in those
+  databases, but then there may be types that the Flight SQL service does not
+  recognize, due to lack of support or because those systems have their own
+  extensions or user-defined types.
+
+  The Flight SQL service can use the Unknown[Null] type to report that a
+  column exists with a particular name and type name in the external database.
+  This lets clients know that a column exists, but is not supported.  Null is
+  used as the storage type here because only schemas are involved.
+
+  The client would presumably not be able to query such columns from the
+  Flight SQL service, but there may be other columns in the table that it
+  could query, or it could prepare a query that references the unknown column
+  in an expression and produces a result that *is* supported.  The Unknown
+  type is a better experience than erroring or silently dropping columns from
+  the catalog.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "varray", "vendor_name": "Oracle"}
+
+* The ADBC PostgreSQL driver may get bytes for a field whose type it does not
+  recognize.  This is because of how PostgreSQL and its wire protocol work:
+  the driver will always get bytes for fields and must implement support for
+  all potential types to interpret those bytes.  But the driver cannot know
+  about all types in advance, as there may be extensions (e.g. PostGIS for
+  geospatial functionality).
+
+  Beacuse the driver still has the raw bytes, it can use Unknown[Binary] to
+  return those bytes to the application, which may be able to parse the data
+  itself.  Unknown differentiates the column from an actual binary column.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "geometry", "vendor_name": "PostGIS"}
+
+* The ADBC PostgreSQL driver may also get bytes for a field whose type it can
+  only partially recognize.  For example, PostgreSQL supports `composite types
+  <https://www.postgresql.org/docs/current/rowtypes.html>`_ that ascribe new
+  semantics to existing types, somewhat like Arrow extension types.
+
+  The driver would be able to parse the underlying type in this case.
+  However, the driver may still with to use the Unknown type.  Consider the

Review Comment:
   ```suggestion
     However, the driver may still want to use the Unknown type.  Consider the
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,132 @@ UUID
    A specific UUID version is not required or guaranteed. This extension represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
 
+Unknown
+=======
+
+Unknown represents a type or array that an Arrow-based system received from an
+external (often non-Arrow) system, which it cannot interpret itself or did not
+have support for in advance.  In this case, it can pass on Unknown to its own
+clients to communicate that a field exists, but that it cannot interpret the
+field or data.
+
+Extension parameters:
+
+* Extension name: ``arrow.unknown``.
+
+* The storage type of this extension is any type.  If there is no underlying
+  data, the storage type should be Null.  If there is data, the storage type
+  should preferably be binary or fixed-size binary, but may be any type.
+
+* Extension type parameters:
+
+  * **type_name** = the name of the unknown type in the external system.
+  * **vendor_name** = the name of the external system.
+
+* Description of the serialization:
+
+  A valid JSON object containing the parameters as fields.  In the future,
+  additional fields may be added, but all fields current and future are never
+  required to interpret the array.
+
+Examples:
+
+* Consider a Flight SQL service that supports connecting external databases.
+  Its clients may request the names and types of columns of tables in those
+  databases, but then there may be types that the Flight SQL service does not
+  recognize, due to lack of support or because those systems have their own
+  extensions or user-defined types.
+
+  The Flight SQL service can use the Unknown[Null] type to report that a
+  column exists with a particular name and type name in the external database.
+  This lets clients know that a column exists, but is not supported.  Null is
+  used as the storage type here because only schemas are involved.
+
+  The client would presumably not be able to query such columns from the
+  Flight SQL service, but there may be other columns in the table that it
+  could query, or it could prepare a query that references the unknown column
+  in an expression and produces a result that *is* supported.  The Unknown
+  type is a better experience than erroring or silently dropping columns from
+  the catalog.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "varray", "vendor_name": "Oracle"}
+
+* The ADBC PostgreSQL driver may get bytes for a field whose type it does not
+  recognize.  This is because of how PostgreSQL and its wire protocol work:
+  the driver will always get bytes for fields and must implement support for
+  all potential types to interpret those bytes.  But the driver cannot know
+  about all types in advance, as there may be extensions (e.g. PostGIS for
+  geospatial functionality).
+
+  Beacuse the driver still has the raw bytes, it can use Unknown[Binary] to
+  return those bytes to the application, which may be able to parse the data
+  itself.  Unknown differentiates the column from an actual binary column.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "geometry", "vendor_name": "PostGIS"}
+
+* The ADBC PostgreSQL driver may also get bytes for a field whose type it can
+  only partially recognize.  For example, PostgreSQL supports `composite types
+  <https://www.postgresql.org/docs/current/rowtypes.html>`_ that ascribe new
+  semantics to existing types, somewhat like Arrow extension types.
+
+  The driver would be able to parse the underlying type in this case.
+  However, the driver may still with to use the Unknown type.  Consider the
+  example in the PostgreSQL documentation above of a ``complex`` type.  Just
+  mapping the type to a plain Arrow ``struct`` type would lose the semantics
+  of that custom type.  In this case, the driver can use Unknown[Struct].  The
+  driver would never actually be able to directly support the type in this
+  example, since these types are defined by database administrators, not by
+  the developers.
+
+  An example of the extension metadata would be::
+
+    {"type_name": "database_name.schema_name.complex", "vendor_name": "PostgreSQL"}
+
+* The JDBC adapter in the Arrow Java libraries converts JDBC result sets into
+  Arrow arrays, and also to get Arrow schemas from result sets.  JDBC,
+  however, allows drivers to return `arbitrary Java objects
+  <https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#OTHER>`_.
+
+  Currently, the JDBC adapter simply errors, making usage of the adapter a
+  minefield where results are all-or-nothing, even if an application just
+  wants to fetch a schema.  Instead, the driver could use Unknown[Null] as a

Review Comment:
   This "currently" will get out of date rather quickly (assuming there are
   plans to make use of this extension type in the JDBC adapter). So maybe we
   could future-proof it by saying something like "Without this extension
   type, the adapter would simply error ..."
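
   For readers following the quoted hunk, the serialization it describes (a JSON object carrying `type_name` and `vendor_name`) can be sketched as below. This is only an illustration, not part of the PR: the helper names `unknown_field_metadata` and `parse_unknown_metadata` are hypothetical, and the sketch assumes the metadata travels under the usual Arrow extension field-metadata keys `ARROW:extension:name` and `ARROW:extension:metadata`. It uses only the Python standard library and does not touch any Arrow API.

```python
import json

# Field-metadata keys that Arrow uses to mark extension types.
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


def unknown_field_metadata(type_name, vendor_name):
    """Hypothetical helper: build field metadata for the proposed extension."""
    # The serialization is a JSON object containing the parameters as fields.
    serialized = json.dumps({"type_name": type_name, "vendor_name": vendor_name})
    return {
        EXTENSION_NAME_KEY: "arrow.unknown",
        EXTENSION_METADATA_KEY: serialized,
    }


def parse_unknown_metadata(metadata):
    """Hypothetical helper: read the parameters back as (type_name, vendor_name).

    No field is required to interpret the array, so missing fields come back
    as None and unrecognized (future) fields are ignored.
    """
    params = json.loads(metadata.get(EXTENSION_METADATA_KEY, "{}"))
    return params.get("type_name"), params.get("vendor_name")


# Mirrors the PostGIS example from the quoted documentation.
postgis_metadata = unknown_field_metadata("geometry", "PostGIS")
```

   A consumer written this way keeps working if later revisions add fields to the JSON object, which matches the hunk's statement that no field is required to interpret the array.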



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

