paleolimbot commented on code in PR #41823:
URL: https://github.com/apache/arrow/pull/41823#discussion_r1635666920
##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,77 @@ UUID
A specific UUID version is not required or guaranteed. This extension
represents UUIDs as FixedSizeBinary(16) with big-endian notation and does not
interpret the bytes in any way.
+Other
+=====
+
+Other represents a type or array that one Arrow-based system received from an
+external (likely non-Arrow) system, but cannot interpret itself. In this
+case, the Other type explicitly communicates the name and presence of a field
+to downstream clients.
+
+For example:
+
+* A Flight SQL service may support connecting external databases. In this
+ case, its catalog (``GetTables`` etc.) should reflect the names and types of
+ tables in external databases. But those external systems may support types
+ it does not recognize. Instead of erroring or silently dropping columns
+ from the catalog, it can use the Other[Null] type to report that a column
+ exists with a particular name and type name in the external database; this
+ lets clients know that a column exists, but is not supported.
+
+* The ADBC PostgreSQL driver, because of how the PostgreSQL wire protocol
+ works, may get bytes for a field whose type it does not recognize (say, a
+ geospatial type). It can still return the bytes to the application which
+ may be able to parse the data itself. In that case, it can use the
+ Other[binary] type to return the column data. The Other type differentiates
+ the column from actual binary columns.
+
+Of course, the intermediate system *could* implement a custom extension type
+for these example types. But there is no way in general that every type can
+be known in advance. In such cases, the Other type allows the system to
+explicitly note that it does not support some type or field, without silently
+losing data or sending irrelevant errors. It could also pretend to support
+the types by making up extension types on the fly. But this misleads
+downstream systems who cannot tell if the type is supported or not.
+
+Extension parameters:
+
+* Extension name: ``arrow.other``.
+
+* The storage type of this extension is any type. If there is no underlying
+ data, the storage type should be Null. If there is data (because the system
+ got bytes or some other data it does not know how to interpret), the storage
+ type should preferably be binary or fixed-size binary, but may be any type.
Review Comment:
> I don't understand at all. Perhaps you can explain using a concrete
> example of information loss?
This is the example I had in mind.
```
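import duckdb

# Load the spatial extension, which defines the POINT_2D logical type
duckdb.sql("INSTALL spatial;")
duckdb.sql("LOAD spatial;")

# DuckDB itself knows the column is a POINT_2D...
result = duckdb.sql("SELECT ST_Point2D(0, 1) as geom")
result.types
#> [POINT_2D]

# ...but the Arrow export keeps only the storage type; the type name is gone
tab = result.to_arrow_table()
tab.schema.field("geom")
#> pyarrow.Field<geom: struct<x: double, y: double>>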
```
The information being lost is specifically the name of the type and any
information that lived at the type level. I don't think it's in scope to pass
along arbitrary type-level information (in this case maybe a coordinate
reference system), but at least with Other there is a way to signal that
information loss occurred. Without it, the producer either has to abort
whenever it sees a logical type with no Arrow equivalent whose values will be
interpreted correctly, or export the bare storage type as in the output above.
I don't want to get hung up on DuckDB specifically (they might not be
interested, or might implement the ability for runtime-loadable extensions to
customize their Arrow export representation before they get a chance to
implement this). I just wanted to demonstrate an example where the arbitrary
payload that an ADBC driver (or general Arrow consumer) receives is not
binary.
> Perhaps if DuckDB returns an extension type that is not registered at
> consumer side?
I think that producers can and should export *Arrow* extension types
whenever possible! Runtime-loadable extensions have come up as examples here a
few times because they introduce types that a general-purpose
something-to-Arrow converter (e.g., ADBC drivers, Flight SQL implementations,
DuckDB's Arrow exporter) can never know at compile time.
> But why would we use Other in this case?
Maybe circling back here: if an ADBC driver gets the query `"SELECT
ST_Point2D(0, 1) as geom"`, then with Other, `AdbcStatementExecuteSchema()`
would have a mechanism to communicate the type it received.
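
As a rough sketch of what that could look like on the wire: Arrow carries
extension types as field metadata under the `ARROW:extension:name` and
`ARROW:extension:metadata` keys, so a producer could tag the struct storage
from the example above. The JSON layout (`type_name`, `vendor_name`) and the
helper names here are assumptions made for illustration, not something taken
from the proposal text quoted above.

```python
import json

# Field-metadata keys Arrow uses to mark a field as an extension type.
EXT_NAME_KEY = b"ARROW:extension:name"
EXT_METADATA_KEY = b"ARROW:extension:metadata"


def other_field_metadata(type_name, vendor_name):
    """Build field metadata tagging some storage type as arrow.other.

    Hypothetical helper: the JSON keys are an assumption for this sketch.
    """
    payload = json.dumps(
        {"type_name": type_name, "vendor_name": vendor_name}, sort_keys=True
    )
    return {
        EXT_NAME_KEY: b"arrow.other",
        EXT_METADATA_KEY: payload.encode("utf-8"),
    }


def describe(storage_type, metadata):
    """What a consumer could report for a field it cannot interpret."""
    if metadata.get(EXT_NAME_KEY) != b"arrow.other":
        # Not tagged: all the consumer can see is the storage type.
        return storage_type
    info = json.loads(metadata[EXT_METADATA_KEY])
    return f"other[{storage_type}] (was {info['vendor_name']} {info['type_name']})"


meta = other_field_metadata("POINT_2D", "DuckDB")
summary = describe("struct<x: double, y: double>", meta)
```

The point of the sketch is only that the consumer can tell "this struct was
something else upstream" apart from "this is a plain struct", which is the
distinction that gets lost today.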