lidavidm commented on code in PR #41823:
URL: https://github.com/apache/arrow/pull/41823#discussion_r1636068603
##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,77 @@ UUID
 A specific UUID version is not required or guaranteed. This extension represents
 UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
+Other
+=====
+
+Other represents a type or array that one Arrow-based system received from an
+external (likely non-Arrow) system, but cannot interpret itself. In this
+case, the Other type explicitly communicates the name and presence of a field
+to downstream clients.
+
+For example:
+
+* A Flight SQL service may support connecting external databases. In this
+  case, its catalog (``GetTables`` etc.) should reflect the names and types of
+  tables in external databases. But those external systems may support types
+  it does not recognize. Instead of erroring or silently dropping columns
+  from the catalog, it can use the Other[Null] type to report that a column
+  exists with a particular name and type name in the external database; this
+  lets clients know that a column exists, but is not supported.
+
+* The ADBC PostgreSQL driver, because of how the PostgreSQL wire protocol
+  works, may get bytes for a field whose type it does not recognize (say, a
+  geospatial type). It can still return the bytes to the application which
+  may be able to parse the data itself. In that case, it can use the
+  Other[binary] type to return the column data. The Other type differentiates
+  the column from actual binary columns.
+
+Of course, the intermediate system *could* implement a custom extension type
+for these example types. But there is no way in general that every type can
+be known in advance. In such cases, the Other type allows the system to
+explicitly note that it does not support some type or field, without silently
+losing data or sending irrelevant errors. It could also pretend to support
+the types by making up extension types on the fly. But this misleads
+downstream systems who cannot tell if the type is supported or not.
+
+Extension parameters:
+
+* Extension name: ``arrow.other``.
+
+* The storage type of this extension is any type. If there is no underlying
+  data, the storage type should be Null. If there is data (because the system
+  got bytes or some other data it does not know how to interpret), the storage
+  type should preferably be binary or fixed-size binary, but may be any type.
+
+* Extension type parameters:
+
+  * **type_name** = the name of the unknown type in the external system.
+  * **vendor_name** = the name of the external system.
+
+* Description of the serialization:
+
+  A valid JSON object containing the parameters as fields. In the future,
+  additional fields may be added, but all fields current and future are never
+  required to interpret the array.
+
+  For example:
+
+  - The PostgreSQL ``polygon`` type may be represented as Other[binary] with
+    metadata ``{"type_name": "polygon", "vendor_name": "PostgreSQL"}``.
+  - The PostGIS ``geometry`` type may be represented as Other[binary] with
+    metadata ``{"type_name": "geometry", "vendor_name": "PostGIS"}``.
+  - A Flight SQL service may return an array type as Other[Null] with metadata
+    ``{"type_name": "varray", "vendor_name": "Oracle"}``.

Review Comment:
Another example is the JDBC adapter (which we do have). JDBC allows returning arbitrary Java objects (this is part of the JDBC type system too), and we have no way of mapping those to Arrow types generically. Right now this is a hard error in all cases, which makes the JDBC adapter less useful: you may still be interested in the types of the other columns, which can be converted.
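
To make the proposed serialization concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not an implementation from the PR: the helper names (`serialize_other`, `deserialize_other`, `field_metadata`) are hypothetical, and `arrow.other` is the name proposed in the diff above, which may change during review. The `ARROW:extension:name` and `ARROW:extension:metadata` keys are the field-metadata keys Arrow uses for extension types in general.

```python
import json

# Extension name as proposed in the diff above (assumption: may change in review).
EXTENSION_NAME = "arrow.other"


def serialize_other(type_name, vendor_name):
    """Serialize the two proposed parameters as a JSON object (hypothetical helper)."""
    return json.dumps({"type_name": type_name, "vendor_name": vendor_name})


def deserialize_other(serialized):
    """Read the parameters back, ignoring any fields added in the future.

    Uses .get() because the proposal says no field is required to
    interpret the array, so a missing field yields None instead of an error.
    """
    params = json.loads(serialized)
    if not isinstance(params, dict):
        raise ValueError("arrow.other metadata must be a JSON object")
    return params.get("type_name"), params.get("vendor_name")


def field_metadata(type_name, vendor_name):
    """Build the field-level metadata an extension-aware consumer would look for."""
    return {
        "ARROW:extension:name": EXTENSION_NAME,
        "ARROW:extension:metadata": serialize_other(type_name, vendor_name),
    }


# The PostgreSQL ``polygon`` example from the diff.
metadata = field_metadata("polygon", "PostgreSQL")
print(metadata["ARROW:extension:name"])
print(deserialize_other(metadata["ARROW:extension:metadata"]))
```

A consumer that does not recognize the extension still sees the storage type (Null, binary, and so on) and can pass the column through untouched, which is the behavior the JDBC case above would want in place of a hard error.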