Also, the driver shouldn't assume UTF-8 (or any other encoding) when constructing a String from a Binary vector, since that defeats the point of a binary vector! Perhaps this should be configurable somehow (though having a lot of little configuration options is also not ideal). A parameterized extension type is probably the best way to solve this.
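
To make the idea concrete, here is a rough sketch in plain Python. It does not use any Arrow library and is not how any driver does this today. The two `ARROW:extension:*` keys are the field-metadata keys the columnar format uses for extension types; everything else (the `example.encoded_text` name, the JSON `encoding` parameter, the `BinaryField` stand-in and both helper functions) is made up for illustration.

```python
import json
from dataclasses import dataclass, field

# Field-metadata keys the Arrow columnar format uses for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, invented for this sketch.
ENCODED_TEXT = "example.encoded_text"


@dataclass
class BinaryField:
    """Stand-in for a Binary column: raw byte values plus field metadata."""
    values: list
    metadata: dict = field(default_factory=dict)


def make_encoded_text(strings, encoding):
    """Store strings as bytes and record the encoding as the type parameter."""
    return BinaryField(
        values=[s.encode(encoding) for s in strings],
        metadata={
            EXT_NAME_KEY: ENCODED_TEXT,
            EXT_META_KEY: json.dumps({"encoding": encoding}),
        },
    )


def read_values(column, known_extensions=(ENCODED_TEXT,)):
    """Decode only if the extension is recognized; otherwise return raw bytes."""
    name = column.metadata.get(EXT_NAME_KEY)
    if name == ENCODED_TEXT and name in known_extensions:
        encoding = json.loads(column.metadata[EXT_META_KEY])["encoding"]
        return [v.decode(encoding) for v in column.values]
    # Unknown or absent extension: treat as plain Binary, assume no encoding.
    return list(column.values)


col = make_encoded_text(["héllo", "日本語"], "utf-16-le")
aware = read_values(col)                         # consumer that knows the type
unaware = read_values(col, known_extensions=())  # consumer that does not
```

A consumer that knows the extension gets `str` values decoded with the declared encoding, and one that doesn't still gets the untouched bytes, so nothing ever has to guess UTF-8.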

On Fri, Sep 30, 2022, at 13:04, Antoine Pitrou wrote:
> On 30/09/2022 at 18:57, Kevin Bambrick wrote:
>> The issue I am facing is sending a UTF-16 string over the wire.
>
> Ok, then you can just transcode the strings before sending them as
> String, *or* you can send them as Binary (not String).
>
> Where do these UTF-16 strings come from?
>
> > What would the difference be between adding a new data type and an
> > extension type for UTF-16?
>
> An extension type is for the most part a piece of metadata attached to
> data represented in an existing data type (such as Binary), and that
> consumers can optionally recognize in order to better interpret the data.
>
> So if one were to make a UTF-16 extension type based on the Binary data
> type, implementations could either recognize it as Binary or as UTF-16,
> depending on whether they know about that particular extension type or not.
>
> (in practice, it would make more sense to make a parameterized "encoded
> text" extension type, instead of making a specific one for UTF-16)
>
> I recommend reading about the Arrow columnar format and especially this
> section about extension types:
> https://arrow.apache.org/docs/format/Columnar.html#extension-types
>
>
> Regards
>
> Antoine.