The issue I am facing is sending UTF-16 strings over the wire. The
application I am working on needs to support UTF-16 strings. The specific
point I am stuck on is integrating with the Flight SQL driver (I am
experimentally adopting it ahead of its release). Right now, my
implementation of FlightSqlProducer transports strings by dumping the
UTF-16 bytes of each string into a VarBinaryVector and shipping that over
the wire.

The issue then occurs when, on the ResultSet, the client calls
".getString(int columnIndex)". The ArrowFlightJdbcBinaryVectorAccessor
implementation of getString tries to construct the string from the bytes at
this index as UTF-8. This makes sense, as UTF-8 is the only string encoding
supported in Arrow. In my case though, shipping UTF-16 string data this way
leads to incorrect data when going through the driver.
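
To illustrate the mismatch, here is a minimal sketch using only the JDK. It
is not the driver's code: the class and method names are made up for this
example, and the choice of UTF-16LE byte order is an assumption, since any
UTF-16 variant shows the same problem.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Arrow or the Flight SQL driver.
public class Utf16Mismatch {
    // Producer side: dump the string's UTF-16 bytes (byte order assumed LE).
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE);
    }

    // What a reader that assumes UTF-8 effectively does with those bytes.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // What the reader would have to do to recover the original string.
    static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo";
        byte[] wire = encodeUtf16(original);
        // The UTF-8 decode does not reproduce the original string.
        System.out.println(original.equals(decodeAsUtf8(wire)));
        // The UTF-16 decode does.
        System.out.println(original.equals(decodeAsUtf16(wire)));
    }
}
```

Nothing in the bytes themselves tells the reader which of the two decodes
to apply, which is why the encoding has to be carried by the column's type.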

As things stand, my workaround would be to fork the driver code and change
the implementation in ArrowFlightJdbcBinaryVectorAccessor to construct the
String as UTF-16 instead. I would prefer not to do this, but if UTF-16
strings are not going to be a type supported by Arrow, I'm not sure how
else the Flight driver would discern when to deserialize the bytes in a
VarBinaryVector as UTF-8 or UTF-16.

What would the difference be between adding a new data type and an
extension type for UTF-16?

Regards,
Kevin.



On Fri, Sep 30, 2022 at 10:17 AM Antoine Pitrou <anto...@python.org> wrote:

> On Thu, 29 Sep 2022 15:19:59 -0400
> Larry White <ljw1...@gmail.com> wrote:
> > Interesting. This doesn't seem to be a Java issue, per se, then. I've seen
> > admonitions in various Arrow Java threads to always specify the Charset for
> > the conversion - and so assumed more than one Charset was legal - and have
> > written Arrow Java test code that uses other charsets without ill effect.
> >
> > I've never attempted to transport that data over the wire or export it
> > using the C-Data Interface, however. It seems like that's where it would
> > fall down.
>
> For performance, most consumers of Arrow data would not necessarily
> check that it's valid utf-8. They would however definitely misinterpret
> it.
>
> The "string" (also called "utf8" in some implementations) data type is
> definitely specified as being valid utf-8.
>
> Given the dwindling popularity of utf-16 and the growing universality
> of utf-8, I don't think it would be a good idea to add another datatype
> for it. However, an extension type would be doable.
>
> I think a step back is needed first: what is the use case for
> transporting utf-16 data in Arrow?
>
> Regards
>
> Antoine.
>
>
>
> >
> > On Thu, Sep 29, 2022 at 3:01 PM James Henderson <j...@juxt.pro> wrote:
> >
> > > > FWIW we'd made a similar assumption. In Schema.fbs [1] the type is called
> > > > Utf8, as well as the Java `ArrowType.Utf8` class - is this a required
> > > > assumption to work with other language Arrow libs, maybe?
> > >
> > > James
> > >
> > > [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs
> > >
> > > On Thu, 29 Sept 2022 at 18:57, Larry White <ljw1...@gmail.com> wrote:
> > >
> > > > Hi Kevin,
> > > >
> > > > > I don't know of any particular restriction regarding string encoding.
> > > > > VarCharVector stores data as a byte array, and the encoding can be set
> > > > > using the Charset class when you convert Strings to and from bytes. Since
> > > > > java strings use UTF-16 internally, I would expect this to 'just work'.
> > > >
> > > > larry
> > > >
> > > > > On Thu, Sep 29, 2022 at 12:46 PM Kevin Bambrick <kevinbambri...@gmail.com> wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > > Was just wondering was support for UTF-16 Strings considered? As far as I
> > > > > > am aware VarChar vectors only support UTF-8. Are they something that may be
> > > > > > supported in the future?
> > > > >
> > > > > Regards.
> > > > > Kevin.
> > > > >
> > > >
> > >
> > >
> > > --
> > > *James Henderson*
> > > XTDB Development Manager at *JUXT*
> > >
> > > Email j...@juxt.pro
> > > Website https://juxt.pro
> > >
> > >
> >
>
>
>
>
