Mentioned this already to Joris, but want to make sure we don't miss it.

The C Data Interface, and thus ARROW:extension:metadata, was mostly designed
for shipping data between different processes on the same host.
If we start using the spec for further uses, including saving it to files
that could be read across different architectures, we run into a problem:
ARROW:extension:metadata doesn't in any way specify the endianness of fields
like int32 num_items, int32 name_len, etc., and that's something we must do
if we plan to ship that data through files or networks.
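
For illustration, pinning the byte order could look like the minimal C sketch
below. It assumes little-endian on the wire (that is only my assumption here,
the spec would have to pick one), and put_le32, get_le32 and encode_one_pair
are hypothetical helper names, not part of any Arrow API. It encodes a single
key/value pair using the layout quoted below (int32 num_items, int32 name_len,
char[name_len], int32 value_len, char[value_len]).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value as 4 little-endian bytes, whatever the host order. */
static void put_le32(unsigned char *out, int32_t value) {
    uint32_t u = (uint32_t)value;
    out[0] = (unsigned char)(u & 0xFF);
    out[1] = (unsigned char)((u >> 8) & 0xFF);
    out[2] = (unsigned char)((u >> 16) & 0xFF);
    out[3] = (unsigned char)((u >> 24) & 0xFF);
}

/* Read 4 little-endian bytes back into a host int32_t. */
static int32_t get_le32(const unsigned char *in) {
    uint32_t u = (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
                 ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
    int32_t v;
    memcpy(&v, &u, sizeof v); /* reinterpret the bits without a signed cast */
    return v;
}

/* Hypothetical helper: encode exactly one key/value pair.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t encode_one_pair(unsigned char *buf, size_t cap,
                              const char *name, const char *value) {
    assert(buf != NULL && name != NULL && value != NULL);
    size_t name_len = strlen(name);
    size_t value_len = strlen(value);
    if (name_len > INT32_MAX || value_len > INT32_MAX) return 0;
    size_t needed = 4 + 4 + name_len + 4 + value_len;
    if (needed > cap) return 0;

    unsigned char *p = buf;
    put_le32(p, 1);                  p += 4;         /* num_items */
    put_le32(p, (int32_t)name_len);  p += 4;         /* name_len */
    memcpy(p, name, name_len);       p += name_len;  /* name bytes */
    put_le32(p, (int32_t)value_len); p += 4;         /* value_len */
    memcpy(p, value, value_len);                     /* value bytes */
    return needed;
}
```

A reader on a big-endian host would call get_le32 on the same bytes and get
the same lengths back, which copying a native int32 straight out of the buffer
would not guarantee.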

JSON would definitely get rid of the endianness problem at the cost of a
greater size and a more complex parser. But there are super-minimal JSON
parsers designed specifically for embedding, like jsmn (pronounced
"jasmine", https://github.com/zserge/jsmn ).

On Wed, Feb 9, 2022 at 2:51 AM Dewey Dunnington <de...@voltrondata.com>
wrote:

> I'll share a bit more about geospatial extension types that Joris
> mentioned. I'm new to the Arrow community and didn't know that there were
> any restrictions on metadata values (the C Data interface docs don't seem
> to indicate that there are restrictions, or if it's there I missed it!), so
> I used the same encoding for the ARROW:extension:metadata that's used to
> encode the parent metadata (int32 num_items, int32 name_len,
> char[name_len], int32 value_len, char[value_len],  etc..). I did this
> because I needed two key/value pairs (geodesic = true/false; crs =
> some_coordinate_reference_system) and already had the code to iterate over
> the parent metadata. I'm not saying that it's any pinnacle of elegant code
> (still very much a prototype), but it only takes about 30 lines of C to do
> this [1].
>
> I prototyped the extension types for geospatial using the C data interface,
> the idea being that a header-only helper file (geoarrow.hpp) could be
> distributed that would make it an attractive and easy alternative to
> well-known binary (WKB) to pass geometries around between libraries (e.g.,
> GEOS, GDAL, PROJ). Requiring anybody who uses an extension type to also
> vendor a JSON parser [2] seems a bit anti-social and restricts where that
> extension type is useful, although I understand that it's not the use case
> that many might have.
>
> There are definitely reasonable ways to do what I'm trying to do without
> resorting to a binary encoding, and JSON could probably even work...I'm
> just trying to share the use-case since it seems like this kind of
> environment isn't how folks envisioned extension types being used.
>
> [1] https://github.com/paleolimbot/geoarrow/blob/master/src/internal/geoarrow.hpp#L511-L542
> [2] The commonly vendored JSON parser in geospatial libraries is this one:
> https://github.com/nlohmann/json
>
> On Tue, Feb 8, 2022 at 7:58 PM Weston Pace <weston.p...@gmail.com> wrote:
>
> > I think I'm +0 but lean slightly towards JSON.
> >
> > In favor of binary I would guess that most extension types are going
> > to have relatively simple parameterization (to the point that
> > protobuf/flatbuffers isn't really needed).  For example, the substrate
> > consumer PR has five extension types at the moment (e.g. uuid,
> > varchar) and only two of them are parameterized and each of these by a
> > single int32_t.  It might be interesting to see what kinds of
> > extension types the geospatial community uses.
> >
> > That being said, this sort of parsing isn't really on any kind of
> > critical path.  It's very likely that users (not Arrow developers)
> > will be creating and working with extension types.  These users are
> > likely going to default to JSON (or pickle or XML).  If our "well
> > known types" use JSON then it will be more easily recognizable to
> > users what is going on.
> >
> > -Weston
> >
> > On Tue, Feb 8, 2022 at 8:14 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> > >
> > > On Tue, 8 Feb 2022 at 17:37, Jorge Cardoso Leitão
> > > <jorgecarlei...@gmail.com> wrote:
> > >
> > > > ...
> > > >
> > > > Wrt to binary, imo the challenge is:
> > > > * we state that backward incompatible changes to the c data interface
> > > > require a new spec [1]
> > > >
> > >
> > > Note that this discussion wouldn't change anything about the C Data
> > > Interface spec itself. The discussion is only about the *value* that is
> > > put in one of the key-value metadata fields. The C Data Interface spec
> > > defines how the metadata needs to be stored, but doesn't specify
> > > anything about the actual value of one of the key-value metadata fields.
> > >
> > >
> > > > * we state that the metadata is a binary string [2]
> > > > * a valid string is a subset of all valid byte arrays and thus
> > > > removing "*string*" from the spec is backward incompatible
> > > >
> > > > If we write invalid utf8 to it and a reader assumes utf8 when reading
> > > > it, we trigger undefined behavior.
> > > >
> > > > I was a bit surprised by ARROW-15613 - my understanding is that the
> > > > C++ implementation is not following the spec, and if we at arrow2 were
> > > > not checking for utf8, we would be exposing a vulnerability (at least
> > > > according to Rust's standards). We just checked it out of luck (it is
> > > > O(1), so why not).
> > > >
> > >
> > > Yes, the C++ implementation is indeed not following the spec. See the
> > > "[DISCUSS] Binary Values in Key value pairs" thread (
> > > https://lists.apache.org/thread/blmj0cgv34dgdxqd3ow60ln68khnz0qr).
> > > Let's maybe keep this part of the discussion there?
> >
>
