>> The benefit is that IR components don't interact much with `flatbuffers`
>> or `flatc` directly.
>> [...]
>>
>> One counter-proposal might be to just put the compute IR IDL in a separate
>> repo, but that isn't tenable because the compute IR needs arrow's type
>> information contained in `Schema.fbs`.
> This argument seems predicated on the hypothesis that the compute IR
> will use Flatbuffers. Is it set in stone?

+1 for the original proposal (mirror repo for specs). I don't think we have
to figure out the IR format. It makes sense for all language-independent
specs to be in a single place regardless of format. If IR picked JSON, I
would still argue the JSON schemas for IR belong in the same repository as
the Arrow columnar format Flatbuffers files. It makes it clear what is spec
and what is implementation/toolkit, especially since a mirror repo should
be pretty low maintenance.

On Wed, Aug 11, 2021 at 11:34 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 11/08/2021 à 23:06, Phillip Cloud a écrit :
> > On Wed, Aug 11, 2021 at 4:22 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >> Le 11/08/2021 à 22:16, Phillip Cloud a écrit :
> >>>
> >>> Yeah, that is a drawback here, though I don't see needing to run
> >>> flatc as a major downside given the upside of not having to write
> >>> additional code to move between formats.
> >>
> >> That's only an advantage if you already know how to read the Arrow IPC
> >> format (and, yes, in this case you already run `flatc`). Some projects
> >> probably don't care about Arrow IPC (Dask, for example).
> >
> > I don't think it's about the IPC though, at least for the compute IR
> > use case. Am I missing something there?
>
> If you're not handling the Arrow IPC format, then you probably don't
> have an encoder/decoder for Schema.fbs, so the "upside of not having to
> write additional code to move between formats" doesn't exist (unless I'm
> misunderstanding your point?).
>
> > I do think a downside of not using something like JSON or msgpack is
> > that schema validation must be implemented by both the producer and
> > the consumer. That means we'd have a number of other consequential
> > decisions to make:
> >
> > * Do we provide the validation library?
> > * If not, do all the languages arrow supports have high-quality
> >   libraries for validating schemas?
> > * If so, then we have to implement/maintain/release/bugfix that.
>
> This is true. However, Flatbuffers doesn't validate much on its own,
> either, because its IDL is not expressive enough. For example,
> `Schema.fbs` allows you to declare an INT8 field with children, a LIST
> field without any children, a non-nullable NULL field...
>
> (also, there's JSON Schema: https://json-schema.org/)
>
> Regards
>
> Antoine.
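[Editor's note: to make the validation point above concrete, here is a minimal sketch of the kind of structural checks that the Flatbuffers IDL cannot express, written as a hand-rolled validator. The dict-based field layout (`type`, `nullable`, `children` keys) is invented for illustration and is not the actual Arrow schema representation; the three ill-formed cases are the ones Antoine lists.]

```python
def validate_field(field: dict) -> list[str]:
    """Return structural validation errors for one field (recursively).

    These rules are exactly the kind that an IDL like Flatbuffers cannot
    enforce and that a producer/consumer validation library would need:
      - a primitive (INT8) field must be a leaf node,
      - a LIST field must have exactly one child,
      - a NULL field must be nullable.
    """
    errors = []
    ftype = field.get("type")
    children = field.get("children", [])
    if ftype == "int8" and children:
        errors.append("INT8 field must not have children")
    if ftype == "list" and len(children) != 1:
        errors.append("LIST field must have exactly one child")
    if ftype == "null" and not field.get("nullable", False):
        errors.append("NULL field must be nullable")
    for child in children:  # validate the subtree as well
        errors.extend(validate_field(child))
    return errors

# The three ill-formed declarations mentioned in the message above:
print(validate_field({"type": "int8", "children": [{"type": "int8"}]}))
# → ['INT8 field must not have children']
print(validate_field({"type": "list", "children": []}))
# → ['LIST field must have exactly one child']
print(validate_field({"type": "null", "nullable": False}))
# → ['NULL field must be nullable']
```

Each Arrow language implementation would need an equivalent of this, which is the maintenance cost being weighed against reusing off-the-shelf JSON Schema validators.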