Re: Re: Human-readable version of Arrow Schema

2020-05-08 Thread Christian Hudon
Hi Hans,

Cool. In case it wasn't clear, though, I didn't decide on any of those field
names (or even the structure) in my approach. I serialize the Schema C++
object to Flatbuffers (using the already existing Flatbuffers schema
definition), and then use the Flatbuffers library to convert that
Flatbuffers object to JSON. So Flatbuffers is doing most of the work here.
But happy to hear this is inspiring for you. What's your use case for
this?

  Christian
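
For anyone comparing the two layouts in this thread, here is a minimal Python sketch (standard library only) that maps a field in the Flatbuffers-generated layout (`type_type` plus `type`) to a factory-style name of the kind used in the "datatype" key of the Python result. The helper `datatype_name`, the `list_(...)` spelling, and the small set of handled types are assumptions made for illustration; none of this is code from the pull request or from pyarrow.

```python
# Hypothetical helper, for illustration only (not from the pull request).
PRECISION_NAMES = {"HALF": "float16", "SINGLE": "float32", "DOUBLE": "float64"}


def datatype_name(field):
    """Return a factory-style type name for a Flatbuffers-style field dict.

    Only Int, FloatingPoint and List are handled; anything else raises.
    """
    kind = field["type_type"]
    params = field.get("type", {})
    if kind == "Int":
        prefix = "int" if params["is_signed"] else "uint"
        return f"{prefix}{params['bitWidth']}"
    if kind == "FloatingPoint":
        return PRECISION_NAMES[params["precision"]]
    if kind == "List":
        # A list has exactly one child describing its items.
        (item,) = field["children"]
        return f"list_({datatype_name(item)})"
    raise ValueError(f"unhandled type_type: {kind!r}")


cost_components = {
    "name": "cost_components", "nullable": True, "type_type": "List", "type": {},
    "children": [{"name": "item", "nullable": True, "type_type": "FloatingPoint",
                  "type": {"precision": "DOUBLE"}, "children": []}],
}
print(datatype_name(cost_components))  # prints: list_(float64)
```

The point of the sketch is only that the two layouts carry the same information; which spelling is nicer to read and edit is the open question.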



On Thu, May 7, 2020 at 05:46,  wrote:

> Hi Chris,
>
> nice work. I am actually doing the same thing from the Python side and got
> a similar result. Only differences are
>  - marking the JSON structure as a "schema"
>  - using factory function names as "datatype" (see
> https://arrow.apache.org/docs/python/api/datatypes.html)
>  - adding metadata
>
> I would be glad to help bring this nice idea to life. I just
> downloaded your code and started playing with the C++ side to see the
> differences, and have already adopted your "children" idea, as you will
> see. I am looking forward to a fruitful discussion. Here is my Python
> result in JSON:
>
> {
>   "schema": {
>     "fields": [
>       {
>         "name": "name",
>         "datatype": "string",
>         "nullable": false,
>         "metadata": {
>           "m1": "meta 1",
>           "m2": "meta 2",
>           "m3": "meta 3"
>         },
>         "children": []
>       },
>       {
>         "name": "description",
>         "datatype": "string",
>         "nullable": true,
>         "metadata": {
>           "m1": "meta 1",
>           "m2": "meta 2",
>           "m3": "meta 3"
>         },
>         "children": []
>       }
>     ],
>     "metadata": {
>       "m1": "meta 1",
>       "m2": "meta 2",
>       "m3": "meta 3"
>     }
>   }
> }
>
> Cheers,
> Hans
>
> > Sent: Tuesday, May 5, 2020 at 20:28
> > From: "Christian Hudon" 
> > To: "dev@arrow.apache.org" 
> > Subject: Re: Human-readable version of Arrow Schema?
> >
> > Hi folks! I'm back.
> >
> > Yes to François's comments. This has to be something that is readable by
> > data scientists, researchers, etc. without having the doc side-by-side,
> > which is definitely not the case for the C-interface representation.
> >
> > I've created a draft pull request with code that's definitely not ready to
> > be merged, but works well enough to output a Flatbuffers JSON
> > representation of an Arrow schema, so people can see what it would look
> > like, experiment, etc.
> >
> > As an example, the following Arrow schema:
> >
> >   std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
> >       arrow::field("id", arrow::int64()),
> >       arrow::field("cost", arrow::float64()),
> >       arrow::field("cost_components", arrow::list(arrow::float64()))};
> >   auto schema = arrow::Schema(schema_vector);
> >
> > translates to (with some reformatting to make things more compact):
> >
> > {
> >   fields: [
> >     {name: "id", nullable: true, type_type: "Int",
> >      type: {bitWidth: 64, is_signed: true}, children: []},
> >     {name: "cost", nullable: true, type_type: "FloatingPoint",
> >      type: {precision: "DOUBLE"}, children: []},
> >     {name: "cost_components", nullable: true, type_type: "List", type: {},
> >      children: [
> >        {name: "item", nullable: true, type_type: "FloatingPoint",
> >         type: {precision: "DOUBLE"}, children: []}
> >      ]}
> >   ]
> > }
> >
> > 

Re: Re: Human-readable version of Arrow Schema

2020-05-07 Thread hans-joachim . bothe
Hi Chris,

nice work. I am actually doing the same thing from the Python side and got a 
similar result. Only differences are
 - marking the JSON structure as a "schema"
 - using factory function names as "datatype" (see 
https://arrow.apache.org/docs/python/api/datatypes.html)
 - adding metadata

I would be glad to help bring this nice idea to life. I just downloaded
your code and started playing with the C++ side to see the differences, and
have already adopted your "children" idea, as you will see. I am looking
forward to a fruitful discussion. Here is my Python result in JSON:

{
  "schema": {
    "fields": [
      {
        "name": "name",
        "datatype": "string",
        "nullable": false,
        "metadata": {
          "m1": "meta 1",
          "m2": "meta 2",
          "m3": "meta 3"
        },
        "children": []
      },
      {
        "name": "description",
        "datatype": "string",
        "nullable": true,
        "metadata": {
          "m1": "meta 1",
          "m2": "meta 2",
          "m3": "meta 3"
        },
        "children": []
      }
    ],
    "metadata": {
      "m1": "meta 1",
      "m2": "meta 2",
      "m3": "meta 3"
    }
  }
}
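
As a rough illustration of how a consumer might read this layout, here is a minimal Python sketch using only the standard library. The function name `describe_fields` is made up for the example, and the per-field metadata is trimmed to one key to keep it short. Since "datatype" holds a pyarrow factory function name, something like `getattr(pyarrow, field["datatype"])()` could presumably turn it back into a type, but that step is an assumption and is left out so the sketch runs without pyarrow.

```python
import json

# Trimmed copy of the JSON above (one metadata key per field, for brevity).
SCHEMA_JSON = """
{
  "schema": {
    "fields": [
      {"name": "name", "datatype": "string", "nullable": false,
       "metadata": {"m1": "meta 1"}, "children": []},
      {"name": "description", "datatype": "string", "nullable": true,
       "metadata": {"m1": "meta 1"}, "children": []}
    ],
    "metadata": {"m1": "meta 1"}
  }
}
"""


def describe_fields(doc):
    """Return one human-readable line per top-level field."""
    lines = []
    for field in doc["schema"]["fields"]:
        null = "nullable" if field["nullable"] else "not null"
        lines.append(f"{field['name']}: {field['datatype']} ({null})")
    return lines


doc = json.loads(SCHEMA_JSON)
for line in describe_fields(doc):
    print(line)
```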

Cheers,
Hans

> Sent: Tuesday, May 5, 2020 at 20:28
> From: "Christian Hudon" 
> To: "dev@arrow.apache.org" 
> Subject: Re: Human-readable version of Arrow Schema?
>
> Hi folks! I'm back.
> 
> Yes to François's comments. This has to be something that is readable by
> data scientists, researchers, etc. without having the doc side-by-side,
> which is definitely not the case for the C-interface representation.
> 
> I've created a draft pull request with code that's definitely not ready to
> be merged, but works well enough to output a Flatbuffers JSON
> representation of an Arrow schema, so people can see what it would look
> like, experiment, etc.
> 
> As an example, the following Arrow schema:
> 
>   std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
>       arrow::field("id", arrow::int64()),
>       arrow::field("cost", arrow::float64()),
>       arrow::field("cost_components", arrow::list(arrow::float64()))};
>   auto schema = arrow::Schema(schema_vector);
> 
> translates to (with some reformatting to make things more compact):
> 
> {
>   fields: [
>     {name: "id", nullable: true, type_type: "Int",
>      type: {bitWidth: 64, is_signed: true}, children: []},
>     {name: "cost", nullable: true, type_type: "FloatingPoint",
>      type: {precision: "DOUBLE"}, children: []},
>     {name: "cost_components", nullable: true, type_type: "List", type: {},
>      children: [
>        {name: "item", nullable: true, type_type: "FloatingPoint",
>         type: {precision: "DOUBLE"}, children: []}
>      ]}
>   ]
> }
> 
> I can definitely see data scientists being able to understand that or make
> small changes without the doc, and even write one from scratch with some
> help from documentation. It could even be made more compact by making a few
> fields optional when empty (children, type).
> 
> If you want to try it out on other schemas, here's the pull request:
> https://github.com/apache/arrow/pull/7110
> 
> Thoughts?
> 
> 
> On Thu, Jan 9, 2020 at 08:47, Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
> 
> > The desired goal for this feature is trivial modifications, e.g.
> > within an editor, by data-scientists and researchers.
> >
> > I'd go for the Flatbuffers JSON representation, as it is stable and
> > has native support in almost any language or editor due to the
> > ubiquity of JSON. The C interface schema string representation is
> > optimized for developers writing

Re: Human-readable version of Arrow Schema?

2020-05-05 Thread Christian Hudon
Hi folks! I'm back.

Yes to François's comments. This has to be something that is readable by
data scientists, researchers, etc. without having the doc side-by-side,
which is definitely not the case for the C-interface representation.

I've created a draft pull request with code that's definitely not ready to
be merged, but works well enough to output a Flatbuffers JSON representation
of an Arrow schema, so people can see what it would look like, experiment,
etc.

As an example, the following Arrow schema:

  std::vector<std::shared_ptr<arrow::Field>> schema_vector = {
      arrow::field("id", arrow::int64()),
      arrow::field("cost", arrow::float64()),
      arrow::field("cost_components", arrow::list(arrow::float64()))};
  auto schema = arrow::Schema(schema_vector);

translates to (with some reformatting to make things more compact):

{
  fields: [
    {name: "id", nullable: true, type_type: "Int",
     type: {bitWidth: 64, is_signed: true}, children: []},
    {name: "cost", nullable: true, type_type: "FloatingPoint",
     type: {precision: "DOUBLE"}, children: []},
    {name: "cost_components", nullable: true, type_type: "List", type: {},
     children: [
       {name: "item", nullable: true, type_type: "FloatingPoint",
        type: {precision: "DOUBLE"}, children: []}
     ]}
  ]
}
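
One practical note on this output: the keys are unquoted, so a strict JSON parser will reject it as-is. Below is a minimal Python sketch (standard library only) that quotes bare keys before parsing. The helper `quote_keys` and its regular expression are assumptions written for this example, not part of Flatbuffers or Arrow, and the approach would mangle string values that happen to contain text like `, word:`. I believe the Flatbuffers text generator also has a `strict_json` option that emits quoted keys directly, which would make this step unnecessary.

```python
import json
import re

# Shortened version of the output above (first two fields only).
RELAXED = """
{
  fields: [
    {name: "id", nullable: true, type_type: "Int",
     type: {bitWidth: 64, is_signed: true}, children: []},
    {name: "cost", nullable: true, type_type: "FloatingPoint",
     type: {precision: "DOUBLE"}, children: []}
  ]
}
"""

# An unquoted key directly follows "{" or "," (plus optional whitespace).
_BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)')


def quote_keys(text):
    """Wrap bare object keys in double quotes so json.loads accepts the text."""
    return _BARE_KEY.sub(r'\1"\2"\3', text)


schema = json.loads(quote_keys(RELAXED))
print([field["name"] for field in schema["fields"]])
```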

I can definitely see data scientists being able to understand that or make
small changes without the doc, and even write one from scratch with some
help from documentation. It could even be made more compact by making a few
fields optional when empty (children, type).
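
To make the "optional when empty" idea concrete, here is a minimal Python sketch (standard library only) that drops empty `children` and `type` entries from a parsed field and restores them on the way back in. The function names `compact` and `expand` are hypothetical; the pull request does not necessarily work this way.

```python
def compact(field):
    """Drop 'children' and 'type' entries that are empty, recursively."""
    out = {}
    for key, value in field.items():
        if key == "children":
            value = [compact(child) for child in value]
        if key in ("children", "type") and not value:
            continue  # omit the empty entry
        out[key] = value
    return out


def expand(field):
    """Inverse of compact(): restore missing 'type' and 'children' entries."""
    full = dict(field)
    full.setdefault("type", {})
    full["children"] = [expand(child) for child in field.get("children", [])]
    return full


cost_components = {
    "name": "cost_components", "nullable": True, "type_type": "List", "type": {},
    "children": [{"name": "item", "nullable": True, "type_type": "FloatingPoint",
                  "type": {"precision": "DOUBLE"}, "children": []}],
}
small = compact(cost_components)
print(small)
```

The round trip `expand(compact(field))` gives back the original field, so the compact form loses no information as long as "missing" is always read as "empty".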

If you want to try it out on other schemas, here's the pull request:
https://github.com/apache/arrow/pull/7110

Thoughts?


On Thu, Jan 9, 2020 at 08:47, Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> The goal of this feature is to allow trivial modifications, e.g.
> within an editor, by data scientists and researchers.
>
> I'd go for the Flatbuffers JSON representation, as it is stable and
> has native support in almost any language or editor thanks to the
> ubiquity of JSON. The C interface schema string representation is
> optimized for developers writing parsers/codecs and looks like
> gibberish to anyone not familiar with Python's struct format strings.
>
> François
>
>
> On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai  wrote:
> >
> > Hello,
> >
> > pg2arrow [*1] has a '--dump' mode that prints out the schema definition
> > of a given Apache Arrow file.
> > Would that work for you?
> >
> > $ ./pg2arrow --dump ~/hoge.arrow
> > [Footer]
> > {Footer: version=V4, schema={Schema: endianness=little,
> > fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> > custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> > children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> > type={Decimal: precision=11, scale=7}, children=[],
> > custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> > children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> > custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> > children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> > type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> > {Field: name="d", nullable=true, type={Utf8},
> > dictionary={DictionaryEncoding: id=0, indexType={Int32},
> > isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> > nullable=true, type={Timestamp: unit=us}, children=[],
> > custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> > children=[], custom_metadata=[]}, {Field: name="random",
> > nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> > custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> > FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> > bodyLength=128}], recordBatches=[{Block: offset=1232,
> > metaDataLength=648 bodyLength=386112}]}
> > [Dictionary Batch 0]
> > {Block: offset=920, metaDataLength=184 bodyLength=128}
> > {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> > length=6, nodes=[{FieldNode: length=6, null_count=0}],
> > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> > {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> > [Record Batch 0]
> > {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> > {Message: version=V4, body={RecordBatch: length=3000,
> > nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> > length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> > {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> > null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> > length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> > {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> > null_count=0}, {FieldNode: length=3000, null_count=0}],
> > buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> > length=12032}, {Buffer: offset=12032, length=384}, 

Re: Human-readable version of Arrow Schema?

2020-01-09 Thread Francois Saint-Jacques
The goal of this feature is to allow trivial modifications, e.g.
within an editor, by data scientists and researchers.

I'd go for the Flatbuffers JSON representation, as it is stable and
has native support in almost any language or editor thanks to the
ubiquity of JSON. The C interface schema string representation is
optimized for developers writing parsers/codecs and looks like
gibberish to anyone not familiar with Python's struct format strings.
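
To show what that looks like in practice, here is a small Python sketch (standard library only) that spells out a few of the format strings defined by the Arrow C data interface. Only a handful of codes are listed, the helper `explain` is hypothetical, and the three example columns are the ones from the C++ schema example elsewhere in this thread.

```python
# A few format strings from the Arrow C data interface (not exhaustive).
FORMAT_NAMES = {
    "i": "int32",
    "l": "int64",
    "f": "float32",
    "g": "float64",
    "u": "utf8",
    "+l": "list",
    "+s": "struct",
}


def explain(fmt):
    """Return a readable name for a format string, or flag it as unknown."""
    return FORMAT_NAMES.get(fmt, f"unknown ({fmt!r})")


columns = [("id", "l"), ("cost", "g"), ("cost_components", "+l")]
for name, fmt in columns:
    print(f"{name}: {fmt!r} -> {explain(fmt)}")
```

A reader has to know that "l" means a 64-bit integer and "g" a double, which is the readability concern raised above.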

François


On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai  wrote:
>
> Hello,
>
> pg2arrow [*1] has a '--dump' mode that prints out the schema definition
> of a given Apache Arrow file.
> Would that work for you?
>
> $ ./pg2arrow --dump ~/hoge.arrow
> [Footer]
> {Footer: version=V4, schema={Schema: endianness=little,
> fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> type={Decimal: precision=11, scale=7}, children=[],
> custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> {Field: name="d", nullable=true, type={Utf8},
> dictionary={DictionaryEncoding: id=0, indexType={Int32},
> isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> nullable=true, type={Timestamp: unit=us}, children=[],
> custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> children=[], custom_metadata=[]}, {Field: name="random",
> nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> bodyLength=128}], recordBatches=[{Block: offset=1232,
> metaDataLength=648 bodyLength=386112}]}
> [Dictionary Batch 0]
> {Block: offset=920, metaDataLength=184 bodyLength=128}
> {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> length=6, nodes=[{FieldNode: length=6, null_count=0}],
> buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> [Record Batch 0]
> {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> {Message: version=V4, body={RecordBatch: length=3000,
> nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> null_count=0}, {FieldNode: length=3000, null_count=0}],
> buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
>
> [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
>
> On Sat, Dec 7, 2019 at 6:26, Christian Hudon wrote:
> >
> > Hi,
> >
> > For the uses I would like to make of Arrow, I would need a human-readable
> > and -writable version of an Arrow Schema, that could be converted to and
> > from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> > see anything to that effect; the closest is the ToString() method
> > on DataType instances, which is meant for debugging only. (I need an
> > expression of an Arrow Schema that people can read, and that can live
> > outside of the code for a particular operation.)
> >
> > Is a text representation of an Arrow Schema something that is being worked
> > on now? If not, would you folks be interested in me putting up an initial
> > proposal for discussion? Any design constraints I should pay attention to,
> > then?
> >
> > Thanks,
> >
> >   Christian
> > --
> >
> >
> > │ Christian Hudon
> >
> > │ Applied Research Scientist
> >
> >Element AI, 6650 Saint-Urbain #500
> >
> >Montréal, QC, H2S 3G9, Canada
> 

Re: Human-readable version of Arrow Schema?

2020-01-08 Thread Kohei KaiGai
Hello,

pg2arrow [*1] has a '--dump' mode that prints out the schema definition
of a given Apache Arrow file.
Would that work for you?

$ ./pg2arrow --dump ~/hoge.arrow
[Footer]
{Footer: version=V4, schema={Schema: endianness=little,
fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
type={Decimal: precision=11, scale=7}, children=[],
custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
children=[{Field: name="x", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
{Field: name="d", nullable=true, type={Utf8},
dictionary={DictionaryEncoding: id=0, indexType={Int32},
isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
nullable=true, type={Timestamp: unit=us}, children=[],
custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
children=[], custom_metadata=[]}, {Field: name="random",
nullable=true, type={Float64}, children=[], custom_metadata=[]}],
custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
bodyLength=128}], recordBatches=[{Block: offset=1232,
metaDataLength=648 bodyLength=386112}]}
[Dictionary Batch 0]
{Block: offset=920, metaDataLength=184 bodyLength=128}
{Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
length=6, nodes=[{FieldNode: length=6, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
{Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
[Record Batch 0]
{Block: offset=1232, metaDataLength=648 bodyLength=386112}
{Message: version=V4, body={RecordBatch: length=3000,
nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
{FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
{FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
null_count=0}, {FieldNode: length=3000, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
offset=12416, length=24000}, {Buffer: offset=36416, length=384},
{Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
offset=97600, length=12032}, {Buffer: offset=109632, length=0},
{Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
offset=217664, length=12032}, {Buffer: offset=229696, length=384},
{Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
offset=266112, length=96000}, {Buffer: offset=362112, length=0},
{Buffer: offset=362112, length=24000}]}, bodyLength=386112}

[*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow

On Sat, Dec 7, 2019 at 6:26, Christian Hudon wrote:
>
> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect; the closest is the ToString() method
> on DataType instances, which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
>   Christian
> --
>
>
> │ Christian Hudon
>
> │ Applied Research Scientist
>
>Element AI, 6650 Saint-Urbain #500
>
>Montréal, QC, H2S 3G9, Canada
>Elementai.com



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


Re: Human-readable version of Arrow Schema?

2020-01-08 Thread Micah Kornfield
The C-interface representation is probably slightly less readable than the
JSON implementation, if I understand the Flatbuffers-to-JSON conversion
properly.  But as Antoine pointed out, it depends on the use case.

FWIW, flatbuffers maintainers indicated forward/backward compatibility is
intended to be preserved in the JSON encoding as well.
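
As a rough picture of what forward/backward compatibility means for a reader of such JSON, here is a minimal Python sketch (standard library only). It ignores keys it does not know (as a newer writer might emit) and fills in defaults for optional keys that are missing (as an older writer might omit). The names `read_field`, `REQUIRED_KEYS` and `OPTIONAL_DEFAULTS`, and the default values themselves, are assumptions for illustration; this is not how the Flatbuffers parser is implemented.

```python
import copy

# Hypothetical key sets and defaults, chosen only for this illustration.
REQUIRED_KEYS = ("name", "type_type")
OPTIONAL_DEFAULTS = {"nullable": False, "type": {}, "children": []}


def read_field(raw):
    """Read a field dict, tolerating unknown keys and missing optional keys."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"field is missing required keys: {missing}")
    field = {key: raw[key] for key in REQUIRED_KEYS}
    for key, default in OPTIONAL_DEFAULTS.items():
        # deepcopy so callers never share the module-level default objects
        field[key] = copy.deepcopy(raw.get(key, default))
    field["children"] = [read_field(child) for child in field["children"]]
    return field


newer = {"name": "id", "type_type": "Int",
         "type": {"bitWidth": 64, "is_signed": True},
         "some_future_key": 1}  # unknown to this reader, silently dropped
print(read_field(newer))
```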

On Sat, Jan 4, 2020 at 2:16 PM Jacques Nadeau  wrote:

> What do people think about using the C interface representation?
>
> On Sun, Dec 29, 2019 at 12:42 PM Micah Kornfield 
> wrote:
>
>> I opened https://github.com/google/flatbuffers/issues/5688 to try to get
>> some clarity.
>>
>> On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney 
>> wrote:
>>
>> > On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield 
>> > wrote:
>> > >>
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >
>> > > Does it pay to follow-up with the flatbuffer project to understand if
>> > the forward/backward compatibility guarantees the flatbuffers provide
>> > extend to their JSON format?
>> >
>> > I spent a few minutes looking at the Flatbuffers codebase and
>> > documentation and did not find anything, so this seems like useful
>> > information to have regardless.
>> >
>> > >
>> > > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney 
>> > wrote:
>> > >>
>> > >> I'd be open to looking at a proposal for a human-readable text
>> > >> representation, but I'm definitely wary about making any kind of
>> > >> cross-version compatibility guarantees (beyond "we will do our
>> best").
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >>
>> > >> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield <
>> emkornfi...@gmail.com>
>> > wrote:
>> > >> >
>> > >> > >
>> > >> > > With these two together, it would seem not too difficult to
>> create
>> > a text
>> > >> > > representation for Arrow schemas that (at some point) has some
>> > >> > > compatibility guarantees, but maybe I'm missing something?
>> > >> >
>> > >> >
>> > >> > I think the main risk is if somehow flatbuffers JSON parsing
>> doesn't
>> > handle
>> > >> > backward compatible changes to the arrow schema message.  Given the
>> > way the
>> > >> > documentation is describing the JSON functionality I think this
>> would
>> > be
>> > >> > considered a bug.
>> > >> >
>> > >> > The one downside to calling the "schema" canonical is the
>> flatbuffers
>> > JSON
>> > >> > functionality only appears to be available in C++ and Java via JNI,
>> > so it
>> > >> > wouldn't have cross language support.  I think this issue is more
>> one
>> > of
>> > >> > semantics though (i.e. does the JSON description become part of the
>> > "Arrow
>> > >> > spec" or does it live as a C++/Python only feature).
>> > >> >
>> > >> > -Micah
>> > >> >
>> > >> >
>> > >> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon <
>> > chr...@elementai.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Micah: I didn't know that Flatbuffers supported serialization
>> > to/from JSON,
>> > >> > > thanks. That seems like a very good start, at least. I'll aim to
>> > create a
>> > >> > > draft pull request that at least wires everything up in Arrow so
>> we
>> > can
>> > >> > > load/save a Schema.fbs instance from/to JSON. At least it'll make
>> > it easier
>> > >> > > for me to see how Arrow schemas would look in JSON with that.
>> > >> > >
>> > >> > > Otherwise, I'm still gathering requirements internally here. For
>> > example,
>> > >> > > one thing that would be nice would be to be able to output a JSON
>> > Schema
>> > >> > > from at least a subset of the Arrow schema. (That way our users
>> > could start
>> > >> > > by passing around JSON with a given schema, and transition pieces
>> > of a
>> > >> > > workflow to Arrow as they're ready.) But that part can also be
>> done
>> > outside
>> > >> > > of the Arrow code, if deemed not relevant to have in the Arrow
>> > codebase
>> > >> > > itself.
>> > >> > >
>> > >> > > One core requirement for us, however, would be eventual
>> > compatibility
>> > >> > > between Arrow versions for a given text representation of a
>> schema.
>> > >> > > Meaning, if you have a text description of a given Arrow schema,
>> > you can
>> > >> > > load it into different versions of Arrow and it creates a valid
>> > Schema
>> > >> > > Flatbuffer description, that Arrow can use. Wes, were you
>> thinking
>> > of that,
>> > >> > > or of something else, when you wrote "only makes sense if it is
>> > offered
>> > >> > > without any backward/forward compatibility guarantees"?
>> > >> > >
>> > >> > > For the now, or me, assuming the JSON serialization done by the
>> > Flatbuffer
>> > >> > > libraries is usable, it seems we have all the pieces to make this
>> > happen:
>> > >> > > 1) The binary Schema.fbs data structures has to be compatible
>> > between
>> > 

Re: Human-readable version of Arrow Schema?

2020-01-04 Thread Antoine Pitrou


On Jan 4, 2020 at 23:17, Jacques Nadeau wrote:
> I guess we'd still need to introduce a way to nest; it only has a type
> representation.

Right.  Before exploring this direction more in depth, I think it would
be useful to know what the intended use case is.  Perhaps the OP
(Christian Hudon) would like to elaborate a bit?

Regards

Antoine.


Re: Human-readable version of Arrow Schema?

2020-01-04 Thread Jacques Nadeau
I guess we'd still need to introduce a way to nest; it only has a type
representation.

On Sat, Jan 4, 2020 at 2:16 PM Jacques Nadeau  wrote:

> What do people think about using the C interface representation?
>
> On Sun, Dec 29, 2019 at 12:42 PM Micah Kornfield 
> wrote:
>
>> I opened https://github.com/google/flatbuffers/issues/5688 to try to get
>> some clarity.
>>
>> On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney 
>> wrote:
>>
>> > On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield 
>> > wrote:
>> > >>
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >
>> > > Does it pay to follow-up with the flatbuffer project to understand if
>> > the forward/backward compatibility guarantees the flatbuffers provide
>> > extend to their JSON format?
>> >
>> > I spent a few minutes looking at the Flatbuffers codebase and
>> > documentation and did not find anything, so this seems like useful
>> > information to have regardless.
>> >
>> > >
>> > > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney 
>> > wrote:
>> > >>
>> > >> I'd be open to looking at a proposal for a human-readable text
>> > >> representation, but I'm definitely wary about making any kind of
>> > >> cross-version compatibility guarantees (beyond "we will do our
>> best").
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >>
>> > >> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield <
>> emkornfi...@gmail.com>
>> > wrote:
>> > >> >
>> > >> > >
>> > >> > > With these two together, it would seem not too difficult to
>> create
>> > a text
>> > >> > > representation for Arrow schemas that (at some point) has some
>> > >> > > compatibility guarantees, but maybe I'm missing something?
>> > >> >
>> > >> >
>> > >> > I think the main risk is if somehow flatbuffers JSON parsing
>> doesn't
>> > handle
>> > >> > backward compatible changes to the arrow schema message.  Given the
>> > way the
>> > >> > documentation is describing the JSON functionality I think this
>> would
>> > be
>> > >> > considered a bug.
>> > >> >
>> > >> > The one downside to calling the "schema" canonical is the
>> flatbuffers
>> > JSON
>> > >> > functionality only appears to be available in C++ and Java via JNI,
>> > so it
>> > >> > wouldn't have cross language support.  I think this issue is more
>> one
>> > of
>> > >> > semantics though (i.e. does the JSON description become part of the
>> > "Arrow
>> > >> > spec" or does it live as a C++/Python only feature).
>> > >> >
>> > >> > -Micah
>> > >> >
>> > >> >
>> > >> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon <
>> > chr...@elementai.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Micah: I didn't know that Flatbuffers supported serialization
>> > to/from JSON,
>> > >> > > thanks. That seems like a very good start, at least. I'll aim to
>> > create a
>> > >> > > draft pull request that at least wires everything up in Arrow so
>> we
>> > can
>> > >> > > load/save a Schema.fbs instance from/to JSON. At least it'll make
>> > it easier
>> > >> > > for me to see how Arrow schemas would look in JSON with that.
>> > >> > >
>> > >> > > Otherwise, I'm still gathering requirements internally here. For
>> > example,
>> > >> > > one thing that would be nice would be to be able to output a JSON
>> > Schema
>> > >> > > from at least a subset of the Arrow schema. (That way our users
>> > could start
>> > >> > > by passing around JSON with a given schema, and transition pieces
>> > of a
>> > >> > > workflow to Arrow as they're ready.) But that part can also be
>> done
>> > outside
>> > >> > > of the Arrow code, if deemed not relevant to have in the Arrow
>> > codebase
>> > >> > > itself.
>> > >> > >
>> > >> > > One core requirement for us, however, would be eventual
>> > compatibility
>> > >> > > between Arrow versions for a given text representation of a
>> schema.
>> > >> > > Meaning, if you have a text description of a given Arrow schema,
>> > you can
>> > >> > > load it into different versions of Arrow and it creates a valid
>> > Schema
>> > >> > > Flatbuffer description, that Arrow can use. Wes, were you
>> thinking
>> > of that,
>> > >> > > or of something else, when you wrote "only makes sense if it is
>> > offered
>> > >> > > without any backward/forward compatibility guarantees"?
>> > >> > >
>> > >> > > For the now, or me, assuming the JSON serialization done by the
>> > Flatbuffer
>> > >> > > libraries is usable, it seems we have all the pieces to make this
>> > happen:
>> > >> > > 1) The binary Schema.fbs data structures has to be compatible
>> > between
>> > >> > > different versions of Arrow, otherwise two processes with
>> different
>> > Arrow
>> > >> > > versions won't be able to interoperate, no?
>> > >> > > 2) The Flatbuffer <-> JSON serialization supplied by the
>> Flatbuffers
>> > >> > > library also 

Re: Human-readable version of Arrow Schema?

2020-01-04 Thread Jacques Nadeau
What do people think about using the C interface representation?

On Sun, Dec 29, 2019 at 12:42 PM Micah Kornfield 
wrote:

> I opened https://github.com/google/flatbuffers/issues/5688 to try to get
> some clarity.
>
> On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney  wrote:
>
> > On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield 
> > wrote:
> > >>
> > >> If we were to make the same kinds of forward/backward compatibility
> > >> guarantees as with Flatbuffers it could create a lot of work for
> > >> maintainers.
> > >
> > > > Would it pay to follow up with the Flatbuffers project to understand
> > > > whether the forward/backward compatibility guarantees that Flatbuffers
> > > > provides extend to its JSON format?
> >
> > I spent a few minutes looking at the Flatbuffers codebase and
> > documentation and did not find anything, so this seems like useful
> > information to have regardless.
> >
> > >
> > > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney 
> > wrote:
> > >>
> > >> I'd be open to looking at a proposal for a human-readable text
> > >> representation, but I'm definitely wary about making any kind of
> > >> cross-version compatibility guarantees (beyond "we will do our best").
> > >> If we were to make the same kinds of forward/backward compatibility
> > >> guarantees as with Flatbuffers it could create a lot of work for
> > >> maintainers.
> > >>
> > >> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield <
> emkornfi...@gmail.com>
> > wrote:
> > >> >
> > >> > >
> > >> > > With these two together, it would seem not too difficult to create
> > a text
> > >> > > representation for Arrow schemas that (at some point) has some
> > >> > > compatibility guarantees, but maybe I'm missing something?
> > >> >
> > >> >
> > >> > I think the main risk is if somehow flatbuffers JSON parsing doesn't
> > handle
> > >> > backward compatible changes to the arrow schema message.  Given the
> > way the
> > >> > documentation is describing the JSON functionality I think this
> would
> > be
> > >> > considered a bug.
> > >> >
> > >> > The one downside to calling the "schema" canonical is the
> flatbuffers
> > JSON
> > >> > functionality only appears to be available in C++ and Java via JNI,
> > so it
> > >> > wouldn't have cross language support.  I think this issue is more
> one
> > of
> > >> > semantics though (i.e. does the JSON description become part of the
> > "Arrow
> > >> > spec" or does it live as a C++/Python only feature).
> > >> >
> > >> > -Micah
> > >> >
> > >> >
> > >> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon <
> > chr...@elementai.com>
> > >> > wrote:
> > >> >
> > >> > > Micah: I didn't know that Flatbuffers supported serialization
> > to/from JSON,
> > >> > > thanks. That seems like a very good start, at least. I'll aim to
> > create a
> > >> > > draft pull request that at least wires everything up in Arrow so
> we
> > can
> > >> > > load/save a Schema.fbs instance from/to JSON. At least it'll make
> > it easier
> > >> > > for me to see how Arrow schemas would look in JSON with that.
> > >> > >
> > >> > > Otherwise, I'm still gathering requirements internally here. For
> > example,
> > >> > > one thing that would be nice would be to be able to output a JSON
> > Schema
> > >> > > from at least a subset of the Arrow schema. (That way our users
> > could start
> > >> > > by passing around JSON with a given schema, and transition pieces
> > of a
> > >> > > workflow to Arrow as they're ready.) But that part can also be
> done
> > outside
> > >> > > of the Arrow code, if deemed not relevant to have in the Arrow
> > codebase
> > >> > > itself.
> > >> > >
> > >> > > One core requirement for us, however, would be eventual
> > compatibility
> > >> > > between Arrow versions for a given text representation of a
> schema.
> > >> > > Meaning, if you have a text description of a given Arrow schema,
> > you can
> > >> > > load it into different versions of Arrow and it creates a valid
> > Schema
> > >> > > Flatbuffer description, that Arrow can use. Wes, were you thinking
> > of that,
> > >> > > or of something else, when you wrote "only makes sense if it is
> > offered
> > >> > > without any backward/forward compatibility guarantees"?
> > >> > >
> > >> > > For now, to me at least, assuming the JSON serialization done by the
> > Flatbuffer
> > >> > > libraries is usable, it seems we have all the pieces to make this
> > happen:
> > >> > > 1) The binary Schema.fbs data structures have to be compatible
> > between
> > >> > > different versions of Arrow, otherwise two processes with
> different
> > Arrow
> > >> > > versions won't be able to interoperate, no?
> > >> > > 2) The Flatbuffer <-> JSON serialization supplied by the
> Flatbuffers
> > >> > > library also has to be compatible between different versions of
> the
> > >> > > Flatbuffers library, since the main use case seems to be storing
> > >> > > Flatbuffers assets into version control. Breaking changes there
> > will also
> > >> > > be painful to their users.
> > >> > >
> > >> > > With 

Re: Human-readable version of Arrow Schema?

2019-12-29 Thread Micah Kornfield
I opened https://github.com/google/flatbuffers/issues/5688 to try to get
some clarity.

On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney  wrote:

> On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield 
> wrote:
> >>
> >> If we were to make the same kinds of forward/backward compatibility
> >> guarantees as with Flatbuffers it could create a lot of work for
> >> maintainers.
> >
> > Would it pay to follow up with the Flatbuffers project to understand
> > whether the forward/backward compatibility guarantees that Flatbuffers
> > provides extend to its JSON format?
>
> I spent a few minutes looking at the Flatbuffers codebase and
> documentation and did not find anything, so this seems like useful
> information to have regardless.
>
> >
> > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney 
> wrote:
> >>
> >> I'd be open to looking at a proposal for a human-readable text
> >> representation, but I'm definitely wary about making any kind of
> >> cross-version compatibility guarantees (beyond "we will do our best").
> >> If we were to make the same kinds of forward/backward compatibility
> >> guarantees as with Flatbuffers it could create a lot of work for
> >> maintainers.
> >>
> >> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield 
> wrote:
> >> >
> >> > >
> >> > > With these two together, it would seem not too difficult to create
> a text
> >> > > representation for Arrow schemas that (at some point) has some
> >> > > compatibility guarantees, but maybe I'm missing something?
> >> >
> >> >
> >> > I think the main risk is if somehow flatbuffers JSON parsing doesn't
> handle
> >> > backward compatible changes to the arrow schema message.  Given the
> way the
> >> > documentation is describing the JSON functionality I think this would
> be
> >> > considered a bug.
> >> >
> >> > The one downside to calling the "schema" canonical is the flatbuffers
> JSON
> >> > functionality only appears to be available in C++ and Java via JNI,
> so it
> >> > wouldn't have cross language support.  I think this issue is more one
> of
> >> > semantics though (i.e. does the JSON description become part of the
> "Arrow
> >> > spec" or does it live as a C++/Python only feature).
> >> >
> >> > -Micah
> >> >
> >> >
> >> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon <
> chr...@elementai.com>
> >> > wrote:
> >> >
> >> > > Micah: I didn't know that Flatbuffers supported serialization
> to/from JSON,
> >> > > thanks. That seems like a very good start, at least. I'll aim to
> create a
> >> > > draft pull request that at least wires everything up in Arrow so we
> can
> >> > > load/save a Schema.fbs instance from/to JSON. At least it'll make
> it easier
> >> > > for me to see how Arrow schemas would look in JSON with that.
> >> > >
> >> > > Otherwise, I'm still gathering requirements internally here. For
> example,
> >> > > one thing that would be nice would be to be able to output a JSON
> Schema
> >> > > from at least a subset of the Arrow schema. (That way our users
> could start
> >> > > by passing around JSON with a given schema, and transition pieces
> of a
> >> > > workflow to Arrow as they're ready.) But that part can also be done
> outside
> >> > > of the Arrow code, if deemed not relevant to have in the Arrow
> codebase
> >> > > itself.
> >> > >
> >> > > One core requirement for us, however, would be eventual
> compatibility
> >> > > between Arrow versions for a given text representation of a schema.
> >> > > Meaning, if you have a text description of a given Arrow schema,
> you can
> >> > > load it into different versions of Arrow and it creates a valid
> Schema
> >> > > Flatbuffer description, that Arrow can use. Wes, were you thinking
> of that,
> >> > > or of something else, when you wrote "only makes sense if it is
> offered
> >> > > without any backward/forward compatibility guarantees"?
> >> > >
> >> > > For now, to me at least, assuming the JSON serialization done by the
> Flatbuffer
> >> > > libraries is usable, it seems we have all the pieces to make this
> happen:
> >> > > 1) The binary Schema.fbs data structures have to be compatible
> between
> >> > > different versions of Arrow, otherwise two processes with different
> Arrow
> >> > > versions won't be able to interoperate, no?
> >> > > 2) The Flatbuffer <-> JSON serialization supplied by the Flatbuffers
> >> > > library also has to be compatible between different versions of the
> >> > > Flatbuffers library, since the main use case seems to be storing
> >> > > Flatbuffers assets into version control. Breaking changes there
> will also
> >> > > be painful to their users.
> >> > >
> >> > > With these two together, it would seem not too difficult to create
> a text
> >> > > representation for Arrow schemas that (at some point) has some
> >> > > compatibility guarantees, but maybe I'm missing something?
> >> > >
> >> > > Thanks,
> >> > >
> >> > >   Christian
> >> > >
> >> > > On Mon, Dec 9, 2019 at 7:00 AM, Wes McKinney wrote:
> >> > >
> >> > > > The only "canonical" representation 

Re: Human-readable version of Arrow Schema?

2019-12-23 Thread Micah Kornfield
>
> If we were to make the same kinds of forward/backward compatibility
> guarantees as with Flatbuffers it could create a lot of work for
> maintainers.

Would it pay to follow up with the Flatbuffers project to understand whether
the forward/backward compatibility guarantees that Flatbuffers provides extend
to its JSON format?

On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney  wrote:

> I'd be open to looking at a proposal for a human-readable text
> representation, but I'm definitely wary about making any kind of
> cross-version compatibility guarantees (beyond "we will do our best").
> If we were to make the same kinds of forward/backward compatibility
> guarantees as with Flatbuffers it could create a lot of work for
> maintainers.
>
> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield 
> wrote:
> >
> > >
> > > With these two together, it would seem not too difficult to create a
> text
> > > representation for Arrow schemas that (at some point) has some
> > > compatibility guarantees, but maybe I'm missing something?
> >
> >
> > I think the main risk is if somehow flatbuffers JSON parsing doesn't
> handle
> > backward compatible changes to the arrow schema message.  Given the way
> the
> > documentation is describing the JSON functionality I think this would be
> > considered a bug.
> >
> > The one downside to calling the "schema" canonical is the flatbuffers
> JSON
> > functionality only appears to be available in C++ and Java via JNI, so it
> > wouldn't have cross language support.  I think this issue is more one of
> > semantics though (i.e. does the JSON description become part of the
> "Arrow
> > spec" or does it live as a C++/Python only feature).
> >
> > -Micah
> >
> >
> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon 
> > wrote:
> >
> > > Micah: I didn't know that Flatbuffers supported serialization to/from
> JSON,
> > > thanks. That seems like a very good start, at least. I'll aim to
> create a
> > > draft pull request that at least wires everything up in Arrow so we can
> > > load/save a Schema.fbs instance from/to JSON. At least it'll make it
> easier
> > > for me to see how Arrow schemas would look in JSON with that.
> > >
> > > Otherwise, I'm still gathering requirements internally here. For
> example,
> > > one thing that would be nice would be to be able to output a JSON
> Schema
> > > from at least a subset of the Arrow schema. (That way our users could
> start
> > > by passing around JSON with a given schema, and transition pieces of a
> > > workflow to Arrow as they're ready.) But that part can also be done
> outside
> > > of the Arrow code, if deemed not relevant to have in the Arrow codebase
> > > itself.
> > >
> > > One core requirement for us, however, would be eventual compatibility
> > > between Arrow versions for a given text representation of a schema.
> > > Meaning, if you have a text description of a given Arrow schema, you
> can
> > > load it into different versions of Arrow and it creates a valid Schema
> > > Flatbuffer description, that Arrow can use. Wes, were you thinking of
> that,
> > > or of something else, when you wrote "only makes sense if it is offered
> > > without any backward/forward compatibility guarantees"?
> > >
> > > > For now, to me at least, assuming the JSON serialization done by the
> Flatbuffer
> > > libraries is usable, it seems we have all the pieces to make this
> happen:
> > > > 1) The binary Schema.fbs data structures have to be compatible between
> > > different versions of Arrow, otherwise two processes with different
> Arrow
> > > versions won't be able to interoperate, no?
> > > 2) The Flatbuffer <-> JSON serialization supplied by the Flatbuffers
> > > library also has to be compatible between different versions of the
> > > Flatbuffers library, since the main use case seems to be storing
> > > Flatbuffers assets into version control. Breaking changes there will
> also
> > > be painful to their users.
> > >
> > > With these two together, it would seem not too difficult to create a
> text
> > > representation for Arrow schemas that (at some point) has some
> > > compatibility guarantees, but maybe I'm missing something?
> > >
> > > Thanks,
> > >
> > >   Christian
> > >
> > > > On Mon, Dec 9, 2019 at 7:00 AM, Wes McKinney wrote:
> > >
> > > > The only "canonical" representation of schemas at the moment is the
> > > > Flatbuffers data structure [1]
> > > >
> > > > Having a human-readable/parseable text representation I think only
> > > > makes sense if it is offered without any backward/forward
> > > > compatibility guarantees.
> > > >
> > > > Note I had previously opened
> > > > https://issues.apache.org/jira/browse/ARROW-3730 where I noted that
> > > > there's no way (aside from generating the Flatbuffers messages) to
> > > > generate a schema representation that can be used later to
> reconstruct
> > > > a schema in a program. If such a representation were human
> > > > readable/editable that seems beneficial.
> > > >
> > > >
> > > >
> > > > [1]: 

Re: Human-readable version of Arrow Schema?

2019-12-11 Thread Micah Kornfield
>
> With these two together, it would seem not too difficult to create a text
> representation for Arrow schemas that (at some point) has some
> compatibility guarantees, but maybe I'm missing something?


I think the main risk is that the Flatbuffers JSON parser somehow doesn't
handle backward-compatible changes to the Arrow schema message.  Given the way
the documentation describes the JSON functionality, I think this would be
considered a bug.
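
To make that risk concrete, here is a minimal sketch of what "handling
backward-compatible changes" could mean for a JSON reader: keys missing from an
older document fall back to defaults, and keys added by a newer writer are
ignored rather than rejected. The field layout and names (`FIELD_DEFAULTS`,
`parse_field`, `parse_schema`) are assumptions made for illustration; this is
plain Python, not the Flatbuffers JSON parser or any Arrow API.

```python
import json

# Hypothetical field layout; these key names are assumptions for this sketch,
# not the format produced by Flatbuffers or Arrow.
FIELD_DEFAULTS = {"nullable": True, "metadata": {}, "children": []}


def parse_field(raw):
    """Parse one field, tolerating keys omitted or added by other versions."""
    if "name" not in raw or "type" not in raw:
        raise ValueError("a field needs at least 'name' and 'type'")
    field = {"name": raw["name"], "type": raw["type"]}
    # Keys missing from an older document fall back to their defaults.
    field["nullable"] = raw.get("nullable", FIELD_DEFAULTS["nullable"])
    field["metadata"] = dict(raw.get("metadata", FIELD_DEFAULTS["metadata"]))
    field["children"] = [
        parse_field(child)
        for child in raw.get("children", FIELD_DEFAULTS["children"])
    ]
    # Keys this reader does not know about are dropped instead of rejected.
    return field


def parse_schema(text):
    """Parse a JSON document of the form {"fields": [...]}."""
    doc = json.loads(text)
    return {"fields": [parse_field(f) for f in doc.get("fields", [])]}
```

A strict parser that raised on unknown keys would turn every additive schema
change into a breaking change for the text format, which is the behavior worth
confirming with the Flatbuffers project.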

The one downside to calling the "schema" canonical is that the Flatbuffers JSON
functionality only appears to be available in C++ and Java via JNI, so it
wouldn't have cross-language support.  I think this issue is more one of
semantics though (i.e. does the JSON description become part of the "Arrow
spec" or does it live as a C++/Python-only feature).

-Micah


On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon 
wrote:

> Micah: I didn't know that Flatbuffers supported serialization to/from JSON,
> thanks. That seems like a very good start, at least. I'll aim to create a
> draft pull request that at least wires everything up in Arrow so we can
> load/save a Schema.fbs instance from/to JSON. At least it'll make it easier
> for me to see how Arrow schemas would look in JSON with that.
>
> Otherwise, I'm still gathering requirements internally here. For example,
> one thing that would be nice would be to be able to output a JSON Schema
> from at least a subset of the Arrow schema. (That way our users could start
> by passing around JSON with a given schema, and transition pieces of a
> workflow to Arrow as they're ready.) But that part can also be done outside
> of the Arrow code, if deemed not relevant to have in the Arrow codebase
> itself.
>
> One core requirement for us, however, would be eventual compatibility
> between Arrow versions for a given text representation of a schema.
> Meaning, if you have a text description of a given Arrow schema, you can
> load it into different versions of Arrow and it creates a valid Schema
> Flatbuffer description, that Arrow can use. Wes, were you thinking of that,
> or of something else, when you wrote "only makes sense if it is offered
> without any backward/forward compatibility guarantees"?
>
> For now, to me at least, assuming the JSON serialization done by the Flatbuffer
> libraries is usable, it seems we have all the pieces to make this happen:
> 1) The binary Schema.fbs data structures have to be compatible between
> different versions of Arrow, otherwise two processes with different Arrow
> versions won't be able to interoperate, no?
> 2) The Flatbuffer <-> JSON serialization supplied by the Flatbuffers
> library also has to be compatible between different versions of the
> Flatbuffers library, since the main use case seems to be storing
> Flatbuffers assets into version control. Breaking changes there will also
> be painful to their users.
>
> With these two together, it would seem not too difficult to create a text
> representation for Arrow schemas that (at some point) has some
> compatibility guarantees, but maybe I'm missing something?
>
> Thanks,
>
>   Christian
>
> On Mon, Dec 9, 2019 at 7:00 AM, Wes McKinney wrote:
>
> > The only "canonical" representation of schemas at the moment is the
> > Flatbuffers data structure [1]
> >
> > Having a human-readable/parseable text representation I think only
> > makes sense if it is offered without any backward/forward
> > compatibility guarantees.
> >
> > Note I had previously opened
> > https://issues.apache.org/jira/browse/ARROW-3730 where I noted that
> > there's no way (aside from generating the Flatbuffers messages) to
> > generate a schema representation that can be used later to reconstruct
> > a schema in a program. If such a representation were human
> > readable/editable that seems beneficial.
> >
> >
> >
> > [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs
> >
> > On Sat, Dec 7, 2019 at 11:56 AM Maarten Ballintijn 
> > wrote:
> > >
> > >
> > > Is there a syntax specified for schemas?
> > >
> > > Cheers,
> > > Maarten.
> > >
> > >
> > > > On Dec 6, 2019, at 5:01 PM, Micah Kornfield 
> > wrote:
> > > >
> > > > Hi Christian,
> > > > As far as I know no-one is working on a canonical text representation
> > for
> > > > schemas.  A JSON serializer exists for integration test purposes, but
> > > > IMO it shouldn't be relied upon as canonical.
> > > >
> > > > > It looks like Flatbuffers supports serialization to/from JSON [1],
> > > > using that functionality might be a promising avenue to pursue for a
> > human
> > > > readable schema. I could see adding a helper method someplace under
> > IPC for
> > > > this.  Would that meet your needs?  I think if there are other
> > > > requirements, then a proposal would be welcome.  Ideally, a solution
> > would
> > > > not require additional build/runtime dependencies.
> > > >
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] 

Re: Human-readable version of Arrow Schema?

2019-12-10 Thread Christian Hudon
Micah: I didn't know that Flatbuffers supported serialization to/from JSON,
thanks. That seems like a very good start. I'll aim to create a draft pull
request that wires everything up in Arrow so we can load/save a Schema.fbs
instance from/to JSON. At the very least, that will make it easier for me to
see how Arrow schemas would look in JSON.

Otherwise, I'm still gathering requirements internally here. For example,
one thing that would be nice would be to be able to output a JSON Schema
from at least a subset of the Arrow schema. (That way our users could start
by passing around JSON with a given schema, and transition pieces of a
workflow to Arrow as they're ready.) But that part can also be done outside
of the Arrow code, if deemed not relevant to have in the Arrow codebase
itself.
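
As a rough illustration of the JSON Schema idea, the sketch below maps a small
subset of Arrow type names to JSON Schema types. Everything in it is an
assumption made for the example: the `(name, type, nullable)` tuples stand in
for a real Arrow Schema object, and the `ARROW_TO_JSON_TYPE` mapping and
`to_json_schema` helper are hypothetical, not part of Arrow.

```python
# Assumed mapping from a few Arrow type names to JSON Schema types.
ARROW_TO_JSON_TYPE = {
    "bool": "boolean",
    "int32": "integer",
    "int64": "integer",
    "float64": "number",
    "utf8": "string",
}


def to_json_schema(fields):
    """Build a JSON Schema object describing one row of the given fields.

    `fields` is a list of (name, arrow_type_name, nullable) tuples, a
    hypothetical stand-in for a real Arrow Schema.
    """
    properties = {}
    for name, arrow_type, nullable in fields:
        if arrow_type not in ARROW_TO_JSON_TYPE:
            raise ValueError(f"type {arrow_type!r} is outside the supported subset")
        json_type = ARROW_TO_JSON_TYPE[arrow_type]
        # A nullable field accepts either its own type or JSON null.
        properties[name] = {"type": [json_type, "null"] if nullable else json_type}
    return {
        "type": "object",
        "properties": properties,
        # Every key must be present; nullability is expressed through "null".
        "required": [name for name, _, _ in fields],
    }
```

Calling `to_json_schema([("name", "utf8", False), ("age", "int32", True)])`
gives a dictionary that can be written out with `json.dumps`. Nested, temporal
and dictionary-encoded types are deliberately left out, which is why only a
subset of the Arrow schema would be covered.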

One core requirement for us, however, would be eventual compatibility
between Arrow versions for a given text representation of a schema.
Meaning, if you have a text description of a given Arrow schema, you can
load it into different versions of Arrow and it creates a valid Schema
Flatbuffer description, that Arrow can use. Wes, were you thinking of that,
or of something else, when you wrote "only makes sense if it is offered
without any backward/forward compatibility guarantees"?

For now, to me at least, assuming the JSON serialization done by the Flatbuffer
libraries is usable, it seems we have all the pieces to make this happen:
1) The binary Schema.fbs data structures have to be compatible between
different versions of Arrow, otherwise two processes with different Arrow
versions won't be able to interoperate, no?
2) The Flatbuffer <-> JSON serialization supplied by the Flatbuffers
library also has to be compatible between different versions of the
Flatbuffers library, since the main use case seems to be storing
Flatbuffers assets into version control. Breaking changes there will also
be painful to their users.

With these two together, it would seem not too difficult to create a text
representation for Arrow schemas that (at some point) has some
compatibility guarantees, but maybe I'm missing something?

Thanks,

  Christian

On Mon, Dec 9, 2019 at 7:00 AM, Wes McKinney wrote:

> The only "canonical" representation of schemas at the moment is the
> Flatbuffers data structure [1]
>
> Having a human-readable/parseable text representation I think only
> makes sense if it is offered without any backward/forward
> compatibility guarantees.
>
> Note I had previously opened
> https://issues.apache.org/jira/browse/ARROW-3730 where I noted that
> there's no way (aside from generating the Flatbuffers messages) to
> generate a schema representation that can be used later to reconstruct
> a schema in a program. If such a representation were human
> readable/editable that seems beneficial.
>
>
>
> [1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs
>
> On Sat, Dec 7, 2019 at 11:56 AM Maarten Ballintijn 
> wrote:
> >
> >
> > Is there a syntax specified for schemas?
> >
> > Cheers,
> > Maarten.
> >
> >
> > > On Dec 6, 2019, at 5:01 PM, Micah Kornfield 
> wrote:
> > >
> > > Hi Christian,
> > > As far as I know no-one is working on a canonical text representation
> for
> > > schemas.  A JSON serializer exists for integration test purposes, but
> > > IMO it shouldn't be relied upon as canonical.
> > >
> > > > It looks like Flatbuffers supports serialization to/from JSON [1],
> > > using that functionality might be a promising avenue to pursue for a
> human
> > > readable schema. I could see adding a helper method someplace under
> IPC for
> > > this.  Would that meet your needs?  I think if there are other
> > > requirements, then a proposal would be welcome.  Ideally, a solution
> would
> > > not require additional build/runtime dependencies.
> > >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] See Text & schema parsing
> > > https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html
> > >
> > > On Fri, Dec 6, 2019 at 1:26 PM Christian Hudon 
> wrote:
> > >
> > >> Hi,
> > >>
> > >> For the uses I would like to make of Arrow, I would need a
> human-readable
> > >> and -writable version of an Arrow Schema, that could be converted to
> and
> > >> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I
> don't
> > >> see anything to that effect, with the closest being the ToString()
> method
> > >> on DataType instances, but which is meant for debugging only. (I need
> an
> > >> expression of an Arrow Schema that people can read, and that can live
> > >> outside of the code for a particular operation.)
> > >>
> > >> Is a text representation of an Arrow Schema something that is being
> worked
> > >> on now? If not, would you folks be interested in me putting up an
> initial
> > >> proposal for discussion? Any design constraints I should pay
> attention to,
> > >> then?
> > >>
> > >> Thanks,
> > >>
> > >>  

Re: Human-readable version of Arrow Schema?

2019-12-09 Thread Wes McKinney
The only "canonical" representation of schemas at the moment is the
Flatbuffers data structure [1].

Having a human-readable/parseable text representation I think only
makes sense if it is offered without any backward/forward
compatibility guarantees.

Note I had previously opened
https://issues.apache.org/jira/browse/ARROW-3730 where I noted that
there's no way (aside from generating the Flatbuffers messages) to
generate a schema representation that can be used later to reconstruct
a schema in a program. If such a representation were human
readable/editable that seems beneficial.
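
For illustration only, this is the kind of round trip I have in mind: a schema
is written out as text, possibly edited by hand, and later read back to
reconstruct an equal schema. The `Field` and `Schema` classes below are
hypothetical stand-ins written for this sketch, not Arrow classes, and JSON is
just one possible choice of text format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for an Arrow field, not an Arrow class.
    name: str
    type: str
    nullable: bool = True


@dataclass(frozen=True)
class Schema:
    # Hypothetical stand-in for an Arrow schema: an ordered tuple of fields.
    fields: tuple = ()


def schema_to_text(schema):
    """Serialize a schema to indented JSON that a person can read and edit."""
    return json.dumps(asdict(schema), indent=2, sort_keys=True)


def schema_from_text(text):
    """Rebuild a Schema from the text produced by schema_to_text."""
    doc = json.loads(text)
    return Schema(fields=tuple(Field(**f) for f in doc["fields"]))
```

The property that matters is that `schema_from_text(schema_to_text(s)) == s`
holds for any schema `s` the format can express.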



[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs

On Sat, Dec 7, 2019 at 11:56 AM Maarten Ballintijn  wrote:
>
>
> Is there a syntax specified for schemas?
>
> Cheers,
> Maarten.
>
>
> > On Dec 6, 2019, at 5:01 PM, Micah Kornfield  wrote:
> >
> > Hi Christian,
> > As far as I know no-one is working on a canonical text representation for
> > schemas.  A JSON serializer exists for integration test purposes, but
> > IMO it shouldn't be relied upon as canonical.
> >
> > It looks like Flatbuffers supports serialization to/from JSON [1],
> > using that functionality might be a promising avenue to pursue for a human
> > readable schema. I could see adding a helper method someplace under IPC for
> > this.  Would that meet your needs?  I think if there are other
> > requirements, then a proposal would be welcome.  Ideally, a solution would
> > not require additional build/runtime dependencies.
> >
> >
> > Thanks,
> > Micah
> >
> > [1] See Text & schema parsing
> > https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html
> >
> > On Fri, Dec 6, 2019 at 1:26 PM Christian Hudon  wrote:
> >
> >> Hi,
> >>
> >> For the uses I would like to make of Arrow, I would need a human-readable
> >> and -writable version of an Arrow Schema, that could be converted to and
> >> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> >> see anything to that effect, with the closest being the ToString() method
> >> on DataType instances, but which is meant for debugging only. (I need an
> >> expression of an Arrow Schema that people can read, and that can live
> >> outside of the code for a particular operation.)
> >>
> >> Is a text representation of an Arrow Schema something that is being worked
> >> on now? If not, would you folks be interested in me putting up an initial
> >> proposal for discussion? Any design constraints I should pay attention to,
> >> then?
> >>
> >> Thanks,
> >>
> >>  Christian
> >> --
> >>
> >>
> >> │ Christian Hudon
> >>
> >> │ Applied Research Scientist
> >>
> >>   Element AI, 6650 Saint-Urbain #500
> >>
> >>   Montréal, QC, H2S 3G9, Canada
> >>   Elementai.com
> >>
>


Re: Human-readable version of Arrow Schema?

2019-12-07 Thread Maarten Ballintijn


Is there a syntax specified for schemas?

Cheers,
Maarten.


> On Dec 6, 2019, at 5:01 PM, Micah Kornfield  wrote:
> 
> Hi Christian,
> As far as I know no-one is working on a canonical text representation for
> schemas.  A JSON serializer exists for integration test purposes, but
> IMO it shouldn't be relied upon as canonical.
> 
> It looks like Flatbuffers supports serialization to/from JSON [1],
> using that functionality might be a promising avenue to pursue for a human
> readable schema. I could see adding a helper method someplace under IPC for
> this.  Would that meet your needs?  I think if there are other
> requirements, then a proposal would be welcome.  Ideally, a solution would
> not require additional build/runtime dependencies.
> 
> 
> Thanks,
> Micah
> 
> [1] See Text & schema parsing
> https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html
> 
> On Fri, Dec 6, 2019 at 1:26 PM Christian Hudon  wrote:
> 
>> Hi,
>> 
>> For the uses I would like to make of Arrow, I would need a human-readable
>> and -writable version of an Arrow Schema, that could be converted to and
>> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
>> see anything to that effect, with the closest being the ToString() method
>> on DataType instances, but which is meant for debugging only. (I need an
>> expression of an Arrow Schema that people can read, and that can live
>> outside of the code for a particular operation.)
>> 
>> Is a text representation of an Arrow Schema something that is being worked
>> on now? If not, would you folks be interested in me putting up an initial
>> proposal for discussion? Any design constraints I should pay attention to,
>> then?
>> 
>> Thanks,
>> 
>>  Christian
>> --
>> 
>> 
>> │ Christian Hudon
>> 
>> │ Applied Research Scientist
>> 
>>   Element AI, 6650 Saint-Urbain #500
>> 
>>   Montréal, QC, H2S 3G9, Canada
>>   Elementai.com
>> 



Re: Human-readable version of Arrow Schema?

2019-12-06 Thread Micah Kornfield
Hi Christian,
As far as I know no-one is working on a canonical text representation for
schemas.  A JSON serializer exists for integration test purposes, but
IMO it shouldn't be relied upon as canonical.

It looks like Flatbuffers supports serialization to/from JSON [1],
using that functionality might be a promising avenue to pursue for a human
readable schema. I could see adding a helper method someplace under IPC for
this.  Would that meet your needs?  I think if there are other
requirements, then a proposal would be welcome.  Ideally, a solution would
not require additional build/runtime dependencies.


Thanks,
Micah

[1] See Text & schema parsing
https://google.github.io/flatbuffers/flatbuffers_guide_use_cpp.html

On Fri, Dec 6, 2019 at 1:26 PM Christian Hudon  wrote:

> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
>   Christian
> --
>
>
> │ Christian Hudon
>
> │ Applied Research Scientist
>
>Element AI, 6650 Saint-Urbain #500
>
>Montréal, QC, H2S 3G9, Canada
>Elementai.com
>


Human-readable version of Arrow Schema?

2019-12-06 Thread Christian Hudon
Hi,

For the uses I would like to make of Arrow, I would need a human-readable
and -writable version of an Arrow Schema, that could be converted to and
from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
see anything to that effect; the closest is the ToString() method on
DataType instances, but that is meant for debugging only. (I need an
expression of an Arrow Schema that people can read, and that can live
outside of the code for a particular operation.)

Is a text representation of an Arrow Schema something that is being worked
on now? If not, would you folks be interested in me putting up an initial
proposal for discussion? Any design constraints I should pay attention to,
then?

Thanks,

  Christian
-- 


│ Christian Hudon

│ Applied Research Scientist

   Element AI, 6650 Saint-Urbain #500

   Montréal, QC, H2S 3G9, Canada
   Elementai.com