Re: Human-readable version of Arrow Schema?

Francois Saint-Jacques Thu, 09 Jan 2020 05:48:09 -0800

The goal of this feature is to allow trivial modifications, e.g.
within an editor, by data scientists and researchers.


I'd go for the Flatbuffers JSON representation, as it is stable and
has native support in almost any language or editor thanks to the
ubiquity of JSON. The C interface schema string representation is
optimized for developers writing parsers/codecs and looks like
gibberish to anyone not familiar with Python's struct format strings.
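
To make the contrast concrete, here is a minimal sketch in Python. The
JSON layout is an assumption for illustration only: its field names
(name, nullable, type, children) mirror the tables in Arrow's
Schema.fbs, but it is not claimed to be the exact output of any Arrow
tool. The helper c_format() is hypothetical and covers only a handful
of types; the format strings it returns follow the Arrow C data
interface.

```python
import json

# Assumed JSON rendering of a schema; keys mirror Arrow's Schema.fbs
# (illustrative only, not the verified output of any Arrow tool).
SCHEMA_JSON = """
{
  "fields": [
    {"name": "id", "nullable": true,
     "type": {"name": "int", "bitWidth": 32, "isSigned": true},
     "children": []},
    {"name": "a", "nullable": true,
     "type": {"name": "floatingpoint", "precision": "DOUBLE"},
     "children": []},
    {"name": "b", "nullable": true,
     "type": {"name": "decimal", "precision": 11, "scale": 7},
     "children": []},
    {"name": "e", "nullable": true,
     "type": {"name": "timestamp", "unit": "MICROSECOND"},
     "children": []}
  ]
}
"""

TIME_UNITS = {"SECOND": "s", "MILLISECOND": "m",
              "MICROSECOND": "u", "NANOSECOND": "n"}


def c_format(type_):
    """Hypothetical helper: map a JSON type object to the matching
    Arrow C data interface format string (small subset of types)."""
    name = type_["name"]
    if name == "int" and type_["bitWidth"] == 32 and type_["isSigned"]:
        return "i"
    if name == "floatingpoint":
        return {"SINGLE": "f", "DOUBLE": "g"}[type_["precision"]]
    if name == "decimal":
        return "d:{},{}".format(type_["precision"], type_["scale"])
    if name == "timestamp":
        return "ts{}:{}".format(TIME_UNITS[type_["unit"]],
                                type_.get("timezone", ""))
    raise ValueError("type not covered by this sketch: " + name)


schema = json.loads(SCHEMA_JSON)

# The compact strings a codec author would work with.
formats = [c_format(field["type"]) for field in schema["fields"]]

# A "trivial modification" in the JSON form: rename a column and
# write the schema back out as text.
schema["fields"][1]["name"] = "amount"
edited = json.dumps(schema, indent=2)

for field, fmt in zip(schema["fields"], formats):
    print(field["name"], "->", fmt)
```

The JSON form spells out every attribute by name, so renaming a column
or changing a decimal scale is an ordinary text edit, whereas the
format strings pack the same information into a few characters.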

François


On Wed, Jan 8, 2020 at 8:50 PM Kohei KaiGai <kai...@heterodb.com> wrote:
>
> Hello,
>
> pg2arrow [*1] has a '--dump' mode that prints out the schema definition
> of the given Apache Arrow file.
> Would that work for you?
>
> $ ./pg2arrow --dump ~/hoge.arrow
> [Footer]
> {Footer: version=V4, schema={Schema: endianness=little,
> fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
> custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
> children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
> type={Decimal: precision=11, scale=7}, children=[],
> custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
> children=[{Field: name="x", nullable=true, type={Int32}, children=[],
> custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
> children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
> type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
> {Field: name="d", nullable=true, type={Utf8},
> dictionary={DictionaryEncoding: id=0, indexType={Int32},
> isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
> nullable=true, type={Timestamp: unit=us}, children=[],
> custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
> children=[], custom_metadata=[]}, {Field: name="random",
> nullable=true, type={Float64}, children=[], custom_metadata=[]}],
> custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
> FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
> bodyLength=128}], recordBatches=[{Block: offset=1232,
> metaDataLength=648 bodyLength=386112}]}
> [Dictionary Batch 0]
> {Block: offset=920, metaDataLength=184 bodyLength=128}
> {Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
> length=6, nodes=[{FieldNode: length=6, null_count=0}],
> buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
> {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
> [Record Batch 0]
> {Block: offset=1232, metaDataLength=648 bodyLength=386112}
> {Message: version=V4, body={RecordBatch: length=3000,
> nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
> length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
> {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
> null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
> length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
> {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
> null_count=0}, {FieldNode: length=3000, null_count=0}],
> buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
> length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
> offset=12416, length=24000}, {Buffer: offset=36416, length=384},
> {Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
> length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
> length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
> offset=97600, length=12032}, {Buffer: offset=109632, length=0},
> {Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
> length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
> offset=217664, length=12032}, {Buffer: offset=229696, length=384},
> {Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
> length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
> offset=266112, length=96000}, {Buffer: offset=362112, length=0},
> {Buffer: offset=362112, length=24000}]}, bodyLength=386112}
>
> [*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow
>
> On Sat, Dec 7, 2019 at 6:26 Christian Hudon <chr...@elementai.com> wrote:
> >
> > Hi,
> >
> > For the uses I would like to make of Arrow, I would need a human-readable
> > and -writable version of an Arrow Schema that could be converted to and
> > from the Arrow Schema C++ object. Going through the docs for 0.15.1, I don't
> > see anything to that effect. The closest is the ToString() method
> > on DataType instances, which is meant for debugging only. (I need an
> > expression of an Arrow Schema that people can read, and that can live
> > outside of the code for a particular operation.)
> >
> > Is a text representation of an Arrow Schema something that is being worked
> > on now? If not, would you folks be interested in me putting up an initial
> > proposal for discussion? Any design constraints I should pay attention to,
> > then?
> >
> > Thanks,
> >
> >   Christian
> > --
> >
> >
> > │ Christian Hudon
> >
> > │ Applied Research Scientist
> >
> >    Element AI, 6650 Saint-Urbain #500
> >
> >    Montréal, QC, H2S 3G9, Canada
> >    Elementai.com
>
>
>
> --
> HeteroDB, Inc / The PG-Strom Project
> KaiGai Kohei <kai...@heterodb.com>
