westonpace commented on issue #34898:
URL: https://github.com/apache/arrow/issues/34898#issuecomment-1503741977

   The Arrow IPC format defines how to serialize a schema.  You can use an 
empty IPC file to represent a schema in a portable way:
   
   ```
   >>> import pyarrow as pa
   >>> import pyarrow.ipc
   >>> my_schema = pa.schema([pa.field("x", pa.timestamp("ms")), pa.field("y", pa.int32())])
   >>> table = pa.Table.from_batches([], schema=my_schema)
   >>> with pyarrow.ipc.RecordBatchFileWriter("/tmp/schema.arrow", schema=my_schema) as writer:
   ...   writer.write_table(table)
   ... 
   >>> with pyarrow.ipc.RecordBatchFileReader("/tmp/schema.arrow") as reader:
   ...   new_schema = reader.read_all().schema
   ... 
   >>> new_schema
   x: timestamp[ms]
   y: int32
   ```
   
   The resulting file is binary, so it is not human editable.  I agree that having a human editable format can be useful.  As @danepitkin pointed out, there are maintenance concerns.
   
   One close solution could be to eventually adopt the Substrait text format, though it has some gaps:
    * There is no top-level message for "just a schema", so you'd have to embed it in a dummy plan
    * The text format is not yet ready and there are no Python bindings (it's in progress but still a few months out, I'd guess)
    * The type systems aren't exactly the same (e.g. "unsigned integer types" are a user-defined type)
   
   A schema in that text format would look something like this:
   
   ```
   schema my_schema {
     r_regionkey i32;
     r_name string;
     r_comment string;
   }
   ```

