Hi everyone,

As part of creating the new arrow-dotnet repository, the contents of the
format directory from the main arrow repository had to be copied [1]. This
directory contains language-agnostic FlatBuffers and protobuf definitions for the
Arrow IPC and Flight formats that can be used to generate code. Both the
arrow-rs [2] and arrow-java [3] repositories also contain copies of these
files that have to be manually updated when there are format changes.

It appears that other implementations check in generated code rather than
generating it at build time, so they don't need to store the original
definitions. At least arrow-go [4] and arrow-swift [5] do this; I haven't
looked closely at all implementations.

I wonder whether it would simplify maintenance if there were a shared
arrow-format repository to store these files, which could be included as a
git submodule in other repositories, similar to how the arrow-testing and
parquet-testing repositories are used. This would make it easy to see
whether the format files are up to date, and prevent potential divergence
between implementations.

On the other hand, these format files aren't updated frequently, and git
submodules add extra developer friction. For example, they aren't checked
out by default when cloning, and changes that cross repository boundaries
require extra coordination.

What do people think of this idea? Would it be worth setting up a new
arrow-format repository?

Thanks,
Adam

[1]: https://github.com/apache/arrow-dotnet/pull/17
[2]: https://github.com/apache/arrow-rs/tree/main/format
[3]: https://github.com/apache/arrow-java/tree/main/arrow-format
[4]: https://github.com/apache/arrow-go/blob/a661aa4711c27a065907512c69bf2e9d3454b936/arrow/internal/flatbuf/Schema.go#L17
[5]: https://github.com/apache/arrow-swift/blob/99275981ac54ab25a9f308f6182acf571385bda6/Arrow/Sources/Arrow/Schema_generated.swift#L18
