Hi all,

I'd like to bring up an idea from a recent thread ([1]) about moving the `format/` directory out of the primary apache/arrow repository.

I understand from that thread that there are some concerns about using submodules, and I definitely sympathize with those concerns. In talking with David Li (disclaimer: we work together at Voltron Data), he came up with a great idea that I think makes everyone happy: an `apache/arrow-format` repository that is the official mirror for the flatbuffers IDL, and that library authors should use as the source of truth.

It doesn't require a submodule, yet it lets external projects access the IDL without having to interact with the main arrow repository, and it is backwards compatible to boot. In this scenario, repositories that currently copy in the flatbuffers IDL can migrate to this repository at their leisure.

My motivation for this is sharing data structures for the compute IR proposal ([2]). I can think of at least two ways for IR producers and consumers of all languages to share the flatbuffers IDL:

1. A set of bindings built in some language that other languages can integrate with, likely C++, that allows library users to build IR using an API.

   The primary downside is that we'd have to deal with building another library while working out any kinks in the IR design, and I'd rather avoid that in the initial phases of this project. The benefit is that IR components don't interact much with `flatbuffers` or `flatc` directly.

2. A single location where the format lives, one that doesn't require depending on a large multi-language repository to access a handful of files.

   I think the downside is that a bit of additional infrastructure is needed to automate copying the IDL into `arrow-format`. The benefit is that producers and consumers can immediately start getting value from compute IR without having to wait for development of a new API.

One counter-proposal might be to just put the compute IR IDL in a separate repo, but that isn't tenable because the compute IR needs Arrow's type information contained in `Schema.fbs`.

I was hoping to avoid conflating the discussion about bindings vs. direct flatbuffers usage (I predict we'll ultimately need both, even if we initially support just one) with the decision about whether to split out the format directory, but it's a good example of a choice that splitting out the format directory would serve well.

I'll note that this doesn't block anything on the compute IR side; I just wanted to surface this in a parallel thread and see what folks think.

[1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
[2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l