Hi,

Thanks for your proposal!

In <camexywdtj_k9_xgycdcprbpufdj5xbsmb4fnmykobeg+6oa...@mail.gmail.com>
  "[DISCUSS][Erlang] Erlang Apache Arrow Implementation" on Fri, 8 Aug 2025 21:41:54 +0530,
  Benjamin Philip <benjamin.philip...@gmail.com> wrote:

> I am working on an Erlang implementation for Apache Arrow, and I am
> interested in submitting it to the Apache Foundation as an official
> implementation for Erlang and Elixir, once it is ready.

If we develop it outside the Apache Software Foundation
(https://github.com/apache/), we need to donate it to the Apache
Software Foundation. See also
https://incubator.apache.org/ip-clearance/ and David's reply.

In the donation process, all copyright holders must sign a
contributor license agreement. See also:
https://www.apache.org/licenses/contributor-agreements.html

If we create a new repository such as
https://github.com/apache/arrow-erlang and develop in it from
scratch, we don't need the donation process.

> Initial work[4] was started 2 years ago for compliance with some new
> OpenTelemetry specifications. However, my focus so far has only been
> (de)serialization and not operating on/manipulating Arrow Arrays since that
> was the only requirement in OpenTelemetry.
>
> The trouble with Erlang, is that natively producing and decoding binaries
> in pure Erlang is more effective than through a C FFI. This has also been
> the case with plaintext formats like JSON and XML, and with parsing markup
> like HTML and Markdown. This has meant that we've had to write an Erlang
> Arrow implementation from the ground up. The lack of an Erlang flatbuffer
> implementation (for IPC), SIMD support in the Erlang Virtual Machine (for
> efficient operations) and mutability (for zero-copy access; all values in
> Erlang are immutable) make a complete Arrow implementation in Erlang
> especially challenging.

Could you upstream the FlatBuffers part to
https://github.com/google/flatbuffers instead of having us maintain
it? See also David's reply.

> An alternative could be to handle serializations in Erlang and operations
> with the C bindings. We could also start with a minimal implementation with
> bindings to nanoarrow and deprecate that in favour of the Erlang one later.

It seems that there is Rust code in your repository:
https://github.com/Benjamin-Philip/serde_arrow/tree/main/native/arrow_format_nif

How about starting a new implementation as arrow-rs bindings?
nanoarrow provides only serialization/deserialization features, but
arrow-rs provides more, such as computation features.

> Upstreaming a fully compliant Erlang implementation could potentially be a
> multi-year project. This might also include writing an Erlang flatbuffers
> implementation. This will also be an additional implementation for the
> Arrow team to maintain, though I would be happy to aid in developing and
> maintaining it. What are the steps to get this going?

See my comments above.

> How are implementations out of the mono repo tested? Is there any guide for
> setting up integration testing and benchmarking in third-party
> implementations? So far I've had to roll my own minimal tooling for what
> archery supports, and I would prefer if I could integrate with
> archery instead.

We need to add a tester to Archery. For example,
https://github.com/apache/arrow/blob/main/dev/archery/archery/integration/tester_go.py
is a tester for Go.

FYI: There is a PR that adds a generic tester for implementations
not in apache/arrow:
https://github.com/apache/arrow/pull/46530

> Additionally, the initial work for this project was sponsored by the Erlang
> Ecosystem Foundation[5]. Would this be an issue when transferring
> stewardship to the ASF?

If the Erlang Ecosystem Foundation is also a copyright holder and we
choose the donation process, the Erlang Ecosystem Foundation also
needs to sign a contributor license agreement.


Thanks,
-- 
kou