HadrienG2 opened a new issue, #5700:
URL: https://github.com/apache/arrow-rs/issues/5700

   # Motivation
   
   There are a number of things that make using parquet+arrow feel like writing 
Python in Rust syntax today:
   
   - Everywhere, one needs to deal with `&dyn Any` values that must be downcast to concrete types which can only be known by checking the documentation.
   - Everything is nullable, often without an easy way to opt out of 
nullability.
   - Every offset is `i32` or `i64`, even in situations where there is no sane 
use case for negative offsets.
   - API methods are not orthogonal. For example, to insert an entry into a `MapBuilder`, you must insert the key with one function, insert the value with a second one, and commit the entry with a third one. Forget any of these calls, or interleave unrelated ones in the middle, and you will either get a panic or a result that is not what you wanted.
   - More generally, to get to the thing you want when reading a file, you often need to go through multiple layers of fallible downcasting from weak types to strong types, when ideally you would instead want to specify the type schema early on, check for errors there, and be done with it.
   
   Because of this, the API of the arrow and parquet crates often feels very foreign and frustrating to a Rust developer who is accustomed to strongly typed I/O abstractions like serde. As a result, in a recent project of mine that uses parquet for disk I/O, I ended up reimplementing a columnar in-memory data format myself, which is only converted to arrow arrays at I/O time, simply because the resulting ergonomics were so much better. In my opinion, this is a failure of the current arrow API design.
   
   While some of this weak typing is genuinely necessary (e.g. I don't think we can let a user open and analyze arbitrary parquet files of unknown content without it), it seems to me that in almost every use case, people actually know the types of the columns of the files they are reading and writing, and would be willing to write a tiny amount of code to specify them when files are opened or arrays are created, in exchange for the convenience of doing away with most of the above ceremony.
   
   
   # Proposal
   
   To avoid making the initial post of this issue excessively long, I'll only cover array creation, since that is where arrow is closest to strong typing today (structs being the main pain point), and thus where the explanation is quickest. However, I believe the API design that I'm describing can scale to all other arrow operations, including e.g. readout of parquet files.
   
   Today, arrow needs about 100 entities (types, typedefs...) in `arrow::builder` to achieve some degree of strong typing, and still does so at the expense of hard-to-use struct support. Instead, I believe better overall ergonomics and struct support could be achieved with a single `TypedBuilder<T>` type, at the cost of accepting the usability tradeoff that not all `TypedBuilder` methods are available for all type parameters `T` (a Rust API design feature whose use is not terribly common, but still occasionally seen in e.g. the standard library).
   
   Let's first see how this works with primitive types and collections of primitive types, then I'll build upon it to explain how this formalism can be extended to greatly improve the support for structs in arrow.
   
   
   ## Primitive types and nullability
   
   First, given a primitive type `P`, the API of `TypedBuilder<P>` is largely equivalent to that of the existing `PrimitiveBuilder`, but with support for nulls disabled (there is no way to insert a null into the array being built).
   
   Since we're going through the trouble of redesigning the API, though, I 
would suggest using the opportunity to switch to "push" instead of "append" as 
a verb for adding things, because that terminology is universal in Rust's 
array-like types and arrow-rs is the outlier here.
   
   If you want support for nulls, you can use `TypedBuilder<Option<P>>` 
instead, which allows for nulls via the standard `Some`/`None` mechanism. So 
far, this is all very easy.
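
   To make this concrete, here is a minimal sketch of what using these two flavors of the proposed builder could look like. None of this exists today; the names are simply the ones suggested above:

```rust
// Non-nullable flavor: the type system leaves no way to insert a null.
let mut ints = TypedBuilder::<u32>::new();
ints.push(42);
ints.push(7);

// Nullable flavor: nulls go through the standard Some/None mechanism.
let mut maybe_ints = TypedBuilder::<Option<u32>>::new();
maybe_ints.push(Some(42));
maybe_ints.push(None);
```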
   
   
   ## Lists
   
   Want an array of lists of primitive types? You can use `TypedBuilder<List<P>>`. In this flavor of the `TypedBuilder` API, the method for inserting a single element is replaced with the following methods (illustrated in the sketch after this list):
   
   * A `push(&mut self, &[P])` method that appends a new list to the array in a single transaction, orthogonally to other methods. It requires you to have a pre-made slice, however, which is not always the case, and therefore you also have...
   * A `start_push(&mut self) -> ListBuilder<'_, P>` method that returns an RAII guard which you can use to push elements into the list one by one. While this RAII guard is active, other methods of `TypedBuilder<List<P>>`, including other calls to `start_push`, cannot be used. Once the RAII guard is dropped, it terminates the current list like the existing `arrow::builder::GenericListBuilder::append(true)` workflow.
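
   Here is a rough sketch of how both insertion paths could look in practice (again purely hypothetical, since none of this is implemented yet):

```rust
let mut lists = TypedBuilder::<List<u32>>::new();

// Transactional push of a pre-made slice.
lists.push(&[1, 2, 3]);

// Element-by-element push through the RAII guard; dropping the guard
// commits the current list, like GenericListBuilder::append(true) today.
{
    let mut sublist = lists.start_push();
    sublist.push(4);
    sublist.push(5);
} // the list [4, 5] is terminated here
```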
   
   Fixed-size lists are more interesting, because this notion can be mapped into the Rust type system in two different ways (both flavors are sketched after this list):
   
   - One flavor where the size of the list is known at compile time, which we could model as `TypedBuilder<FixedList<P, Const<N>>>`. In this flavor of the fixed list API, we can have methods that verify at compile time that the sublists being pushed have the right length, using API designs like `push(&mut self, &[P; N])`.
   - One flavor where the size of the list is not known at compile time, which 
we could model as `TypedBuilder<FixedList<P, Dyn>>`, or even 
`TypedBuilder<FixedList<P>>` if we make `Dyn` a default type parameter of 
`FixedList`. This flavor of the fixed list API mostly looks and feels like the 
dynamically sized `TypedBuilder<List<P>>` flavor, except the size of sublists 
is specified at construction time and checked at insertion time, as in the 
existing `arrow::builder::FixedSizeListBuilder`.
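
   A sketch of the two flavors, assuming a hypothetical `new_with_size()` constructor for the run-time sized case:

```rust
// Compile-time known length: wrong sublist lengths are rejected by the compiler.
let mut fixed = TypedBuilder::<FixedList<f32, Const<3>>>::new();
fixed.push(&[1.0, 2.0, 3.0]);       // takes a &[f32; 3]

// Run-time known length: the length is given at construction time and
// checked on every insertion, as in today's FixedSizeListBuilder.
let mut dynamic = TypedBuilder::<FixedList<f32>>::new_with_size(3);
dynamic.push(&[1.0, 2.0, 3.0]);     // takes a &[f32], length checked at run time
```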
   
   
   ## Structs
   
   Finally, we get to the point where I can talk about structs, which I believe 
are the biggest improvement provided by this API redesign.
   
   So far, the arrow-rs API has largely modeled structs as vectors of fields in the Rust type system. This is maximally flexible, because it allows structs to have a number of fields that is not known at compile time. But I believe that in >90% of use cases, this flexibility is not needed, and users would be able to do what they want with an API that models structs as tuples of types whose length is known at compile time. What such an API would give us is, of course, better integration into a strongly typed programming language like Rust.
   
   Unfortunately, this is also the point where we start to hit more serious limitations of the Rust type system, such as the lack of variadic generics, which prevents the existence of a `Struct<T, U, V, ...>` type. Nothing that can't be worked around with tuples, though.
   
   ### Primitive members
   
   Let's start easy with structs whose members are all primitive types. In the 
proposed API, this would be modeled as a typed builder of tuples, i.e. 
`TypedBuilder<(P1, P2, P3)>`.
   
   As you would expect, this typed builder would have a `push(&mut self, (P1, 
P2, P3))` method that takes a tuple of elements. However, as you may know, this 
can easily result in bad LLVM codegen if many elements need to be inserted.
   
   Therefore, for performance-sensitive use cases, we'll also likely want to 
have an alternate `push_slices(&mut self, (&[P1], &[P2], &[P3]))` bulk 
insertion method which checks that all slices have the same length and then 
forwards the slices to the underlying primitive array builders.
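
   In sketch form, still assuming the hypothetical API described above:

```rust
// A struct of three primitive fields, modeled as a builder of tuples.
let mut points = TypedBuilder::<(f32, f32, u64)>::new();

// Row-by-row insertion...
points.push((1.0, 2.0, 42));

// ...or bulk insertion from columnar data; all slices must have the same length.
points.push_slices((&[3.0, 4.0], &[5.0, 6.0], &[43, 44]));
```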
   
   ### Non-primitive members
   
   The `push(&mut self, (P1, P2, P3))` API design can scale to any type for which we can devise an element-wise `push()` method, which, as we've seen, is pretty much always the case. However, it will not always scale **efficiently**.
   
   As I've alluded to above, if you're starting from columnar data, repeatedly 
pushing tuples of elements is often less efficient than bulk-pushing tuples of 
slices of elements. That's an LLVM limitation that will hopefully be lifted 
someday. But what will likely never be efficient on any compiler backend is 
when the struct fields have inner structure themselves. Think nullable fields, 
or tuples of lists of tuples.
   
   This inefficiency may not be a problem if the eventual goal is to write the data to disk, because storage is so much slower than RAM that the extra overhead of reshuffling data in RAM is invisible. However, it can become unacceptable if the freshly built array is to be immediately used for in-RAM computations.
   
   In this scenario, what we will likely want is for `TypedBuilder<(T, U, V)>` to have a `start_pushes(&mut self) -> (FieldBuilder<'_, T>, FieldBuilder<'_, U>, FieldBuilder<'_, V>)` method which transforms the builder of tuples into a tuple of field builders, essentially exposing the way a struct builder is implemented under the hood.
   
   The resulting `FieldBuilder`s would mostly behave like a `TypedBuilder` of the corresponding field type, but also act as RAII guards that, when dropped, would perform the following validation actions:
   
   - If it is the first FieldBuilder to be dropped, record the number of elements that were pushed into the underlying StructBuilder.
   - If it is not the first FieldBuilder to be dropped, check that the same number of elements was pushed as into the other FieldBuilders, and panic otherwise.
   
   Since panics in destructors are frowned upon, there would obviously also be 
a `FieldBuilder::finish(self)` method that users would be strongly encouraged 
to use instead of dropping the field builder.
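
   Putting the pieces together, a columnar insertion session could look roughly like this (hypothetical API, as before):

```rust
let mut rows = TypedBuilder::<(u32, Option<f64>)>::new();

// Split the tuple builder into one FieldBuilder per field.
let (mut ids, mut values) = rows.start_pushes();
ids.push(1);
ids.push(2);
values.push(Some(0.5));
values.push(None);

// Explicit finish avoids panicking in a destructor; a mismatch in the
// number of pushed elements would be reported here.
ids.finish();
values.finish();
```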
   
   ### User-defined structs?
   
   Some may have noticed that I have taken a slight logical shortcut by saying 
that arrow structs should be modeled as Rust tuples. Wouldn't the most logical 
design be to model them as Rust structs? This is actually true, and that would 
give us, on top of the typing guarantees of tuple builders...
   
   - Automatic StructArray column names that repeat the names of struct fields.
   - In a future Rust where const generics are more powerful, we could also envision things like a `const field<const NAME: &str, T>(&mut self) -> FieldBuilder<'_, T>` method that is checked at compile time.
   
   However, since Rust doesn't have compile time reflection today, getting 
there would require implementing some derive macros. That's a big undertaking, 
for what is today a small benefit, so I don't think it should be done just for 
struct building support. But we'll come back to this question when we discuss 
unions.
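
   For illustration only, such a derive-based flavor could eventually look like this (the macro name is made up, and this is explicitly not part of the core proposal):

```rust
// Hypothetical derive that would map a Rust struct to an arrow struct,
// reusing the field names as StructArray column names.
#[derive(ArrowStruct)]
struct Point {
    x: f32,
    y: f32,
}

let mut points = TypedBuilder::<Point>::new();
points.push(Point { x: 1.0, y: 2.0 });
```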
   
   
   ## Maps
   
   Map arrays are basically arrays of lists of (key, value) tuples. Therefore, 
naive support for them can be provided using the above building blocks, with no 
extra supporting API.
   
   However, it would be more ergonomic for users if we also provided interoperability with built-in Rust map types, in the form of a `push_map(&mut self, map: impl IntoIterator<Item = (K, V)>)` method. With this, users would get a way to push a `HashMap` or a `BTreeMap` into the map array, which may not be optimally efficient from a RAM bandwidth usage perspective, but would be optimally easy, and fast enough for disk I/O use cases. Choices are good!
   
   
   ## Unions
   
   What could change my mind on derive macros, however, is the need to also 
provide support for unions in the `TypedBuilder` API.
   
   If arrow structs map to Rust structs, then arrow unions map to Rust enums. 
However, while the Rust type system has a standard anonymous struct type today 
in the form of tuples, it does not have a standard anonymous enum type. It 
wouldn't be too hard to create one, but due to lack of variadic generics, it 
would have an awkward `Union<(T, U, V)>` signature, and using it would involve 
horrible visitor APIs in the style of C++'s `std::variant`.
   
   Much better ergonomics could be achieved if users could instead slap a `#[derive(Union)]` proc macro on an enum declaration and automatically generate all the support code needed for an array of unions whose variants match the Rust enum's variants. Although this does require writing a proc macro...
   
   What the proc macro approach would give us is, of course, the ability to 
accept enum values in `TypedBuilder<Enum>::push(&mut self, Enum)`.
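
   In sketch form, with the derive macro still to be written:

```rust
// Each Rust enum variant would map to one variant of the arrow union,
// and pushing an enum value routes the payload to the right child builder.
#[derive(Union)]
enum Scalar {
    Int(i64),
    Real(f64),
    Text(String),
}

let mut scalars = TypedBuilder::<Scalar>::new();
scalars.push(Scalar::Int(42));
scalars.push(Scalar::Text("hello".to_string()));
```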
   
   
   
   # Alternatives
   
   ## Status quo
   
   Even if I hate using it, I will readily admit that there are good reasons 
why the `arrow-rs` API currently looks the way it does:
   
   - It's more general. Very advanced use cases like files with dynamically 
known numbers of array columns, or structs with dynamically known number of 
fields, cannot be supported without a dynamically typed approach. Because of 
this, the proposed `TypedBuilder` API can only be a higher-level complement of 
the current APIs, not a full replacement.
   - It's easier to implement. Although it is much easier from a user's perspective, the proposed `TypedBuilder` API will require complex trait-based dispatch machinery in order to ultimately offload user requests onto the equivalent of today's columnar builders.
   
   As a result, sticking with the status quo is an obvious alternative to what I'm proposing here. The price to pay is that users will not be able to use arrow in scenarios where it should fit their needs, or only at the expense of experiencing very painful API ergonomics.
   
   ## Dedicated builder types
   
   Even if we do agree that an API with stronger typing is desirable and worthwhile, the API design I'm proposing is not the only way to do it.
   
   For example, we could keep the current design principle of having dedicated 
builder types for most element types (MapBuilder, ListBuilder...) to make 
implementation easier and avoid the awkward API design of having 
`TypedBuilder<T>` methods depend on what `T` is.
   
   This would, however, come at the expense of having plenty of builder types 
(which I personally found quite confusing when I started learning arrow) and 
also at the expense of making structs feel more foreign and awkward compared to 
other types. As far as I know, there is no way to do strongly typed structs in 
today's Rust without some variation of the tuple-based design that I am 
proposing.
   
   ## More macros
   
   The design that I am proposing purposely uses a minimal amount of procedural 
macros. Procedural macros are hard to write, hard to test, hard to maintain, 
hard for users to reason about, and hard on IDEs. In my opinion, they should 
therefore be a last-resort tool that one only uses when no good alternative is available.
   
   But with all that being said, there is a clear argument to be made that if we're starting to build a mapping between (a subset of) the Rust type system and the Arrow data model, we could go further and map a larger subset of the Rust type system by allowing for user-defined types like structs and enums.
   
   I briefly alluded to this possibility above, but only scratched the surface of what is possible. We could go a lot further with macros, for example by turning some user struct `Foo { bar: Bar }` into a dedicated `FooBuilder` variant of the `TypedBuilder` design, which features tailor-made API adaptations for the `Foo` struct, like a `bar_field(&mut self) -> BarBuilder<'_>` method.
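
   A rough sketch of what such generated code could offer, with all names made up for illustration:

```rust
// Macro-generated builder for `struct Foo { bar: Bar }`: each field gets a
// dedicated, named accessor instead of a tuple position.
let mut foos = FooBuilder::new();
let mut bars: BarBuilder<'_> = foos.bar_field();
// ...push Bar values through `bars` here...
```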
   
   The benefits for users would be easier manipulation of individual struct fields and avoidance of humongous, deeply nested generic types. The drawbacks would be enormous implementation complexity on the arrow-rs side, and a more difficult API documentation story around macro-generated types.
   
   I'm not fully rejecting this idea, but I think that it is not a first priority, because:
   
   1. Because of orphan rules, we're going to need something like 
`TypedBuilder` to cleanly support std types anyway.
   2. Making `TypedBuilder` work will provide plenty of valuable design and 
implementation lessons that will come in very handy if the need for a builder 
generator ever arises.
   
   
   
   # Additional context
   
   This API design proposal comes with the important disclaimer that I have not 
started any implementation work, not even a prototype, because I wanted to see 
how open you were to the general idea before investing significant time and 
effort into making it work.
   
   As a result, there may be unforeseen implementation difficulties, or design 
problems that emerge only in more advanced use cases than those which I have 
thought about while drafting this on virtual paper. If you decide that this 
feature is worth having, it may evolve significantly from implementation 
feedback before it is eventually shipped to users.
   
   There are also entire aspects of the proposed API that I am handwaving away 
as "basically the same idea, but applied in a different context", for example 
how reading strongly typed data from a `RecordBatch` would work. I did this so 
this issue would be shorter, and thus easier for me to write and for you to 
read. But it may also mean that there are design problems in this area that 
will only become apparent as I start working on it.
   
   If you're interested, I can try to write a more formal and complete spec to 
reduce the amount of unexplored design regions. I just preferred not to do so 
before knowing if there is any interest from your side.
   
   Finally, when it comes to manpower, I'm a lone developer working in my spare time, with plenty of other personal projects already competing for it. So if I end up being the one who implements this feature, it may take a long while to see the light of day, if at all. Implementation assistance from other interested people with more time/financial resources would therefore be much appreciated.

