HadrienG2 opened a new issue, #5700:
URL: https://github.com/apache/arrow-rs/issues/5700
# Motivation
There are a number of things that make using parquet+arrow feel like writing
Python in Rust syntax today:
- Everywhere one needs to deal with `&dyn Any`s that must be downcast to
concrete types that can only be known by checking the documentation.
- Everything is nullable, often without an easy way to opt out of
nullability.
- Every offset is `i32` or `i64`, even in situations where there is no sane
use case for negative offsets.
- API methods are not orthogonal. For example, if you want to insert an entry
in a `MapBuilder` you must insert a key with one function, insert a value with
a different function, and finish the entry with a third function. Forget any of
these, or insert unrelated calls in the middle, and either you will get a panic
or it will not do what you want.
- More generally, to get to the thing you want when reading a file, you
often need to go through multiple layers of fallible downcasting from weak
types to strong types, when ideally you would instead specify the type schema
early on, check for errors there, and be done with it.
Because of this, the API of the arrow and parquet crates often feels very
foreign and frustrating to a Rust developer who is accustomed to strongly
typed I/O abstractions like serde. As a result, in a recent project of mine
that uses parquet for disk I/O, I ended up choosing to reimplement a
columnar in-memory data format myself, which is only converted to arrow arrays
at I/O time, simply because the resulting ergonomics were so much better. In my
opinion, this is a failure of the current arrow API design.
While some of this weak typing is genuinely necessary (e.g. I don't think
we can allow a user to open and analyze arbitrary parquet files of unknown
content without it), it seems to me that in almost every use case, people
actually know the types of the columns of the files they are reading and
writing, and would be willing to write a tiny amount of code to specify them
at the time where files are opened or arrays are created, in exchange for the
convenience of doing away with most of the above ceremony.
# Proposal
To avoid making the initial post of this issue excessively long, I'll only
cover array creation, since that is where arrow is the closest to strong typing
today, with structs being the main pain point, and thus the explanation is
quickest. However, I believe the API design that I'm describing can scale to
all other arrow operations, including e.g. readout of parquet files.
Today, arrow needs about 100 entities (types + typedefs...) in
`arrow::builder` to achieve some degree of strong typing, and still does so at
the expense of hard-to-use struct support. Instead, I believe better overall
ergonomics and struct support could be achieved with a single `TypedBuilder<T>`
type, at the expense of accepting the usability tradeoff that not all
`TypedBuilder` methods are available for all type parameters `T` (a Rust API
design feature whose use is not terribly common, but still occasionally seen in
e.g. the standard library).
Let's first see how this works with primitive types and collections of
primitive types, then I'll build upon that to explain how this formalism
can be expanded to greatly improve the support for structs in arrow.
## Primitive types and nullability
First, given `P` a primitive type, the API of `TypedBuilder<P>` is largely
equivalent to that of the existing `PrimitiveBuilder`, but with support for
nulls disabled (there is no way to insert a null in the array being built).
Since we're going through the trouble of redesigning the API, though, I
would suggest using the opportunity to switch to "push" instead of "append" as
a verb for adding things, because that terminology is universal in Rust's
array-like types and arrow-rs is the outlier here.
If you want support for nulls, you can use `TypedBuilder<Option<P>>`
instead, which allows for nulls via the standard `Some`/`None` mechanism. So
far, this is all very easy.
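As a rough illustration, here is a minimal mock of what this could look like. All names are hypothetical; a real implementation would write into Arrow buffers and keep a validity bitmap rather than use a plain `Vec`:

```rust
/// Hypothetical mock of the proposed `TypedBuilder<T>` (not arrow-rs API).
struct TypedBuilder<T> {
    values: Vec<T>,
}

impl<T> TypedBuilder<T> {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// "push", matching Vec terminology, rather than arrow-rs's "append".
    fn push(&mut self, value: T) {
        self.values.push(value);
    }

    fn len(&self) -> usize {
        self.values.len()
    }
}

fn main() {
    // Non-nullable flavor: the type system leaves no way to push a null.
    let mut ints = TypedBuilder::<i32>::new();
    ints.push(1);
    ints.push(2);
    // ints.push(None); // would not compile

    // Nullable flavor: nulls via the standard Some/None mechanism.
    let mut maybe_ints = TypedBuilder::<Option<i32>>::new();
    maybe_ints.push(Some(1));
    maybe_ints.push(None);

    assert_eq!(ints.len(), 2);
    assert_eq!(maybe_ints.len(), 2);
}
```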
## Lists
Want an array of lists of primitive types? You can use
`TypedBuilder<List<P>>`. In this flavor of the `TypedBuilder` API, the method
for inserting a single element is replaced with three methods:
* A `push(&mut self, &[P])` method that appends a new list to the array in a
single transaction, orthogonal to other methods. But it requires you to have a
pre-made slice, which is not always the case, and therefore you also have...
* A `start_push(&mut self) -> ListBuilder<'_, P>` method that returns an RAII guard
that you can use to push elements to the list one by one. While this RAII guard
is active, other methods of `TypedBuilder<List<P>>`, including other calls to
`start_push`, cannot be used. Once the RAII guard is dropped, it terminates the
current list like the existing
`arrow::builder::GenericListBuilder::append(true)` workflow.
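A minimal sketch of this two-method design, with a hypothetical `ListGuard` standing in for the proposed RAII guard (the `&mut` borrow it holds is what statically prevents other builder calls while a list is in flight):

```rust
/// Hypothetical list-array builder: values plus the end offset of each list.
struct TypedListBuilder<P> {
    values: Vec<P>,
    offsets: Vec<usize>,
}

impl<P: Clone> TypedListBuilder<P> {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: Vec::new() }
    }

    /// Single-transaction push of a pre-made slice.
    fn push(&mut self, list: &[P]) {
        self.values.extend_from_slice(list);
        self.offsets.push(self.values.len());
    }

    /// Element-by-element push; the guard's exclusive borrow blocks all
    /// other uses of the builder until the list is finished.
    fn start_push(&mut self) -> ListGuard<'_, P> {
        ListGuard { builder: self }
    }

    fn num_lists(&self) -> usize {
        self.offsets.len()
    }
}

struct ListGuard<'a, P> {
    builder: &'a mut TypedListBuilder<P>,
}

impl<P> ListGuard<'_, P> {
    fn push(&mut self, value: P) {
        self.builder.values.push(value);
    }
}

impl<P> Drop for ListGuard<'_, P> {
    /// Dropping the guard terminates the current list, like today's
    /// GenericListBuilder::append(true) workflow.
    fn drop(&mut self) {
        self.builder.offsets.push(self.builder.values.len());
    }
}

fn main() {
    let mut b = TypedListBuilder::new();
    b.push(&[1, 2, 3]);
    {
        let mut guard = b.start_push();
        guard.push(4);
        guard.push(5);
    } // guard dropped => the list [4, 5] is terminated
    assert_eq!(b.num_lists(), 2);
    assert_eq!(b.offsets, vec![3, 5]);
}
```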
Fixed-size lists are more interesting because this notion can be mapped
into the Rust type system in two different ways:
- One flavor where the size of the list is known at compile time, which we
could model as `TypedBuilder<FixedList<P, Const<N>>>`. In this flavor of the
fixed list API, we can have methods which verify at compile time that the
sublists being pushed have the right length, using API designs like
`push(&mut self, &[P; N])`.
- One flavor where the size of the list is not known at compile time, which
we could model as `TypedBuilder<FixedList<P, Dyn>>`, or even
`TypedBuilder<FixedList<P>>` if we make `Dyn` a default type parameter of
`FixedList`. This flavor of the fixed list API mostly looks and feels like the
dynamically sized `TypedBuilder<List<P>>` flavor, except the size of sublists
is specified at construction time and checked at insertion time, as in the
existing `arrow::builder::FixedSizeListBuilder`.
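The compile-time-sized flavor is a natural fit for const generics. A minimal sketch with a hypothetical `FixedListBuilder<P, N>`:

```rust
/// Hypothetical fixed-size-list builder whose sublist length N is a const
/// generic, standing in for TypedBuilder<FixedList<P, Const<N>>>.
struct FixedListBuilder<P, const N: usize> {
    values: Vec<P>,
}

impl<P: Clone, const N: usize> FixedListBuilder<P, N> {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Accepting `&[P; N]` turns a wrong-length sublist into a type error,
    /// rather than the runtime panic of today's FixedSizeListBuilder.
    fn push(&mut self, list: &[P; N]) {
        self.values.extend_from_slice(list);
    }

    fn num_lists(&self) -> usize {
        self.values.len() / N
    }
}

fn main() {
    let mut b = FixedListBuilder::<i32, 3>::new();
    b.push(&[1, 2, 3]);
    b.push(&[4, 5, 6]);
    // b.push(&[7, 8]); // would not compile: expected &[i32; 3]
    assert_eq!(b.num_lists(), 2);
}
```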
## Structs
Finally, we get to the point where I can talk about structs, which I believe
are the biggest improvement provided by this API redesign.
So far, the arrow-rs API has largely modeled structs as vectors of fields in
the Rust type system. This is maximally flexible, because it allows structs to
have a number of fields that is not known at compile time. But I believe that
in >90% of use cases, this flexibility is not needed, and users would be able
to do what they want with an API that models structs as tuples of types whose
length is known at compile time. What such an API would give us is, of
course, better integration into a strongly typed programming language like Rust.
Unfortunately, this is also the point where we start to hit more serious
limitations of the Rust type system, such as the lack of variadic generics,
which prevents the existence of a `Struct<T, U, V, ...>` type. Nothing that can't be
worked around with tuples, though.
### Primitive members
Let's start easy with structs whose members are all primitive types. In the
proposed API, this would be modeled as a typed builder of tuples, i.e.
`TypedBuilder<(P1, P2, P3)>`.
As you would expect, this typed builder would have a `push(&mut self, (P1,
P2, P3))` method that takes a tuple of elements. However, as you may know, this
can easily result in bad LLVM codegen if many elements need to be inserted.
Therefore, for performance-sensitive use cases, we'll also likely want to
have an alternate `push_slices(&mut self, (&[P1], &[P2], &[P3]))` bulk
insertion method which checks that all slices have the same length and then
forwards the slices to the underlying primitive array builders.
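A minimal sketch of both entry points for a hypothetical two-field builder (a real `TypedBuilder<(P1, P2)>` would dispatch to primitive array builders instead of `Vec`s):

```rust
/// Hypothetical builder for a struct with two primitive fields, standing
/// in for TypedBuilder<(P1, P2)>.
struct TupleBuilder<P1, P2> {
    col1: Vec<P1>,
    col2: Vec<P2>,
}

impl<P1: Clone, P2: Clone> TupleBuilder<P1, P2> {
    fn new() -> Self {
        Self { col1: Vec::new(), col2: Vec::new() }
    }

    /// Row-wise push of one struct value.
    fn push(&mut self, (v1, v2): (P1, P2)) {
        self.col1.push(v1);
        self.col2.push(v2);
    }

    /// Column-wise bulk push: check lengths once, then forward each slice
    /// to the matching column, which typically optimizes much better.
    fn push_slices(&mut self, (s1, s2): (&[P1], &[P2])) {
        assert_eq!(s1.len(), s2.len(), "all columns must have the same length");
        self.col1.extend_from_slice(s1);
        self.col2.extend_from_slice(s2);
    }

    fn len(&self) -> usize {
        self.col1.len()
    }
}

fn main() {
    let mut b = TupleBuilder::new();
    b.push((1i32, 2.0f64));
    b.push_slices((&[2, 3], &[4.0, 5.0]));
    assert_eq!(b.len(), 3);
}
```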
### Non-primitive members
The `push(&mut self, (P1, P2, P3))` API design can scale to any type for
which we can devise an element-wise `push()` method. Which, as we've seen, is
pretty much always the case. However, it will not always scale **efficiently**.
As I've alluded to above, if you're starting from columnar data, repeatedly
pushing tuples of elements is often less efficient than bulk-pushing tuples of
slices of elements. That's an LLVM limitation that will hopefully be lifted
someday. But what will likely never be efficient on any compiler backend is
when the struct fields have inner structure themselves. Think nullable fields,
or tuples of lists of tuples.
This inefficiency may not be a problem if the eventual goal is to write down
the data to disk, because storage is so much slower than RAM that the extra
overhead of reshuffling data in RAM is invisible. However, it can become
unacceptable if the freshly built array is to be immediately used for in-RAM
computations.
In this scenario, what we will likely want is for `TypedBuilder<(T, U, V)>`
to have a `start_pushes(&mut self) -> (FieldBuilder<'_, T>, FieldBuilder<'_,
U>, FieldBuilder<'_, V>)` method which transforms the builder of tuples into a
tuple of builders of fields, essentially exposing the way a struct builder is
implemented under the hood.
The resulting `FieldBuilder`s would mostly behave like a `TypedBuilder` of
the corresponding field type, but also act as an RAII guard that, when dropped,
would perform the following validation actions:
- If it is the first FieldBuilder to be dropped, record the number of
elements that were pushed into the underlying StructBuilder.
- If it is not the first FieldBuilder to be dropped, check that the same
number of elements were pushed as into the other FieldBuilders, otherwise panic.
Since panics in destructors are frowned upon, there would obviously also be
a `FieldBuilder::finish(self)` method that users would be strongly encouraged
to use instead of dropping the field builder.
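A minimal sketch of this guard-based validation for a hypothetical two-field builder. The fields borrow disjoint columns plus a shared length cell, so both guards can coexist; `finish()` runs the check eagerly and skips the destructor:

```rust
use std::cell::Cell;

/// Hypothetical struct builder with a shared "expected length" cell that
/// the field guards use to validate each other on completion.
struct StructBuilder<P1, P2> {
    col1: Vec<P1>,
    col2: Vec<P2>,
    expected: Cell<Option<usize>>,
}

impl<P1, P2> StructBuilder<P1, P2> {
    /// Split the builder of tuples into a tuple of field builders.
    fn start_pushes(&mut self) -> (FieldBuilder<'_, P1>, FieldBuilder<'_, P2>) {
        (
            FieldBuilder { col: &mut self.col1, expected: &self.expected },
            FieldBuilder { col: &mut self.col2, expected: &self.expected },
        )
    }
}

struct FieldBuilder<'a, P> {
    col: &'a mut Vec<P>,
    expected: &'a Cell<Option<usize>>,
}

impl<P> FieldBuilder<'_, P> {
    fn push(&mut self, value: P) {
        self.col.push(value);
    }

    /// First guard to complete records its length; later ones must match it.
    fn check(&self) {
        let len = self.col.len();
        match self.expected.get() {
            None => self.expected.set(Some(len)),
            Some(n) => assert_eq!(len, n, "field lengths out of sync"),
        }
    }

    /// Preferred over relying on Drop, since panicking in a destructor
    /// is frowned upon.
    fn finish(self) {
        self.check();
        std::mem::forget(self); // the check already ran, skip Drop
    }
}

impl<P> Drop for FieldBuilder<'_, P> {
    fn drop(&mut self) {
        self.check();
    }
}

fn main() {
    let mut b = StructBuilder {
        col1: Vec::new(),
        col2: Vec::new(),
        expected: Cell::new(None),
    };
    {
        let (mut f1, mut f2) = b.start_pushes();
        f1.push(1i32);
        f1.push(2);
        f2.push("a");
        f2.push("b");
        f1.finish();
        f2.finish(); // would panic if the two lengths disagreed
    }
    assert_eq!(b.col1.len(), 2);
    assert_eq!(b.col2.len(), 2);
}
```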
### User-defined structs?
Some may have noticed that I have taken a slight logical shortcut by saying
that arrow structs should be modeled as Rust tuples. Wouldn't the most logical
design be to model them as Rust structs? This is actually true, and that would
give us, on top of the typing guarantees of tuple builders...
- Automatic StructArray column names that repeat the names of struct fields.
- In a future Rust where const generics are more powerful, we could also
envision things like a `const field<const NAME: &str, T>(&mut self) ->
FieldBuilder<'_, T>` method that is checked at compile time.
However, since Rust doesn't have compile time reflection today, getting
there would require implementing some derive macros. That's a big undertaking,
for what is today a small benefit, so I don't think it should be done just for
struct building support. But we'll come back to this question when we discuss
unions.
## Maps
Map arrays are basically arrays of lists of (key, value) tuples. Therefore,
naive support for them can be provided using the above building blocks, with no
extra supporting API.
However, it would be more ergonomic for users if we also provided
interoperability with built-in rust map types in the form of a `push_map(&mut
self, map: impl IntoIterator<Item = (K, V)>)` method. With this, users would
get a way to push a `HashMap` or a `BTreeMap` into the map array which may not
be optimally efficient from a RAM bandwidth usage perspective, but would be
optimally easy and fast enough for disk I/O use cases. Choices are good!
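A minimal sketch of this `push_map` interoperability, with a hypothetical `MapArrayBuilder` storing flattened keys/values plus per-entry end offsets:

```rust
use std::collections::BTreeMap;

/// Hypothetical map-array builder: flattened (key, value) columns plus the
/// end offset of each map entry.
struct MapArrayBuilder<K, V> {
    keys: Vec<K>,
    values: Vec<V>,
    offsets: Vec<usize>,
}

impl<K, V> MapArrayBuilder<K, V> {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new(), offsets: Vec::new() }
    }

    /// Push a whole Rust map (HashMap, BTreeMap, slice of pairs...) as one
    /// entry of the map array.
    fn push_map(&mut self, map: impl IntoIterator<Item = (K, V)>) {
        for (k, v) in map {
            self.keys.push(k);
            self.values.push(v);
        }
        self.offsets.push(self.keys.len());
    }

    fn num_maps(&self) -> usize {
        self.offsets.len()
    }
}

fn main() {
    let mut b = MapArrayBuilder::new();
    let mut m = BTreeMap::new();
    m.insert("a", 1);
    m.insert("b", 2);
    b.push_map(m);
    b.push_map([("c", 3)]);
    assert_eq!(b.num_maps(), 2);
    assert_eq!(b.keys.len(), 3);
}
```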
## Unions
What could change my mind on derive macros, however, is the need to also
provide support for unions in the `TypedBuilder` API.
If arrow structs map to Rust structs, then arrow unions map to Rust enums.
However, while the Rust type system has a standard anonymous struct type today
in the form of tuples, it does not have a standard anonymous enum type. It
wouldn't be too hard to create one, but due to lack of variadic generics, it
would have an awkward `Union<(T, U, V)>` signature, and using it would involve
horrible visitor APIs in the style of C++'s `std::variant`.
Much better ergonomics could be achieved if users could instead slap a
`#[derive(Union)]` proc macro on an enum declaration and automatically
generate all the support code needed for an array of unions whose variants
match the Rust enum's variants. Although this does require writing a proc macro...
What the proc macro approach would give us is, of course, the ability to
accept enum values in `TypedBuilder<Enum>::push(&mut self, Enum)`.
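To make this concrete, here is a hand-written sketch of roughly what such a derive could expand to for one enum, using Arrow's dense union layout (one type id and one child offset per element, plus one child builder per variant). Everything here is hypothetical:

```rust
/// Example user enum that #[derive(Union)] would be applied to.
enum Number {
    Int(i64),
    Float(f64),
}

/// What the hypothetical derive could generate: a dense-union builder.
#[derive(Default)]
struct NumberUnionBuilder {
    type_ids: Vec<i8>,   // which variant each element is
    offsets: Vec<i32>,   // index into the matching child builder
    ints: Vec<i64>,      // child builder for Number::Int
    floats: Vec<f64>,    // child builder for Number::Float
}

impl NumberUnionBuilder {
    fn push(&mut self, value: Number) {
        match value {
            Number::Int(i) => {
                self.type_ids.push(0);
                self.offsets.push(self.ints.len() as i32);
                self.ints.push(i);
            }
            Number::Float(f) => {
                self.type_ids.push(1);
                self.offsets.push(self.floats.len() as i32);
                self.floats.push(f);
            }
        }
    }

    fn len(&self) -> usize {
        self.type_ids.len()
    }
}

fn main() {
    let mut b = NumberUnionBuilder::default();
    b.push(Number::Int(1));
    b.push(Number::Float(2.5));
    b.push(Number::Int(3));
    assert_eq!(b.len(), 3);
    assert_eq!(b.type_ids, vec![0, 1, 0]);
    assert_eq!(b.ints, vec![1, 3]);
}
```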
# Alternatives
## Status quo
Even if I hate using it, I will readily admit that there are good reasons
why the `arrow-rs` API currently looks the way it does:
- It's more general. Very advanced use cases like files with dynamically
known numbers of array columns, or structs with dynamically known number of
fields, cannot be supported without a dynamically typed approach. Because of
this, the proposed `TypedBuilder` API can only be a higher-level complement of
the current APIs, not a full replacement.
- It's easier to implement. Although much easier from a user's perspective,
the proposed `TypedBuilder` API will require a complex trait-based dispatch
machinery in order to ultimately offload user requests into the equivalent of
today's columnar builders.
As a result, sticking with the status quo is an obvious alternative to what
I'm proposing here. The price to pay is that users will not be able to use
arrow in scenarios where it should fit their needs, or only at the expense of
very painful API ergonomics.
## Dedicated builder types
Even if we agree that an API with stronger typing is desirable and
worthwhile, the API design I'm proposing is not the only way to do it.
For example, we could keep the current design principle of having dedicated
builder types for most element types (MapBuilder, ListBuilder...) to make
implementation easier and avoid the awkward API design of having
`TypedBuilder<T>` methods depend on what `T` is.
This would, however, come at the expense of having plenty of builder types
(which I personally found quite confusing when I started learning arrow) and
also at the expense of making structs feel more foreign and awkward compared to
other types. As far as I know, there is no way to do strongly typed structs in
today's Rust without some variation of the tuple-based design that I am
proposing.
## More macros
The design that I am proposing purposely uses a minimal amount of procedural
macros. Procedural macros are hard to write, hard to test, hard to maintain,
hard for users to reason about, and hard on IDEs. In my opinion, they should
therefore be a last-resort tool that one only uses when no good alternative
is available.
But with all that being said, there is a clear argument to be made that if
we're starting to build a mapping between (a subset of) the Rust type system
and the Arrow data model, we could go further and map a larger subset of the
Rust type system by allowing for user-defined types, like structs and enums.
I briefly alluded to this possibility above, but only touched the surface of
what is possible. We could go a lot further with macros, for example by turning
some user struct `Foo { bar: Bar }` into a dedicated `FooBuilder` variant of
`TypedBuilder` design, which features tailor-made API adaptations for the `Foo`
struct like a `bar_field(&mut self) -> BarBuilder<'_>` method.
The benefits for users would be easier manipulation of individual struct
fields and avoidance of humongous deeply nested generic types. The drawbacks
would be enormous implementation complexity on the arrow-rs side, and a more
difficult API documentation story around macro-generated types.
I'm not fully rejecting this idea, but I think it is not a first priority
because:
1. Due to orphan rules, we're going to need something like
`TypedBuilder` to cleanly support std types anyway.
2. Making `TypedBuilder` work will provide plenty of valuable design and
implementation lessons that will come in very handy if the need for a builder
generator ever arises.
# Additional context
This API design proposal comes with the important disclaimer that I have not
started any implementation work, not even a prototype, because I wanted to see
how open you were to the general idea before investing significant time and
effort into making it work.
As a result, there may be unforeseen implementation difficulties, or design
problems that emerge only in more advanced use cases than those which I have
thought about while drafting this on virtual paper. If you decide that this
feature is worth having, it may evolve significantly from implementation
feedback before it is eventually shipped to users.
There are also entire aspects of the proposed API that I am handwaving away
as "basically the same idea, but applied in a different context", for example
how reading strongly typed data from a `RecordBatch` would work. I did this so
this issue would be shorter, and thus easier for me to write and for you to
read. But it may also mean that there are design problems in this area that
will only become apparent as I start working on it.
If you're interested, I can try to write a more formal and complete spec to
reduce the amount of unexplored design regions. I just preferred not to do so
before knowing if there is any interest from your side.
Finally, when it comes to manpower, I'm a lone developer working in my spare
time, with plenty of other personal projects already competing for it. So if I
end up being the one who implements this feature, it may take a long while to
see the light of day, if at all. Implementation assistance from other
interested people with more time/financial resources would therefore be much
appreciated.