Re: [I] Proposal: Alternate APIs with stronger typing [arrow-rs]

via GitHub Mon, 29 Apr 2024 03:46:16 -0700


HadrienG2 commented on issue #5700:
URL: https://github.com/apache/arrow-rs/issues/5700#issuecomment-2082403812

Thanks for your reply!

> I believe the narrow crate has some similar goals and might be worth
checking out.

I may eventually be taking this to narrow instead, then, but I thought it
might be worthwhile to explore whether the One True Arrow implementation from
Apache wants it first. Your implementation has the most visibility, it's what
the Apache Arrow web page links to, and it's what people wanting to know what
arrow feels like in Rust will end up finding first. It is therefore sad that it
has such bad UX.

---

> API compatibility: we still struggle to preserve this, but it gets
infinitely harder with generics exposing what are often implementation details

I don't see how the API that I proposed so far leaks more implementation
details than the current design of having `&dyn` references that must be
downcast to specific types (which, in the eyes of the users, are very much
implementation details).

Just check out the example in [the documentation of
`StructBuilder`](https://docs.rs/arrow-array/51.0.0/arrow_array/builder/struct.StructBuilder.html),
specifically this comment:

> We can't obtain the ListBuilder<StructBuilder> with the expected generic
> types, because under the hood the StructBuilder was returned as a
> Box<dyn ArrayBuilder> and passed as such to the ListBuilder constructor

If you don't call that implementation details leaking through, I don't know
what it is! :)
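
To make the complaint concrete, here is a minimal std-only sketch of the two styles. Every name in it (`ErasedBuilder`, `IntBuilder`, `StrBuilder`, `ErasedList`, `TypedList`, `erased_push`) is a hypothetical stand-in invented for illustration, not arrow-rs API. It only shows the general shape of the problem: with a boxed child the caller has to guess the concrete type and handle a failed downcast, while a generic parameter keeps the type visible.

```rust
use std::any::Any;

// Hypothetical stand-in for a type-erased builder trait (NOT arrow's API).
trait ErasedBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any;
    fn len(&self) -> usize;
}

struct IntBuilder {
    values: Vec<i32>,
}

struct StrBuilder {
    values: Vec<String>,
}

impl ErasedBuilder for IntBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

impl ErasedBuilder for StrBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

// Type-erased style: the child's concrete type is forgotten at construction.
struct ErasedList {
    child: Box<dyn ErasedBuilder>,
}

impl ErasedList {
    // The caller must name the concrete type and handle a wrong guess.
    fn child_as<B: ErasedBuilder + 'static>(&mut self) -> Option<&mut B> {
        self.child.as_any_mut().downcast_mut::<B>()
    }
}

// Typed style: the child's type is part of the container's type.
struct TypedList<B> {
    child: B,
}

// Pushing through the erased API needs a runtime check that can fail.
fn erased_push(list: &mut ErasedList, value: i32) -> bool {
    match list.child_as::<IntBuilder>() {
        Some(builder) => {
            builder.values.push(value);
            true
        }
        None => false,
    }
}
```

With `TypedList<IntBuilder>`, the equivalent push is just `list.child.values.push(value)`, and a wrong child type is a compile error rather than a `None` at run time.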

---

> Type Erasure: especially when working on query engines, you very often
don't know and don't want to have to know what the type of something is.
Downcasting to the concrete type is obnoxious, verbose, and macro heavy

I agree that having a type erased API is not bad per se. What I disagree
with is having that as the only option.

---

> Compilation Time: by not exposing the generics we can ensure they get
instantiated once, and use tricks to reduce the amount of code that gets
generated. Moving away from generics significantly improved compilation times

Generic code that is written with compile-time performance in mind (with
dynamic dispatch in non-perf-critical sections) only significantly increases
compilation time if it is instantiated a lot, for a lot of different
concrete arguments. This is typically a problem for APIs that take `impl Fn()`,
are used with many different callback types, and cannot afford to use `&dyn` for
performance reasons, like iterator adapters.

I don't believe this is as much of a problem for container-like types,
however, which are what you are building in arrow. The reason is that container
types tend to be instantiated often with the same arguments, or in our case to
have recursive instantiations that are themselves instantiated with the same
arguments. The compiler knows how to deduplicate such instances.

Further, in a container type whose primary purpose is I/O, like Arrow, you
can afford to use a lot more dynamic dispatch internally than a general-purpose
container like `Vec` or `BTreeMap` would, further reducing the cost of
instantiating a generic type with a new type argument.
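
As a rough illustration of that last point, the sketch below keeps all storage logic in a non-generic `RawColumn` that is compiled once and reached through `&dyn Encode`, with a generic `TypedColumn<T>` on top that only forwards calls. All of these names (`Encode`, `Decode`, `RawColumn`, `TypedColumn`) are made up for this example and are not arrow-rs types. The assumption is simply that each new `T` then costs little more than a pair of one-line wrappers.

```rust
use std::marker::PhantomData;

// Object-safe encoding trait, so the core can take `&dyn Encode`.
trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

// Decoding needs the concrete type, so it stays in the generic layer.
trait Decode: Sized {
    const WIDTH: usize;
    fn decode(bytes: &[u8]) -> Self;
}

impl Encode for i32 {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.to_le_bytes());
    }
}

impl Decode for i32 {
    const WIDTH: usize = 4;
    fn decode(bytes: &[u8]) -> Self {
        let mut buf = [0u8; 4];
        buf.copy_from_slice(bytes);
        i32::from_le_bytes(buf)
    }
}

impl Encode for f64 {
    fn encode(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.to_le_bytes());
    }
}

impl Decode for f64 {
    const WIDTH: usize = 8;
    fn decode(bytes: &[u8]) -> Self {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(bytes);
        f64::from_le_bytes(buf)
    }
}

// Non-generic core: compiled once, whatever element types are used.
struct RawColumn {
    bytes: Vec<u8>,
    width: usize,
}

impl RawColumn {
    // Dynamic dispatch here: one copy of this function serves every type.
    fn push(&mut self, value: &dyn Encode) {
        let before = self.bytes.len();
        value.encode(&mut self.bytes);
        assert_eq!(self.bytes.len() - before, self.width, "encoded width mismatch");
    }

    fn get(&self, index: usize) -> Option<&[u8]> {
        let start = index.checked_mul(self.width)?;
        let end = start.checked_add(self.width)?;
        self.bytes.get(start..end)
    }

    fn len(&self) -> usize {
        self.bytes.len() / self.width
    }
}

// Thin generic layer: each instantiation only adds trivial forwarding code.
struct TypedColumn<T> {
    raw: RawColumn,
    _marker: PhantomData<T>,
}

impl<T: Encode + Decode> TypedColumn<T> {
    fn new() -> Self {
        TypedColumn {
            raw: RawColumn { bytes: Vec::new(), width: T::WIDTH },
            _marker: PhantomData,
        }
    }

    fn push(&mut self, value: T) {
        self.raw.push(&value);
    }

    fn get(&self, index: usize) -> Option<T> {
        self.raw.get(index).map(T::decode)
    }

    fn len(&self) -> usize {
        self.raw.len()
    }
}
```

The trade-off is one virtual call per push, which is the kind of cost an I/O-oriented container can usually absorb more easily than `Vec` could.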

---

> Codegen: LLVM is a very finicky beast, and a lot of care has gone into
making sure it properly vectorises code. This is very hard to do with generics
that are instantiated in the call site

Here, you need to bear in mind that the API which I am proposing can be
built as a relatively thin layer on top of the existing arrow code. Most
complexity would be in the trait machinery required to turn e.g. a push of
a tuple into multiple pushes of tuple elements followed by a finish.

This would largely be resolved at compile time, with run-time mostly
dispatching into the optimized code that you already have. And in cases where
it's not sufficient (e.g. complex layouts like list of tuples of lists), I'm
proposing lower-level APIs that let you more directly target the underlying
ListBuilders and StructBuilders for performance, at the expense of API
ergonomics.
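
For illustration, the kind of trait machinery I have in mind could look roughly like the sketch below. It is std-only, and `TypedBuilder` and `VecBuilder` are hypothetical names made up for this example, with plain `Vec`s standing in for the existing optimized builders. The point is only that a push of a tuple is split into per-column pushes by an impl that is resolved entirely at compile time.

```rust
// Hypothetical typed builder trait (NOT an existing arrow-rs trait).
trait TypedBuilder {
    type Item;
    type Output;
    fn push(&mut self, item: Self::Item);
    fn finish(self) -> Self::Output;
}

// Leaf builder; a `Vec` stands in for an existing optimized column builder.
struct VecBuilder<T> {
    values: Vec<T>,
}

impl<T> VecBuilder<T> {
    fn new() -> Self {
        VecBuilder { values: Vec::new() }
    }
}

impl<T> TypedBuilder for VecBuilder<T> {
    type Item = T;
    type Output = Vec<T>;

    fn push(&mut self, item: T) {
        self.values.push(item);
    }

    fn finish(self) -> Vec<T> {
        self.values
    }
}

// A pair of builders is itself a builder of pairs: one push of a tuple
// becomes one push per column, and finish is forwarded to each column.
impl<A: TypedBuilder, B: TypedBuilder> TypedBuilder for (A, B) {
    type Item = (A::Item, B::Item);
    type Output = (A::Output, B::Output);

    fn push(&mut self, item: Self::Item) {
        let (a, b) = item;
        self.0.push(a);
        self.1.push(b);
    }

    fn finish(self) -> Self::Output {
        (self.0.finish(), self.1.finish())
    }
}
```

Because the impl is recursive, nesting works without extra code: a builder of type `((VecBuilder<i32>, VecBuilder<f64>), VecBuilder<bool>)` accepts items of type `((i32, f64), bool)`.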

---

> My 2 cents is that arrow makes little sense in statically typed contexts,
specialized code will almost always win out both for ergonomics and
performance. Perhaps using crates like serde-arrow to convert back and forth
where necessary

For sure, nothing beats specialized code, but efficient I/O is also a lot of
code to write. You obviously put a lot of care into having an efficient
on-disk format and I/O patterns in arrow, and it would be a shame not to reuse
them. Or is there an alternative columnar I/O library with support for nested
structs and lists that you can suggest, with equally developed Rust support?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
