HadrienG2 commented on issue #5700:
URL: https://github.com/apache/arrow-rs/issues/5700#issuecomment-2082403812

Thanks for your reply!

> I believe the narrow crate has some similar goals and might be worth checking out.

I may eventually be taking this to narrow instead, then, but I thought it might be worthwhile to first explore whether the One True Arrow implementation from Apache wants it. Your implementation has the most visibility, it's what the Apache Arrow web page links to, and it's what people wanting to know what arrow feels like in Rust will end up finding first. It is therefore sad that it has such bad UX.

---

> API compatibility: we still struggle to preserve this, but it gets infinitely harder with generics exposing what are often implementation details

I don't see how the API that I proposed so far leaks more implementation details than the current design of having `&dyn`s that must be downcast to specific types (which, in the eyes of the users, are very much implementation details). Just check out the example in [the documentation of `StructBuilder`](https://docs.rs/arrow-array/51.0.0/arrow_array/builder/struct.StructBuilder.html), specifically this comment:

> We can't obtain the `ListBuilder<StructBuilder>` with the expected generic types, because under the hood
> the `StructBuilder` was returned as a `Box<dyn ArrayBuilder>` and passed as such to the `ListBuilder` constructor

If you don't call that implementation details leaking through, I don't know what it is! :)

---

> Type Erasure: especially when working on query engines, you very often don't know and don't want to have to know what the type of something is. Downcasting to the concrete type is obnoxious, verbose, and macro heavy

I agree that having a type-erased API is not bad per se. What I disagree with is having that as the only option.

---

> Compilation Time: by not exposing the generics we can ensure they get instantiated once, and use tricks to reduce the amount of code that gets generated. Moving away from generics significantly improved compilation times

Generic code that is written with compile-time performance in mind (with dynamic dispatch in non-perf-critical sections) only significantly increases compilation time if it is instantiated a lot, with many different concrete arguments. This is typically a problem for APIs that take `impl Fn()`, are used with many different callback types, and cannot afford to use `&dyn` for performance reasons, like iterator adapters.

I don't believe this is as much of a problem for container-like types, however, which are what you are building in arrow. The reason is that container types tend to be instantiated with the same arguments over and over, or in our case to have recursive instantiations that are themselves instantiated with the same arguments. The compiler knows how to deduplicate such instances.

Further, in a container type whose primary purpose is I/O, like Arrow, you can afford to use a lot more dynamic dispatch internally than a general-purpose container like `Vec` or `BTreeMap` would, further reducing the cost of instantiating a generic type with a new type argument.

---

> Codegen: LLVM is a very finicky beast, and a lot of care has gone into making sure it properly vectorises code. This is very hard to do with generics that are instantiated in the call site

Here, you need to bear in mind that the API which I am proposing can be built as a relatively thin layer on top of the existing arrow code. Most of the complexity would be in the trait machinery required to turn e.g. a push of a tuple into multiple pushes of tuple elements followed by a finish. This would largely be resolved at compile time, with the run-time work mostly dispatching into the optimized code that you already have.

And in cases where that is not sufficient (e.g. complex layouts like a list of tuples of lists), I'm proposing lower-level APIs that let you more directly target the underlying `ListBuilder`s and `StructBuilder`s for performance, at the expense of API ergonomics.

---

> My 2 cents is that arrow makes little sense in statically typed contexts, specialized code will almost always win out both for ergonomics and performance. Perhaps using crates like serde-arrow to convert back and forth where necessary

For sure, nothing beats specialized code, but efficient I/O is also a lot of code to write. You obviously put a lot of care into having an efficient on-disk format and I/O patterns in arrow, and it would be a shame not to reuse them. Or is there an alternative columnar I/O library with support for nested structs and lists, and with equally developed Rust support, that you can suggest?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
