HadrienG2 commented on issue #5700:
URL: https://github.com/apache/arrow-rs/issues/5700#issuecomment-2082403812

   Thanks for your reply!
   
   > I believe the narrow crate has some similar goals and might be worth 
checking out.
   
   I may eventually be taking this to narrow instead, then, but I thought it 
might be worthwhile to explore whether the One True Arrow implementation from 
Apache wants it first. Your implementation has the most visibility, it's what 
the Apache Arrow web page links to, and it's what people wanting to know what 
arrow feels like in Rust will end up finding first. It is therefore sad that it 
has such bad UX.
   
   ---
   
   > API compatibility: we still struggle to preserve this, but it gets 
infinitely harder with generics exposing what are often implementation details
   
   I don't see how the API that I proposed so far leaks more implementation 
details than the current design of having `&dyn`s that must be downcast to 
specific types (which, in the eyes of the users, are very much implementation 
details).
   
   Just check out the example in [the documentation of 
`StructBuilder`](https://docs.rs/arrow-array/51.0.0/arrow_array/builder/struct.StructBuilder.html),
 specifically this comment:
   
   > We can't obtain the ListBuilder<StructBuilder> with the expected generic 
types, because under the hood
   > the StructBuilder was returned as a Box<dyn ArrayBuilder> and passed as 
such to the ListBuilder constructor
   
   If that isn't implementation details leaking through, I don't know what 
is! :)
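   
   To make this concrete, here is a minimal std-only sketch of the pattern I am 
complaining about. `ErasedBuilder`, `IntBuilder`, `TextBuilder` and 
`ErasedStructBuilder` are hypothetical stand-ins that I made up for 
illustration, not arrow's actual types. The point is only that the caller has 
to name the concrete child type, and a wrong guess surfaces at run time.

```rust
use std::any::Any;

// Hypothetical stand-in for a type-erased builder trait (not arrow's real API).
trait ErasedBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any;
    fn len(&self) -> usize;
}

#[derive(Default)]
struct IntBuilder {
    values: Vec<i32>,
}

#[derive(Default)]
struct TextBuilder {
    values: Vec<String>,
}

impl ErasedBuilder for IntBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

impl ErasedBuilder for TextBuilder {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

// A struct-like builder that only stores erased children.
struct ErasedStructBuilder {
    fields: Vec<Box<dyn ErasedBuilder>>,
}

impl ErasedStructBuilder {
    // The caller must spell out the concrete child type `T`.
    // A mismatch is only detected at run time and yields `None`.
    fn field<T: ErasedBuilder + 'static>(&mut self, i: usize) -> Option<&mut T> {
        self.fields.get_mut(i)?.as_any_mut().downcast_mut::<T>()
    }
}

// Returns (right guess succeeded, wrong guess succeeded, length of field 0).
fn demo_downcast() -> (bool, bool, usize) {
    let fields: Vec<Box<dyn ErasedBuilder>> = vec![
        Box::new(IntBuilder::default()),
        Box::new(TextBuilder::default()),
    ];
    let mut b = ErasedStructBuilder { fields };
    let right_guess = if let Some(ints) = b.field::<IntBuilder>(0) {
        ints.values.push(42);
        true
    } else {
        false
    };
    // Field 0 is an IntBuilder, so asking for a TextBuilder fails at run time.
    let wrong_guess = b.field::<TextBuilder>(0).is_some();
    (right_guess, wrong_guess, b.fields[0].len())
}
```

   Nothing in the signature of `field` tells the user which `T` is valid for 
which index, which is what I mean by the concrete types being implementation 
details that the user is nevertheless forced to know.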
   
   ---
   
   > Type Erasure: especially when working on query engines, you very often 
don't know and don't want to have to know what the type of something is. 
Downcasting to the concrete type is obnoxious, verbose, and macro heavy
   
   I agree that having a type erased API is not bad per se. What I disagree 
with is having that as the only option.
   
   ---
   
   > Compilation Time: by not exposing the generics we can ensure they get 
instantiated once, and use tricks to reduce the amount of code that gets 
generated. Moving away from generics significantly improved compilation times
   
   Generic code that is written with compile-time performance in mind (with 
dynamic dispatch in non-perf-critical sections) only significantly increases 
compilation time if it is instantiated a lot, for a lot of different concrete 
arguments. This is typically a problem for APIs that take `impl Fn()`, are 
used with many different callback types, and cannot afford to use `&dyn` for 
performance reasons, like iterator adapters.
   
   I don't believe this is as much of a problem for container-like types, 
however, which are what you are building in arrow. The reason is that 
container types tend to be instantiated with the same arguments over and over, 
or in our case to have recursive instantiations that are themselves 
instantiated with the same arguments. The compiler knows how to deduplicate 
such instances.
   
   Further, in a container type whose primary purpose is I/O, like Arrow, you 
can afford to use a lot more dynamic dispatch internally than a general-purpose 
container like `Vec` or `BTreeMap` would, further reducing the cost of 
instantiating a generic type with a new type argument.
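   
   As a rough illustration of how cheap a new instantiation can be made, here 
is a sketch of a closely related technique: a thin generic wrapper over a 
non-generic core (no dynamic dispatch is needed in this small case). All names 
(`RawColumn`, `Fixed`, `TypedColumn`) are hypothetical and have nothing to do 
with arrow's internals. Only the few-line `push`/`get` shims are generic, 
while the storage logic is written once against raw bytes.

```rust
use std::marker::PhantomData;

// Non-generic core: compiled once, however many element types are used.
struct RawColumn {
    bytes: Vec<u8>,
    width: usize,
}

impl RawColumn {
    fn push_bytes(&mut self, item: &[u8]) {
        assert_eq!(item.len(), self.width);
        self.bytes.extend_from_slice(item);
    }
    fn len(&self) -> usize {
        self.bytes.len() / self.width
    }
    fn get_bytes(&self, i: usize) -> &[u8] {
        &self.bytes[i * self.width..(i + 1) * self.width]
    }
}

// Hypothetical trait describing fixed-width elements (at most 16 bytes here).
trait Fixed: Sized {
    const WIDTH: usize;
    fn write(self, out: &mut [u8]);
    fn read(src: &[u8]) -> Self;
}

impl Fixed for i32 {
    const WIDTH: usize = 4;
    fn write(self, out: &mut [u8]) {
        out.copy_from_slice(&self.to_le_bytes());
    }
    fn read(src: &[u8]) -> Self {
        let mut a = [0u8; 4];
        a.copy_from_slice(src);
        i32::from_le_bytes(a)
    }
}

impl Fixed for f64 {
    const WIDTH: usize = 8;
    fn write(self, out: &mut [u8]) {
        out.copy_from_slice(&self.to_le_bytes());
    }
    fn read(src: &[u8]) -> Self {
        let mut a = [0u8; 8];
        a.copy_from_slice(src);
        f64::from_le_bytes(a)
    }
}

// Thin generic shell: each new `T` only instantiates these small shims.
struct TypedColumn<T: Fixed> {
    raw: RawColumn,
    _elem: PhantomData<T>,
}

impl<T: Fixed> TypedColumn<T> {
    fn new() -> Self {
        let raw = RawColumn { bytes: Vec::new(), width: T::WIDTH };
        Self { raw, _elem: PhantomData }
    }
    fn push(&mut self, value: T) {
        let mut scratch = [0u8; 16]; // large enough for the widths above
        value.write(&mut scratch[..T::WIDTH]);
        self.raw.push_bytes(&scratch[..T::WIDTH]);
    }
    fn get(&self, i: usize) -> T {
        T::read(self.raw.get_bytes(i))
    }
    fn len(&self) -> usize {
        self.raw.len()
    }
}
```

   Using `TypedColumn<i32>` and `TypedColumn<f64>` side by side duplicates the 
small shims, not `RawColumn`, so the per-type compilation cost stays small.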
   
   ---
   
   > Codegen: LLVM is a very finicky beast, and a lot of care has gone into 
making sure it properly vectorises code. This is very hard to do with generics 
that are instantiated in the call site
   
   Here, you need to bear in mind that the API which I am proposing can be 
built as a relatively thin layer on top of the existing arrow code. Most of 
the complexity would be in the trait machinery required to turn e.g. a push of 
a tuple into multiple pushes of the tuple's elements followed by a finish.
   
   This would largely be resolved at compile time, with run-time mostly 
dispatching into the optimized code that you already have. And in cases where 
it's not sufficient (e.g. complex layouts like list of tuples of lists), I'm 
proposing lower-level APIs that let you more directly target the underlying 
ListBuilders and StructBuilders for performance, at the expense of API 
ergonomics.
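   
   Here is a rough std-only sketch of the kind of trait machinery I have in 
mind. It is not a proposal for the final API: `ColumnBuilder`, `Row` and 
`TypedBuilder` are hypothetical names, and `ColumnBuilder` merely stands in 
for whichever existing optimized builder would sit underneath. The tuple impl 
is what turns one push of a tuple into one push per element, and the nesting 
is resolved entirely at compile time.

```rust
// Hypothetical stand-in for an existing, already-optimized column builder.
#[derive(Default)]
struct ColumnBuilder<T> {
    values: Vec<T>,
}

impl<T> ColumnBuilder<T> {
    fn append(&mut self, value: T) {
        self.values.push(value);
    }
    fn finish(self) -> Vec<T> {
        self.values
    }
}

// Maps a row type to the builder(s) that store it column by column.
trait Row: Sized {
    type Builder: Default;
    type Columns;
    fn push_into(self, builder: &mut Self::Builder);
    fn finish(builder: Self::Builder) -> Self::Columns;
}

// Scalars map to a single column.
macro_rules! scalar_row {
    ($($t:ty),*) => {$(
        impl Row for $t {
            type Builder = ColumnBuilder<$t>;
            type Columns = Vec<$t>;
            fn push_into(self, builder: &mut Self::Builder) {
                builder.append(self);
            }
            fn finish(builder: Self::Builder) -> Self::Columns {
                builder.finish()
            }
        }
    )*};
}
scalar_row!(i32, f64, bool);

// A pair maps to a pair of builders; pushing a pair pushes each element.
impl<A: Row, B: Row> Row for (A, B) {
    type Builder = (A::Builder, B::Builder);
    type Columns = (A::Columns, B::Columns);
    fn push_into(self, builder: &mut Self::Builder) {
        self.0.push_into(&mut builder.0);
        self.1.push_into(&mut builder.1);
    }
    fn finish(builder: Self::Builder) -> Self::Columns {
        (A::finish(builder.0), B::finish(builder.1))
    }
}

// User-facing typed builder: a thin layer over the per-column builders.
struct TypedBuilder<R: Row> {
    inner: R::Builder,
}

impl<R: Row> TypedBuilder<R> {
    fn new() -> Self {
        Self { inner: Default::default() }
    }
    fn push(&mut self, row: R) {
        row.push_into(&mut self.inner);
    }
    fn finish(self) -> R::Columns {
        R::finish(self.inner)
    }
}
```

   With this, `TypedBuilder::<(i32, (f64, bool))>::new()` accepts rows such as 
`(1, (0.5, true))` and `finish()` hands back one column per leaf element, 
without any downcasting on the user's side.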
   
   ---
   
   > My 2 cents is that arrow makes little sense in statically typed contexts, 
specialized code will almost always win out both for ergonomics and 
performance. Perhaps using crates like serde-arrow to convert back and forth 
where necessary
   
   For sure, nothing beats specialized code, but efficient I/O is also a lot 
of code to write. You obviously put a lot of care into having an efficient 
on-disk format and I/O patterns in arrow, and it would be a shame not to reuse 
them. Or is there an alternative columnar I/O library with support for nested 
structs and lists that you can suggest, which has equally developed Rust 
support?

