Ten0 commented on issue #4886:
URL: https://github.com/apache/arrow-rs/issues/4886#issuecomment-1894684487
> typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc... This would obviously not be possible with the serde model which is inherently value-oriented, but it is also possible that the nature of the avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach.

Oh, that's interesting! I would have imagined that we would prepare all the column vectors up front and push to each of them as we read each field (first sketch at the end of this comment). What are you referring to with regard to per-row overheads? (I'd like to read documentation on this topic; I'm familiar with branch prediction but not with this.)

That being said, with Avro's encoding, where you have to fully deserialize every field of a record before you know where the next record starts, plus the block encoding with compression, it's very hard for me to imagine that reading the input several times to extract a single field on each pass would be the most performant approach. (But even if it were, that would look very much like driving the deserializer multiple times, just ignoring all fields but one each time.)

> I'll try to get what I have polished up over the next few days, and we can compare benchmarks.

Wonderful! 😊

> Here is the way it is implemented in datafusion https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion/core/src/datasource/avro_to_arrow/reader.rs#L160C1-L168C2

So IIUC, the interface we'd want is basically something that converts an arbitrary `(Buf)Read` into something that yields [`RecordBatch`](https://docs.rs/datafusion/latest/datafusion/common/arrow/array/struct.RecordBatch.html)es (second sketch below)? If somebody confirms that I'm not on the wrong track here, I may give a go at implementing something for this based on Serde, onto which we could then plug Avro support via `serde_avro_fast` (and if plain Serde isn't enough, maybe add a pluggable part for Arrow schema specification that could be populated by reading the Avro schema). 🙂

Side note: how many `RecordBatch`es? Why not just one? How does one choose this? Is it because DataFusion wants to be able to process very large files by stream-processing the batches?

Side note 2: I will also have a look at [`serde_arrow`](https://lib.rs/crates/serde_arrow) for this purpose. I'm not sure to what extent that implementation is currently optimal for this, but it seems to be under active development, and fundamentally `serde_avro_fast` → `serde_transcode` → `serde_arrow` looks like precisely the pipeline I'd be looking for (third sketch below). If that is the case, the implementation would be sooo simple 😄 (My first glance has me wondering why [the implementation is so complex](https://github.com/chmp/serde_arrow/blob/519c6ee4ae74904b17b12616c8400e83ab206faf/serde_arrow/src/arrow_impl/api.rs#L331-L336), but then I don't know too much about constructing Arrow values.)
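For concreteness, here is a minimal sketch of the row-oriented approach I had imagined, using the arrow-rs builder APIs; the `Row` struct and the two-column schema are hypothetical stand-ins:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Builder, StringBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hypothetical decoded record; with Avro the fields would be read
/// one after the other straight off the (decompressed) block.
struct Row {
    id: i64,
    name: String,
}

fn rows_to_batch(rows: &[Row]) -> arrow::error::Result<RecordBatch> {
    // Prepare one builder ("vector") per column up front.
    let mut ids = Int64Builder::with_capacity(rows.len());
    let mut names = StringBuilder::new();

    // Row-oriented pass: as each field of a row is decoded,
    // push it onto the corresponding column builder.
    for row in rows {
        ids.append_value(row.id);
        names.append_value(&row.name);
    }

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(ids.finish()) as ArrayRef,
            Arc::new(names.finish()) as ArrayRef,
        ],
    )
}
```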
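Second, the interface shape I have in mind, mirroring the datafusion reader linked above; all names here are hypothetical and the body is elided:

```rust
use std::io::BufRead;

use arrow::error::Result;
use arrow::record_batch::RecordBatch;

/// Hypothetical reader: wraps any `BufRead` over an Avro object
/// container file and yields `RecordBatch`es of at most
/// `batch_size` rows each, so arbitrarily large files can be
/// stream-processed batch by batch.
struct AvroArrowReader<R: BufRead> {
    reader: R,
    batch_size: usize,
}

impl<R: BufRead> Iterator for AvroArrowReader<R> {
    type Item = Result<RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
        // Sketch only: decode up to `self.batch_size` records from
        // `self.reader` into per-column builders, assemble them into
        // a `RecordBatch`, and return `None` once the input is
        // exhausted.
        todo!()
    }
}
```

If batching is indeed about bounding memory while streaming large files, then `batch_size` would be the knob the caller uses to trade per-batch overhead against peak memory.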
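And third, if the `serde_avro_fast` → `serde_transcode` → `serde_arrow` route works out, the core glue would be a single `serde_transcode::transcode` call. How each crate constructs its (de)serializer is an assumption I haven't verified, so this sketch stays generic over the serde traits:

```rust
use serde::{Deserializer, Serializer};

/// Drive an Avro deserializer (e.g. one built by `serde_avro_fast`)
/// directly into an Arrow-building serializer (e.g. one provided by
/// `serde_arrow`), without materializing an intermediate value tree.
fn transcode_one_record<'de, D, S>(avro_de: D, arrow_ser: S) -> Result<S::Ok, S::Error>
where
    D: Deserializer<'de>,
    S: Serializer,
{
    serde_transcode::transcode(avro_de, arrow_ser)
}
```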
