Ten0 commented on issue #4886:
URL: https://github.com/apache/arrow-rs/issues/4886#issuecomment-1894684487

   >  typically the way to achieve performant decode is to decode the values 
for a column at a time, as this allows amortizing per-row overheads, reducing 
branch misses, etc... This would obviously not be possible with the serde model 
which is inherently value-oriented, but it is also possible that the nature of 
the avro encoding, which relies extensively on varint encoding, reduces the 
benefits of such a columnar approach.
   
   Oh, that's interesting! I would have imagined that we would prepare all 
the vectors up front and push to each of them as we read each field. What are 
you referring to with regard to per-row overheads? (I'd like to read 
documentation on this topic; I'm familiar with branch prediction but not with 
this.)
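   
   To make sure I'm picturing this correctly, here is a toy sketch of the two 
decode shapes as I understand them (purely illustrative: a fixed-width 
two-column format rather than anything Avro-specific, and made-up function 
names):
   
   ```rust
   // Row-oriented: one pass over the input, but each iteration interleaves
   // work for every column (more branching per row).
   fn decode_rows(input: &[i32], a: &mut Vec<i32>, b: &mut Vec<i32>) {
       for row in input.chunks_exact(2) {
           a.push(row[0]);
           b.push(row[1]);
       }
   }
   
   // Column-oriented: one tight loop per column; each loop does a single
   // kind of work, which amortizes per-row overheads and vectorizes better.
   fn decode_columns(input: &[i32], a: &mut Vec<i32>, b: &mut Vec<i32>) {
       a.extend(input.iter().step_by(2));
       b.extend(input.iter().skip(1).step_by(2));
   }
   
   fn main() {
       let input = [1, 10, 2, 20, 3, 30];
       let (mut a1, mut b1) = (Vec::new(), Vec::new());
       decode_rows(&input, &mut a1, &mut b1);
       let (mut a2, mut b2) = (Vec::new(), Vec::new());
       decode_columns(&input, &mut a2, &mut b2);
       assert_eq!((a1, b1), (a2, b2)); // both yield a = [1, 2, 3], b = [10, 20, 30]
   }
   ```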
   
   That being said, with Avro's encoding, where you have to fully 
deserialize each field of a record before you know where the next record 
starts, plus the block encoding and the compression, it's very hard for me to 
imagine that reading the input several times to extract a single field on 
each pass would be the most performant approach. (And even if it were, that 
would look very much like driving the deserializer multiple times, just 
ignoring all the fields but one each time.)
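   
   For illustration, here is a minimal sketch of Avro's zigzag-varint 
decoding (written from the spec, not taken from any crate), which is what 
makes field boundaries data-dependent:
   
   ```rust
   /// Decode one Avro `long`: base-128 varint, least-significant group first,
   /// then zigzag. Returns the value and the number of bytes consumed.
   fn read_varint_long(input: &[u8]) -> Option<(i64, usize)> {
       let mut value: u64 = 0;
       for (i, &byte) in input.iter().enumerate().take(10) {
           value |= u64::from(byte & 0x7f) << (7 * i);
           if byte & 0x80 == 0 {
               // Zigzag-decode: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
               let decoded = ((value >> 1) as i64) ^ -((value & 1) as i64);
               return Some((decoded, i + 1));
           }
       }
       None // truncated or over-long input
   }
   
   fn main() {
       // 1 fits in one byte, 1_000_000 takes three: the offset of the next
       // field (and hence of the next record) depends on the decoded data.
       assert_eq!(read_varint_long(&[0x02]), Some((1, 1)));
       assert_eq!(read_varint_long(&[0x80, 0x89, 0x7a]), Some((1_000_000, 3)));
   }
   ```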
   
   > I'll try to get what I have polished up over the next few days, and we can 
compare benchmarks.
   
   Wonderful! 😊
   
   > Here is the way it is implemented in datafusion
   
   
https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion/core/src/datasource/avro_to_arrow/reader.rs#L160C1-L168C2
   
   So IIUC, the interface we'd want is basically something that converts an 
arbitrary `(Buf)Read` into something that yields 
[`RecordBatch`](https://docs.rs/datafusion/latest/datafusion/common/arrow/array/struct.RecordBatch.html)es?
 If somebody confirms that I'm not on the wrong track here, I may give 
implementing this a go based on Serde, onto which we could then plug Avro 
support, notably via `serde_avro_fast` (and if that doesn't work with Serde 
alone, maybe add a pluggable part for Arrow schema specification that could 
be populated by reading the Avro schema). 🙂
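   
   Concretely, I'm picturing a shape roughly like this (all names 
hypothetical, modeled loosely on the DataFusion reader linked above; only the 
`arrow` types are real):
   
   ```rust
   use std::io::BufRead;
   
   use arrow::datatypes::SchemaRef;
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   
   /// Hypothetical reader: wraps any `BufRead` over Avro data and yields
   /// Arrow `RecordBatch`es of at most `batch_size` rows.
   struct AvroReader<R: BufRead> {
       reader: R,
       schema: SchemaRef,
       batch_size: usize,
   }
   
   impl<R: BufRead> AvroReader<R> {
       fn new(reader: R, schema: SchemaRef, batch_size: usize) -> Self {
           Self { reader, schema, batch_size }
       }
   }
   
   impl<R: BufRead> Iterator for AvroReader<R> {
       type Item = Result<RecordBatch, ArrowError>;
   
       fn next(&mut self) -> Option<Self::Item> {
           // Decode up to `batch_size` records from `self.reader` into
           // per-column builders, freeze them into a `RecordBatch`, and
           // return `None` once the input is exhausted.
           todo!("drive the Avro deserializer here")
       }
   }
   ```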
   Side note: how many `RecordBatch`es? Why not just one? How does one choose 
this? Is it because DataFusion wants to be able to process very large files 
by stream-processing the batches?
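   
   If that is the reason, I would expect consumption to look something like 
this (reusing the hypothetical `AvroReader` from the sketch above), with 
memory bounded by `batch_size` rather than by file size:
   
   ```rust
   fn count_rows<R: std::io::BufRead>(
       reader: AvroReader<R>,
   ) -> Result<usize, arrow::error::ArrowError> {
       let mut rows = 0;
       for batch in reader {
           // Each batch is dropped before the next one is decoded.
           rows += batch?.num_rows();
       }
       Ok(rows)
   }
   ```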
   Side note 2: I will notably have a look at 
[`serde_arrow`](https://lib.rs/crates/serde_arrow) for that purpose as well. 
I'm not sure to what extent that implementation is currently optimal for 
this, but it seems to be under active development, and it looks like 
`serde_avro_fast` -> `serde_transcode` -> `serde_arrow` is fundamentally 
precisely what I'd be looking for. If that is the case, the implementation 
would be sooo simple 😄 (My first glance has me wondering why [the 
implementation is so 
complex](https://github.com/chmp/serde_arrow/blob/519c6ee4ae74904b17b12616c8400e83ab206faf/serde_arrow/src/arrow_impl/api.rs#L331-L336),
 but then I don't know much about constructing Arrow values.)
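   
   For reference, the glue in that pipeline would conceptually be just this 
(`serde_transcode::transcode` is real and has this shape; how to obtain the 
two endpoints from `serde_avro_fast` and `serde_arrow` is an assumption I 
would still need to verify against both crates):
   
   ```rust
   use serde::{Deserializer, Serializer};
   
   /// Pipe one value from any serde `Deserializer` (e.g. one Avro record)
   /// into any serde `Serializer` (e.g. something appending a row to Arrow
   /// builders), without materializing an intermediate data structure.
   fn transcode_one<'de, D, S>(avro: D, arrow: S) -> Result<S::Ok, S::Error>
   where
       D: Deserializer<'de>,
       S: Serializer,
   {
       serde_transcode::transcode(avro, arrow)
   }
   ```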

