Rachelint commented on issue #11281: URL: https://github.com/apache/datafusion/issues/11281#issuecomment-2212499867
> Yes - the primary (initial) problem is that the collection needs to be built so that it owns the references to the items but we want to do that without creating any intermediate values.
>
> I would also have naively expected that `from_iter` should perform comparably. I did notice that there are two implementations for GenericByteArray and I'm not clear which one would be chosen here: https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552
>
> The choice depends on lifetimes, so perhaps the other one is being invoked and it's not pre-allocating capacity in as efficient a way?
>
> In general, I think you're right that we should be able to eliminate all intermediate vectors.

:joy: Sorry, it seems my earlier statement caused some misunderstanding. After profiling, the reason the `from_iter` case is slow appears to be the `to_string()` call, which I did not notice when I first read the code.

But something still confuses me. When using the builder directly, looping in the `flatten` way is clearly slower than looping in the `nested` way:

- flatten case
  https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L890-L893
  ```
  Extract data page statistics for String/extract_statistics/String time: [43.454 µs 43.463 µs 43.474 µs]
  ```
- nested case
  https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L909-L913
  ```
  Extract data page statistics for String/extract_statistics/String time: [33.108 µs 33.119 µs 33.132 µs]
  ```

I can understand that `flatten` carries some cost, but it seems it should not be this large?
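For readers who do not want to open the linked lines, here is a minimal std-only sketch of the two loop shapes being compared. It is not the code from the branch: `ToyStringBuilder`, `build_flatten`, and `build_nested` are hypothetical names, and the toy builder only imitates the value-buffer-plus-offsets layout of a string builder. It shows that both loops produce the same result, so any difference would be in iteration overhead. It does not reproduce the timings above, and the thread does not confirm the cause.

```rust
// Hypothetical stand-in for a string array builder: one contiguous value
// buffer, an offsets vector, and a validity flag per appended item.
// This is NOT arrow-rs's StringBuilder, only an illustration of the layout.
#[derive(Debug, PartialEq)]
struct ToyStringBuilder {
    values: Vec<u8>,
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl ToyStringBuilder {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0], validity: Vec::new() }
    }

    // Append one optional string; a null still pushes an offset.
    fn append_option(&mut self, v: Option<&str>) {
        if let Some(s) = v {
            self.values.extend_from_slice(s.as_bytes());
        }
        self.offsets.push(self.values.len());
        self.validity.push(v.is_some());
    }
}

// "flatten" shape: a single loop over a flattened iterator.
fn build_flatten(pages: &[Vec<Option<String>>]) -> ToyStringBuilder {
    let mut builder = ToyStringBuilder::new();
    for v in pages.iter().flatten() {
        // `as_deref` borrows the String as &str, so no `to_string()` copy.
        builder.append_option(v.as_deref());
    }
    builder
}

// "nested" shape: an explicit inner loop per page.
fn build_nested(pages: &[Vec<Option<String>>]) -> ToyStringBuilder {
    let mut builder = ToyStringBuilder::new();
    for page in pages {
        for v in page {
            builder.append_option(v.as_deref());
        }
    }
    builder
}

fn main() {
    let pages = vec![
        vec![Some("a".to_string()), None],
        vec![],
        vec![Some("bc".to_string())],
    ];
    // Both loop shapes must build identical buffers.
    assert_eq!(build_flatten(&pages), build_nested(&pages));
    println!("appended {} items", build_flatten(&pages).validity.len());
}
```

One possible contributor, offered only as a guess: a `for` loop over `flatten()` calls `next()` on the adapter for every item, and each call has to check whether the current inner iterator is exhausted, whereas the nested form gives the compiler two plain slice loops. Benchmarking both shapes on the real builder is the only way to tell how much of the gap this explains.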
