Rachelint commented on issue #11281:
URL: https://github.com/apache/datafusion/issues/11281#issuecomment-2212499867

   > Yes - the primary (initial) problem is that the collection needs to be 
built so that it owns the references to the items but we want to do that 
without creating any intermediate values.
   > 
   > I would also have naively expected that `from_iter` should perform 
comparably. I did notice that there are two implementations for 
GenericByteArray and I'm not clear which one would be chosen here: 
https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_array.rs#L534-L552
   > 
   > The choice depends on lifetimes, so perhaps the other one is being invoked 
and it's not pre-allocating capacity in as efficient a way?
   > 
   > In general, I think you're right that we should be able to eliminate all 
intermediate vectors.
   
   :joy: Sorry, it seems my earlier statement caused some misunderstanding.
   After profiling, the `from_iter` case appears to be slow because of the 
`to_string()` call, which I did not notice when I first started reading the 
code.
   But something still confuses me: when using the builder directly, 
looping with `flatten` is noticeably slower than looping with nested loops:
   - flatten case
   
https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L890-L893
   ```
   Extract data page statistics for String/extract_statistics/String
                           time:   [43.454 µs 43.463 µs 43.474 µs]
   ```
   - nested case
   
https://github.com/Rachelint/arrow-datafusion/blob/70b9f05e737e81c514259e70a8bd2e9f0ad8e725/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L909-L913
   ```
   Extract data page statistics for String/extract_statistics/String
                           time:   [33.108 µs 33.119 µs 33.132 µs]
   ```
   I can understand that `flatten` adds some cost, but it seems it should 
not be this large (about 43.5 µs vs. 33.1 µs)?
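
   To make the two loop shapes concrete, here is a minimal std-only sketch. 
It is not the benchmark code linked above and does not use arrow-rs: 
`SketchStringBuilder`, `build_flatten`, `build_flatten_for_each` and 
`build_nested` are hypothetical names, and the input shape (one 
`Vec<Option<String>>` per page) is an assumption for illustration only.

```rust
/// Hypothetical stand-in for a string array builder (NOT the arrow-rs API):
/// values are packed into one byte buffer with an offset and validity flag per slot.
struct SketchStringBuilder {
    data: Vec<u8>,
    offsets: Vec<usize>,
    valid: Vec<bool>,
}

impl SketchStringBuilder {
    fn new() -> Self {
        // The offsets buffer always starts with a leading 0.
        Self { data: Vec::new(), offsets: vec![0], valid: Vec::new() }
    }

    /// Appends a borrowed value; no intermediate `String` is created.
    fn append_option(&mut self, value: Option<&str>) {
        if let Some(s) = value {
            self.data.extend_from_slice(s.as_bytes());
        }
        self.offsets.push(self.data.len());
        self.valid.push(value.is_some());
    }
}

/// "flatten" shape: a single `for` loop over a flattened iterator.
/// Each step goes through `Flatten::next`, which has to check whether the
/// current inner iterator is exhausted before yielding.
fn build_flatten(pages: &[Vec<Option<String>>]) -> SketchStringBuilder {
    let mut builder = SketchStringBuilder::new();
    for value in pages.iter().flatten() {
        builder.append_option(value.as_deref());
    }
    builder
}

/// Same flattened iterator, but driven with `for_each` (internal iteration)
/// instead of repeated `next` calls.
fn build_flatten_for_each(pages: &[Vec<Option<String>>]) -> SketchStringBuilder {
    let mut builder = SketchStringBuilder::new();
    pages
        .iter()
        .flatten()
        .for_each(|value| builder.append_option(value.as_deref()));
    builder
}

/// "nested" shape: two explicit loops, each over a plain slice iterator.
fn build_nested(pages: &[Vec<Option<String>>]) -> SketchStringBuilder {
    let mut builder = SketchStringBuilder::new();
    for page in pages {
        for value in page {
            builder.append_option(value.as_deref());
        }
    }
    builder
}
```

   All three functions build identical buffers, so any timing difference 
comes only from how the iteration is driven. If the `for` loop over 
`flatten()` is the slow part, the `for_each` variant may be worth 
benchmarking as well, since it can sometimes be optimized more like the 
nested loops; that is a guess to verify, not a measured result.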
   


