Dandandan commented on pull request #9036:
URL: https://github.com/apache/arrow/pull/9036#issuecomment-752060487


   An important source of slowness seems to be the use (and inefficiency) of 
creating the `MutableArrayData` structure. In profiling I see a lot of time 
spent in `build_extend`, `freeze`, etc. 
   
   Changing this piece of code to build a `Vec<&ArrayData>` directly gives a 
~10% speedup locally on batches of size 1000 on your branch, @andygrove:
   ```rust
           // Look the field up in the primary batches first, then fall back
           // to the secondary batches; only references are collected.
           let (is_primary, arrays) = match primary[0].schema().index_of(field.name()) {
               Ok(i) => Ok((
                   true,
                   primary
                       .iter()
                       .map(|batch| batch.column(i).data_ref().as_ref())
                       .collect::<Vec<_>>(),
               )),
               Err(_) => match secondary[0].schema().index_of(field.name()) {
                   Ok(i) => Ok((
                       false,
                       secondary
                           .iter()
                           .map(|batch| batch.column(i).data_ref().as_ref())
                           .collect::<Vec<_>>(),
                   )),
                   _ => Err(DataFusionError::Internal(format!(
                       "During execution, the column {} was not found in either the left or right side of the join",
                       field.name()
                   ))),
               },
           }
           .map_err(DataFusionError::into_arrow_external_error)?;
   ```
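
For anyone who wants to try the lookup pattern without the arrow and DataFusion crates, here is a minimal standalone sketch. `Batch`, `ColumnData`, `lookup` and `column_refs` are hypothetical stand-ins invented for this illustration, not the real `RecordBatch` / `ArrayData` API. It only shows the idea of collecting borrowed column data from whichever side has the field, without copying.

```rust
// Stand-in types for illustration only; these are NOT the arrow/DataFusion types.
#[derive(Debug, PartialEq)]
pub struct ColumnData(pub Vec<i32>);

pub struct Batch {
    pub names: Vec<String>,
    pub columns: Vec<ColumnData>,
}

impl Batch {
    /// Position of the column called `name`, if present.
    pub fn index_of(&self, name: &str) -> Option<usize> {
        self.names.iter().position(|n| n == name)
    }
}

/// Borrow the column called `name` from every batch in `batches`.
/// The first batch's layout is assumed to hold for all batches.
/// Unlike indexing with `[0]`, an empty slice yields `None` instead of panicking.
fn lookup<'a>(batches: &'a [Batch], name: &str) -> Option<Vec<&'a ColumnData>> {
    let i = batches.first()?.index_of(name)?;
    Some(batches.iter().map(|b| &b.columns[i]).collect())
}

/// Try `primary` first, then `secondary`.
/// Returns `(is_primary, references)`; no column data is copied.
pub fn column_refs<'a>(
    primary: &'a [Batch],
    secondary: &'a [Batch],
    name: &str,
) -> Result<(bool, Vec<&'a ColumnData>), String> {
    if let Some(refs) = lookup(primary, name) {
        return Ok((true, refs));
    }
    if let Some(refs) = lookup(secondary, name) {
        return Ok((false, refs));
    }
    Err(format!(
        "column {} was not found in either side of the join",
        name
    ))
}
```

The returned references borrow from the input slices, so the result cannot outlive the batches it was collected from.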

