edmondop commented on issue #8156:
URL: 
https://github.com/apache/arrow-datafusion/issues/8156#issuecomment-1810390116

   > > FYI, my first exploration shows that this is not easy to do unless we 
perform `row.clone()`. Is that acceptable from a perfomance standpoint?
   > 
   > We can benchmark them to see the performance changes. For me, I will 
prefer the one faster but not the one has less code duplication.
   
   then I would say we can close this, my initial draft has this approach:
   
   ```rust
   
   trait ArraySetOperationAccumulator<'a> {
       fn process_left(&mut self, row: Row<'a>) -> Result<()>;
       fn process_right(&mut self, row: Row<'a>) -> Result<()>;
       fn rows(&self) -> &[Row<'a>];
       fn clear(&mut self) -> ();
   }
   
   trait ArraySetOperationAccumulatorFactory {
       type Accumulator: for<'a> ArraySetOperationAccumulator<'a>;
   
       fn create_accumulator<'a>(&self) -> Self::Accumulator;
   }
   
   struct ArrayUnionAccumulator<'a> {
       dedup: HashSet<Row<'a>>,
       rows: Vec<Row<'a>>,
   }
   
   fn insert_if_absent<'a>(acc: &mut ArrayUnionAccumulator<'a>, row: Row<'a>) 
-> Result<()> {
       if acc.dedup.insert(row) {
           acc.rows.push(row);
       }
       Ok(())
   }
   
   impl<'a> ArraySetOperationAccumulator<'a> for ArrayUnionAccumulator<'a> {
   
       fn process_left(&mut self, row: Row<'a>) -> Result<()> {
           insert_if_absent(self, row)
       }
   
       fn process_right(&mut self, row: Row<'a>) -> Result<()> {
           insert_if_absent(self, row)
       }
   
       fn rows(&self) -> &[Row<'a>] {
           self.rows.as_slice()
       }
   
       fn clear(&mut self) -> () {
           self.dedup.clear();
           self.rows.clear();
       }
   
   }
   
   
   pub struct ArrayUnionAccumulatorFactory {}
   
   # does not compile
   impl ArraySetOperationAccumulatorFactory for ArrayUnionAccumulatorFactory {
       
       type Accumulator = ArrayUnionAccumulator;
       
       fn create_accumulator<'a>(&self) -> Self::Accumulator {
           ArrayUnionAccumulator {
               dedup: HashSet::new(),
               rows: Vec::new(),
           }
       }
   
   
   }
   
   fn array_set_operation<OffsetSize,AccumulatorFactory>(
       l: &GenericListArray<OffsetSize>,
       r: &GenericListArray<OffsetSize>,
       field: &FieldRef,
       accumulator_factory: AccumulatorFactory,
   ) -> Result<GenericListArray<OffsetSize>>
   where
       OffsetSize: OffsetSizeTrait,
       AccumulatorFactory: ArraySetOperationAccumulatorFactory {
       
       let converter = 
RowConverter::new(vec![SortField::new(l.value_type().clone())])?;
   
       let nulls = NullBuffer::union(l.nulls(), r.nulls());
       let l_values = l.values().clone();
       let r_values = r.values().clone();
       let l_values = converter.convert_columns(&[l_values])?;
       let r_values = converter.convert_columns(&[r_values])?;
           
       let mut accumulator = accumulator_factory.create_accumulator();
       let mut offsets = Vec::<OffsetSize>::with_capacity(l.len() + 1);
       offsets.push(OffsetSize::usize_as(0));
       let mut rows = Vec::with_capacity(l_values.num_rows() + 
r_values.num_rows());
       for (l_w, r_w) in l.offsets().windows(2).zip(r.offsets().windows(2)) {
           let l_slice = l_w[0].as_usize()..l_w[1].as_usize();
           let r_slice = r_w[0].as_usize()..r_w[1].as_usize();
           for i in l_slice {
               let row = l_values.row(i);
               accumulator.process_left(row)?;
           }
           for i in r_slice {
               let row = r_values.row(i);
               accumulator.process_right(row)?;
           }
           rows.extend(accumulator.rows().iter());
           offsets.push(OffsetSize::usize_as(rows.len()));
           accumulator.clear()
       }
   
       let values = converter.convert_rows(rows)?;
       let offsets = OffsetBuffer::new(offsets.into());
       let result = values[0].clone();
       Ok(GenericListArray::<OffsetSize>::new(
           field.clone(),
           offsets,
           result,
           nulls,
       ))
   } 
   ```
   
   the problem with this code is that it's not possible, at least in my 
undrestanding to align the lifetime `<'a>` with the internal lifetime of the 
rows in the function. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to