alamb commented on issue #197:
URL: https://github.com/apache/arrow-rs/issues/197#issuecomment-1008319749


   An update: manually audited all the uses of the  `size_hint` in the arrow 
codebase:
   
   ```shell
   rg -n -H --no-heading -e 'size_hint' $(git rev-parse --show-toplevel || pwd)
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/immutable.rs:308:            
    let (lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:387:        let 
(lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:433:        let 
(_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:475:        let 
(_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:524:        let 
(_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:621:            
let byte_capacity: usize = iterator.size_hint().0.saturating_add(7) / 8;
   /Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:653:              
      iterator.size_hint().0.saturating_add(7) / 8, //convert bit count to byte 
count, rounding up
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:338:       
 let (lower, _) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:376:    
/// I.e. that `size_hint().1` correctly reports its length.
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:384:       
 let (_, upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:959:       
 let (_, upper_size_bound) = value_iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_list.rs:153:        let 
(lower, _) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:236:        
let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:1125:        
let (_, upper_size_bound) = value_iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:271:            
.size_hint()
   /Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:777:            
.size_hint()
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:68:    fn 
size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:140:    fn 
size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:214:    fn 
size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:293:    fn 
size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:370:    fn 
size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/util/trusted_len.rs:36:    let (_, 
upper) = iterator.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/util/bit_chunk_iterator.rs:160:    
fn size_hint(&self) -> (usize, Option<usize>) {
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_boolean.rs:197:        
let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:179:      
  let (lower, _) = it.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:221:      
  let (lower, _) = it.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:174:        
let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:221:        
let (_, data_len) = iter.size_hint();
   /Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:548:        
let (_, upper_size_bound) = string_iter.size_hint();
   ```
   
   My analysis is that any place where `upper` is used to size a buffer, it is 
either: 
   1. Advisory for performance to allocate capacity (like `Vec::with_capacity`) 
but not required for correctness
   2. Used in an `unsafe` block which clearly states it is undefined behavior 
to call the function with an iterator that does not correctly report its size
   3. I did find one issue: https://github.com/apache/arrow-rs/issues/1144 
   
   For 2. above, the`unsafe` code will panic if the iterator's size was 
inaccurate (potentially after writing past valid memory) as an additional 
safeguard
   
   The only thing remaining for me to do on this issue is to verify the uses of 
the unsafe `from_trusted_iter` calls.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to