andygrove opened a new issue #900:
URL: https://github.com/apache/arrow-datafusion/issues/900


   **Describe the bug**
   We push both pruning predicates and limits down to ParquetExec, but it seems like the combination could be unsafe, or perhaps I am not understanding the logic correctly.
   
   Given the predicate `x > 10` and a limit of 10, suppose we have the 
following 2 partitions:
   
   - Partition 0 has 100 rows and 5 rows match `x > 10`
   - Partition 1 has 100 rows and 5 rows match `x > 10` (see the toy sketch below)
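
   To make the concern concrete, here is a toy sketch (standalone Rust, not the actual DataFusion code) of how counting raw row-group rows against the limit can stop the scan before enough matching rows have been read:

   ```rust
   // Toy illustration only: with a limit of 10, the raw-row-count check
   // stops after partition 0 even though only 5 of its rows satisfy
   // `x > 10`, so fewer matching rows than the limit are ever read.
   fn main() {
       let limit: i64 = 10;
       // (total rows, rows matching `x > 10`) per partition
       let partitions = vec![(100_i64, 5_i64), (100, 5)];

       let mut num_rows = 0;      // raw rows counted by the loop
       let mut matching_rows = 0; // rows the predicate would actually keep

       for (total, matching) in partitions {
           num_rows += total;
           matching_rows += matching;
           if num_rows >= limit {
               break; // stops after partition 0
           }
       }

       // Only 5 matching rows are available, although the query asked for up to 10.
       assert_eq!(matching_rows, 5);
   }
   ```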
   
   As we iterate over the row groups, we accumulate the raw row count:
   
   ```rust
   for row_group_meta in meta_data.row_groups() {
       num_rows += row_group_meta.num_rows();
       // ...
   }
   ```
   
   We break out of this loop once the limit is reached, based on `num_rows`:
   
   ```rust
   if limit.map(|x| num_rows >= x as i64).unwrap_or(false) {
       limit_exhausted = true;
       break;
   }
   ```
   
   This stops processing the file once the raw row count reaches the limit, without considering how many of those rows the predicate would actually match.
   
   Finally, we stop processing the remaining partitions as well, here:
   
   ```rust
   // remove files that are not needed in case of limit
   filenames.truncate(total_files);
   partitions.push(ParquetPartition::new(filenames, statistics));
   if limit_exhausted {
       break;
   }
   ```
   
   **To Reproduce**
   When I have time, I will write a test to confirm whether there is an issue here.
   
   **Expected behavior**
   Perhaps we should not push the limit down when we are also pushing predicates down?
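
   For illustration only, here is a minimal sketch of that idea; the names below (`limit_for_scan`, `has_pruning_predicate`) are hypothetical and not part of the actual ParquetExec code:

   ```rust
   /// Hypothetical helper: only push the limit down to the Parquet scan when
   /// there is no pruning predicate, since raw row-group counts do not bound
   /// the number of rows that will survive the predicate.
   fn limit_for_scan(limit: Option<usize>, has_pruning_predicate: bool) -> Option<usize> {
       if has_pruning_predicate {
           // Let the limit operator above the scan enforce the limit instead.
           None
       } else {
           limit
       }
   }
   ```

   With something like this, the scan would keep reading until the predicate has produced enough rows, and the limit operator above the scan would still cap the output.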
   
   **Additional context**
   N/A
   

