adriangb commented on code in PR #18868:
URL: https://github.com/apache/datafusion/pull/18868#discussion_r2689131930


##########
datafusion/datasource-parquet/src/row_group_filter.rs:
##########
@@ -70,6 +79,109 @@ impl RowGroupAccessPlanFilter {
         self.access_plan
     }
 
+    /// Returns the is_fully_matched vector
+    pub fn is_fully_matched(&self) -> &Vec<bool> {
+        &self.is_fully_matched
+    }
+
+    /// Prunes the access plan based on the limit and fully contained row groups.
+    ///
+    /// The pruning works by leveraging the concept of fully matched row groups. Consider a query like:
+    /// `WHERE species LIKE 'Alpine%' AND s >= 50 LIMIT N`
+    ///
+    /// After initial filtering, row groups can be classified into three states:
+    ///
+    /// 1. Not Matching / Pruned
+    /// 2. Partially Matching (Row Group/Page contains some matches)
+    /// 3. Fully Matching (Entire range is within predicate)
+    ///
+    /// +-----------------------------------------------------------------------+
+    /// |                            NOT MATCHING                               |
+    /// |  Row group 1                                                          |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// |  | SPECIES                           | S                           |  |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// |  | Snow Vole                         | 7                           |  |
+    /// |  | Brown Bear                        | 133 ✅                      |  |
+    /// |  | Gray Wolf                         | 82  ✅                      |  |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// +-----------------------------------------------------------------------+
+    ///
+    /// +---------------------------------------------------------------------------+
+    /// |                          PARTIALLY MATCHING                               |
+    /// |                                                                           |
+    /// |  Row group 2                              Row group 4                     |
+    /// |  +------------------+--------------+      +------------------+----------+ |
+    /// |  | SPECIES          | S            |      | SPECIES          | S        | |
+    /// |  +------------------+--------------+      +------------------+----------+ |
+    /// |  | Lynx             | 71 ✅        |      | Europ. Mole      | 4        | |
+    /// |  | Red Fox          | 40           |      | Polecat          | 16       | |
+    /// |  | Alpine Bat  ✅   | 6            |      | Alpine Ibex ✅  | 97 ✅    | |
+    /// |  +------------------+--------------+      +------------------+----------+ |
+    /// +---------------------------------------------------------------------------+
+    ///
+    /// +-----------------------------------------------------------------------+
+    /// |                           FULLY MATCHING                              |
+    /// |  Row group 3                                                          |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// |  | SPECIES                           | S                           |  |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// |  | Alpine Ibex  ✅                  | 101    ✅                   |  |
+    /// |  | Alpine Goat  ✅                  | 76     ✅                   |  |
+    /// |  | Alpine Sheep ✅                  | 83     ✅                   |  |
+    /// |  +-----------------------------------+-----------------------------+  |
+    /// +-----------------------------------------------------------------------+

Review Comment:
   Can you edit the example to use truncation length 3? Length 6 is conveniently
the same length as the needle (`Alpine`), and I think it's important to show what
happens when the truncation is shorter. Also, I think `col like 'foo%'` does not
generate a pruning predicate involving `like`: `predicate=species@0 LIKE Alpine%,
pruning_predicate=species_null_count@2 != row_count@3 AND species_min@0 <=
Alpinf AND Alpine <= species_max@1`. So in this case we'd end up with `NOT
('Alp' <= 'Alpinf' AND 'Alpine' <= 'Alq')` (those are the truncated stats).
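
   To make the truncated-stats case concrete, here is a minimal sketch that
evaluates only the min/max comparisons from the pruning predicate above, using
Rust's byte-wise `&str` ordering. The helper name `range_may_contain_alpine` is
hypothetical; this is not DataFusion's `PruningPredicate` API, and it ignores
the null-count term.

```rust
/// Hypothetical helper (not a DataFusion API): evaluates the min/max part of
/// the pruning predicate quoted above,
/// `species_min <= 'Alpinf' AND 'Alpine' <= species_max`.
fn range_may_contain_alpine(species_min: &str, species_max: &str) -> bool {
    species_min <= "Alpinf" && "Alpine" <= species_max
}

fn main() {
    // Assumed stats truncated to length 3, as in the comment:
    // min "Alp", max "Alq".
    let may_match = range_may_contain_alpine("Alp", "Alq");
    // The negated form from the comment:
    // NOT ('Alp' <= 'Alpinf' AND 'Alpine' <= 'Alq').
    let negated = !may_match;
    println!("may_match = {may_match}, negated = {negated}");

    // A value such as "Alpaca" lies inside the truncated range [Alp, Alq]
    // but does not start with "Alpine".
    let candidate = "Alpaca";
    let in_range = "Alp" <= candidate && candidate <= "Alq";
    println!(
        "{candidate}: in_range = {in_range}, has_prefix = {}",
        candidate.starts_with("Alpine")
    );
}
```

   With these assumed stats the comparison holds, so the negated expression is
false, even though a value like `Alpaca` fits inside the truncated range
without matching `Alpine%`. That is the case I'd like the example to cover.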



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

