morningman commented on PR #60197:
URL: https://github.com/apache/doris/pull/60197#issuecomment-3793531983

   ⏺ Code Review: PR #60197 - Fix Parquet Reader Cannot Push Down Conjuncts for 
Min-Max Filter
   
     Overview
   
     This PR refactors the predicate pushdown mechanism for Parquet files, 
specifically addressing how min-max filters and runtime predicates are handled. 
The main changes include:
   
     1. Simplified predicate handling: Removes the complex late runtime filter 
callback mechanism from ParquetReader
     2. Enhanced predicate evaluation: Adds missing evaluate_and methods for 
Parquet-specific column statistics in multiple predicate classes
     3. Runtime predicate refactoring: Removes tablet_schema dependency from 
RuntimePredicate, replacing it with direct DataType storage
     4. Improved filter handling: Adds proper handling for null statistics and 
row group ranges
   
     Analysis
   
     Strengths
   
     1. Architectural improvement: Removing the late RF callback mechanism 
(_call_late_rf_func) simplifies the code flow and reduces complexity
     2. Consistent null handling: Properly handles is_all_null cases in 
comparison and in-list predicates (comparison_predicate.h:201, 
in_list_predicate.h:285)
     3. Better abstraction: Using DataTypePtr instead of TabletSchema in 
RuntimePredicate is cleaner and more appropriate for external tables
   
     Critical Issues
   
     1. Missing Row Group Range Addition
   
     Severity: High
   
     In multiple files, when get_stat_func returns false (indicating no page 
index stats), the code now adds row_group_range to include all rows:
   
   ```
     if (!(statistic->get_stat_func)(&stat, column_id())) {
         row_ranges->add(statistic->row_group_range);
         return true;
     }
   ```
   
     Files affected: comparison_predicate.h:229, in_list_predicate.h:313, 
null_predicate.h:99, shared_predicate.h:184
   
     Issue: The original behavior when stats are unavailable was to return true 
(conservative - read everything). Now it explicitly adds the row group range. 
While semantically similar, this changes behavior and could have performance 
implications if the calling code doesn't expect row ranges to be modified when 
stats are missing.
   
     Recommendation: Verify this behavioral change is intentional and document 
why row ranges should be explicitly added when stats are unavailable.
   
     2. Removed Code in tablet_reader.cpp
   
     Severity: Medium
   
     Lines 527-533 in tablet_reader.cpp removed the registration of TopN filter 
tablet schemas:
   
   ```
     -    for (int id : read_params.topn_filter_source_node_ids) {
     -        auto& runtime_predicate =
     -                
read_params.runtime_state->get_query_ctx()->get_runtime_predicate(id);
     -        
RETURN_IF_ERROR(runtime_predicate.set_tablet_schema(read_params.topn_filter_target_node_id,
     -                                                            
_tablet_schema));
     -    }
   ```
   
     Issue: This code was called for internal tablet reads (OLAP engine). The 
removal suggests TopN filters for internal tables might be broken, or this was 
dead code. The PR description doesn't explain this change.
   
     Recommendation: Clarify whether this affects internal table TopN 
filtering. If it does, add test coverage.
   
     3. AcceptNullPredicate Row Range Logic
   
     Severity: Medium
   
     In accept_null_predicate.h:110-121, the new evaluate_and implementation 
adds rows where has_null[page_id] is true:
   
   ```
     for (int page_id = 0; page_id < stat->num_of_pages; page_id++) {
         if (stat->has_null[page_id]) {
             row_ranges->add(stat->ranges[page_id]);
         }
     }
   ```
   
     Issue: This only adds pages with nulls but doesn't call the nested 
predicate's evaluate_and. The logic should be:
     - Add pages that match the nested predicate OR have nulls
     - Currently, it only adds null pages regardless of nested predicate result
   
     Recommendation: The implementation appears incomplete. It should combine 
results from the nested predicate with null page ranges.
   
     4. Removed Lazy Read Context Update
   
     Severity: High
   
     vparquet_reader.cpp removed significant logic including:
     - Runtime filter pushdown (lines 450-478)
     - Late arrival RF handling (lines 714-718)
     - The entire _update_lazy_read_ctx method refactoring
   
     Issue: The PR claims to "fix parquet reader cannot push down conjuncts" 
but actually removes pushdown logic for:
     - Binary predicates (GE, LE) from runtime filters
     - IN predicates from runtime filters
     - Late-arriving runtime filters
   
     Recommendation: This appears contradictory to the PR title. Need 
clarification on whether this is:
     a) Moving pushdown logic elsewhere (where?)
     b) Intentionally removing broken logic (why?)
     c) A regression
   
     Minor Issues
   
     1. Inconsistent Error Handling: comparison_predicate.h:198 and 
in_list_predicate.h:282 now check is_all_null before parsing min/max, but this 
check wasn't in the original code. Verify this is correct for all Parquet files.
     2. Missing Documentation: The PR description is incomplete - no problem 
summary, no release note justification, no test plan.
     3. Unused Code Path: file_scan_operator.h:69-73 adds _push_down_topn that 
always returns true, with a comment saying the reader will determine if it can 
be pushed. But the reader logic was just removed.
   
     Missing Elements
   
     1. No Tests: PR checklist shows no test coverage mentioned
     2. No Problem Statement: The PR description is empty
     3. No Verification: How was this tested? What queries benefit?
     4. Breaking Changes: Potential behavior changes for internal tables (TopN 
filters) not addressed
   
     Recommendations
   
     Must Fix Before Merge
   
     1. Add comprehensive test coverage including:
       - Parquet files with/without page indexes
       - Null handling in various predicate types
       - TopN filter behavior for both internal and external tables
       - Runtime filter pushdown scenarios
     2. Fill out PR description explaining:
       - What bug this fixes (with example)
       - Why removing pushdown logic is correct
       - Impact on existing functionality
     3. Fix AcceptNullPredicate logic - the row range addition appears incorrect
     4. Verify tablet_reader.cpp changes - ensure TopN filters still work for 
internal tables
   
     Should Consider
   
     1. Add comments explaining why row group ranges are added when stats are 
unavailable
     2. Document the behavior change from callback-based to upfront predicate 
evaluation
     3. Consider whether _push_down_topn returning true unconditionally is safe
   
     Risk Assessment
   
     Overall Risk: High
   
     - Significant removal of existing functionality (RF pushdown, late arrival 
handling)
     - Potential regression in query performance for Parquet files
     - Possible breaking change for internal table TopN filters
     - No test coverage to validate correctness
     - PR description provides no context for reviewing safety
   
     Recommendation: Request changes - This PR needs substantial clarification, 
testing, and potentially rework before merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to