Re: [PR] fix: resolve qualified column references after aggregation [datafusion]

via GitHub Thu, 25 Sep 2025 17:25:27 -0700


aaraujo commented on PR #17634:
URL: https://github.com/apache/datafusion/pull/17634#issuecomment-3328688468


   @alamb @Jefffrey Thank you for the thorough review feedback. After 
conducting an extensive investigation into the real-world impact of this issue, 
I need to recommend **closing this PR**. Here's what we discovered:
   
   ### How We Originally Hit This Issue
   
   This issue was first identified during PackDB integration testing where we 
saw query translation failures. The initial investigation suggested that 96.6% 
of dashboard queries were affected by qualified column reference resolution 
after aggregation operations. However, a deeper analysis revealed this was a 
red herring.
   
   ### What Our Investigation Revealed
   
   #### 1. **PackDB Doesn't Actually Need This Fix**
   
   After analyzing the current PackDB codebase (which uses DataFusion 50.0.0), 
we discovered that PackDB naturally avoids this issue entirely through its 
query translation patterns:
   
   ```rust
   // PackDB's actual pattern (from promql/planner_impl.rs):
   .aggregate(group_by_cols, vec![aggregate_expr.alias("value")])?
   .project(vec![lit(context.end).alias("timestamp"), col("value")])?
   //                                                   ^^^^^^^^^
   //                                              Always unqualified
   ```
   
   PackDB consistently uses unqualified column references after aggregation, 
which is the natural and correct DataFrame API usage pattern.
   
   #### 2. The Fix is Incomplete
   
   The current fix only addresses expression evaluation (get_type() and 
nullable() methods) but doesn't fix the main claimed use case:
   
   ```rust
   // This still fails even with the fix:
   let result = aggregated.select(vec![col("metrics.avg_cpu")]);
   // Error: Schema error: No field named metrics.avg_cpu
   
   // Only this works (binary expressions):
   let expr = col("metrics.avg_cpu") * lit(2.0); // Fixed by this PR
   ```
   
   The fix doesn't extend to field_from_column() which is used by .select() 
operations - the primary DataFrame API method mentioned in all examples.
   
   #### 3. No Real-World Usage Pattern Identified
   
   After extensive searching, we could only identify one contrived scenario 
where this might be needed:
   
   ```rust
   // Poorly designed query builder that retains qualifiers
   fn build_query(table_name: &str) -> Result<DataFrame> {
       let aggregated = table.aggregate(/*...*/)?;
   
       // Bad pattern: accidentally using qualified reference to aggregated 
column  
       aggregated.select(vec![
           col(&format!("{}.avg_metric", table_name)) // This would fail
       ])
   }
   ```
   
   However, this represents poor API usage that should be discouraged, not 
accommodated.
   
   #### 4. Natural Programming Patterns Avoid This
   
   Well-designed DataFrame code naturally avoids this issue:
   
   ```rust
   // Natural pattern - developers reference the columns they create
   let aggregated = table.aggregate(
       vec![col("region")],
       vec![avg(col("cpu_usage")).alias("avg_cpu")]  // Create "avg_cpu"
   )?;
   
   let result = aggregated.select(vec![
       col("region"),
       col("avg_cpu")  // Reference what we created - no qualification needed
   ])?;
   ```
   
   Why This Issue Appeared Initially
   
   The original PackDB integration issues were likely caused by:
   
   1. Early development patterns that were later refactored
   2. Query translation bugs that were fixed through different means
   3. Misattribution - other schema resolution issues that weren't related to 
qualified column references
   
   Current Error Handling is Actually Good
   
   The existing error message is clear and actionable:
   ```
   Schema error: No field named metrics.avg_cpu. Did you mean 'avg_cpu'?
   ```
   
   This guides users toward the correct unqualified column usage, which is the 
proper DataFrame API pattern.
   
   Recommendation
   
   I recommend closing this PR because:
   
   1. No real-world impact: The motivating use case (PackDB) doesn't actually 
need this fix
   2. Incomplete solution: Doesn't fix the main claimed use case (.select() 
operations)
   3. Encourages bad patterns: Would accommodate poor API usage rather than 
guide users to better practices
   4. No community need: No evidence of other DataFusion users hitting this 
specific pattern
   
   Thank you both for the detailed review process - it helped us discover that 
the original problem had already been solved through better API usage patterns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: resolve qualified column references after aggregation [datafusion]

Reply via email to