aaraujo commented on PR #17634:
URL: https://github.com/apache/datafusion/pull/17634#issuecomment-3328688468
@alamb @Jefffrey Thank you for the thorough review feedback. After
conducting an extensive investigation into the real-world impact of this issue,
I need to recommend **closing this PR**. Here's what we discovered:
### How We Originally Hit This Issue
This issue was first identified during PackDB integration testing where we
saw query translation failures. The initial investigation suggested that 96.6%
of dashboard queries were affected by qualified column reference resolution
after aggregation operations. However, a deeper analysis revealed this was a
red herring.
### What Our Investigation Revealed
#### 1. **PackDB Doesn't Actually Need This Fix**
After analyzing the current PackDB codebase (which uses DataFusion 50.0.0),
we discovered that PackDB naturally avoids this issue entirely through its
query translation patterns:
```rust
// PackDB's actual pattern (from promql/planner_impl.rs):
.aggregate(group_by_cols, vec![aggregate_expr.alias("value")])?
.project(vec![lit(context.end).alias("timestamp"), col("value")])?
// ^^^^^^^^^
// Always unqualified
```
PackDB consistently uses unqualified column references after aggregation,
which is the natural and correct DataFrame API usage pattern.
#### 2. The Fix is Incomplete
The current fix only addresses expression evaluation (get_type() and
nullable() methods) but doesn't fix the main claimed use case:
```rust
// This still fails even with the fix:
let result = aggregated.select(vec![col("metrics.avg_cpu")]);
// Error: Schema error: No field named metrics.avg_cpu
// Only this works (binary expressions):
let expr = col("metrics.avg_cpu") * lit(2.0); // Fixed by this PR
```
The fix doesn't extend to field_from_column() which is used by .select()
operations - the primary DataFrame API method mentioned in all examples.
#### 3. No Real-World Usage Pattern Identified
After extensive searching, we could only identify one contrived scenario
where this might be needed:
```rust
// Poorly designed query builder that retains qualifiers
fn build_query(table_name: &str) -> Result<DataFrame> {
let aggregated = table.aggregate(/*...*/)?;
// Bad pattern: accidentally using qualified reference to aggregated
column
aggregated.select(vec![
col(&format!("{}.avg_metric", table_name)) // This would fail
])
}
```
However, this represents poor API usage that should be discouraged, not
accommodated.
#### 4. Natural Programming Patterns Avoid This
Well-designed DataFrame code naturally avoids this issue:
```rust
// Natural pattern - developers reference the columns they create
let aggregated = table.aggregate(
vec![col("region")],
vec![avg(col("cpu_usage")).alias("avg_cpu")] // Create "avg_cpu"
)?;
let result = aggregated.select(vec![
col("region"),
col("avg_cpu") // Reference what we created - no qualification needed
])?;
```
Why This Issue Appeared Initially
The original PackDB integration issues were likely caused by:
1. Early development patterns that were later refactored
2. Query translation bugs that were fixed through different means
3. Misattribution - other schema resolution issues that weren't related to
qualified column references
Current Error Handling is Actually Good
The existing error message is clear and actionable:
```
Schema error: No field named metrics.avg_cpu. Did you mean 'avg_cpu'?
```
This guides users toward the correct unqualified column usage, which is the
proper DataFrame API pattern.
Recommendation
I recommend closing this PR because:
1. No real-world impact: The motivating use case (PackDB) doesn't actually
need this fix
2. Incomplete solution: Doesn't fix the main claimed use case (.select()
operations)
3. Encourages bad patterns: Would accommodate poor API usage rather than
guide users to better practices
4. No community need: No evidence of other DataFusion users hitting this
specific pattern
Thank you both for the detailed review process - it helped us discover that
the original problem had already been solved through better API usage patterns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]