konstantinb commented on code in PR #6331:
URL: https://github.com/apache/hive/pull/6331#discussion_r2842426035
##########
ql/src/test/results/clientpositive/llap/union26.q.out:
##########
@@ -129,20 +129,20 @@ STAGE PLANS:
Select Operator
expressions: _col0 (type: string), _col1 (type: string)
outputColumnNames: _col0, _col1
- Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE
Review Comment:
This is a typical example of lateral view (LV) column stats affecting the
data size estimates for SELECT columns:
` Column Naming
| Context | Column Name | Represents | avgColLen |
|---------------------|-------------|-------------------------|-----------|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | — |
| UDTF internal stats | _col0 | array expression input | 56.0 |
The UDTF branch's column name generator restarts at 0, so its internal stats
use _col0 for the array expression, which collides with SELECT's _col0.
---
Processing Comparison
| Step | Original Code | Proposed Fix |
|-----------------------|----------------------------------------------------------|---------------------------------------------|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
---
Final Column Statistics
| Column | Original Code | Proposed Fix |
|-----------------|---------------|--------------|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |
Data Size — LVJ Debug Output (500 rows)
| | Original Code | Proposed Fix |
|-------------|---------------|--------------|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |
---
Data Size — EXPLAIN Output (500 rows)
| Column | Original Code | Proposed Fix |
|-----------------|---------------|--------------|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |
| | Original Code | Proposed Fix |
|-------------|---------------|--------------|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |`
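The collision and its effect on the LVJ debug totals can be reproduced with a small model. This is only a sketch under stated assumptions: it is written in Python for brevity, the dictionaries and the functions `merge_original`, `merge_proposed`, and `data_size` are hypothetical stand-ins rather than Hive APIs, and the only inputs are the avgColLen values quoted in the tables above.

```python
# Illustrative model only: the dicts and function names are hypothetical,
# not Hive APIs. Values are the avgColLen figures quoted in the comment.
SELECT_STATS = {"_col0": 2.812, "_col1": 6.812}  # SELECT branch (key, value)
UDTF_STATS = {"_col0": 56.0}                     # UDTF branch: array expression input
SCHEMA = ["_col0", "_col1", "_col8"]             # LVJ output schema
NUM_SEL_COLUMNS = 2                              # leading columns owned by SELECT
UDTF_EXPR_MAP = {"_col8": "col"}                 # _col8 -> Column[col]
NUM_ROWS = 500


def merge_original(schema, select_stats, udtf_stats):
    """Look up every output name in both branches and keep the larger avgColLen."""
    merged = {}
    for name in schema:
        found = [stats[name] for stats in (select_stats, udtf_stats) if name in stats]
        if found:
            merged[name] = max(found)
    return merged


def merge_proposed(schema, num_sel, select_stats, udtf_stats, udtf_expr_map):
    """Resolve SELECT names only in SELECT stats, UDTF names only via the UDTF map."""
    merged = {}
    for name in schema[:num_sel]:
        if name in select_stats:
            merged[name] = select_stats[name]
    for name in schema[num_sel:]:
        source = udtf_expr_map.get(name)  # None when the name has no UDTF expression
        if source in udtf_stats:
            merged[name] = udtf_stats[source]
    return merged


def data_size(col_stats, num_rows):
    """Per-row total of avgColLen times the row count, rounded to whole bytes."""
    return round(sum(col_stats.values()) * num_rows)


original = merge_original(SCHEMA, SELECT_STATS, UDTF_STATS)
proposed = merge_proposed(SCHEMA, NUM_SEL_COLUMNS, SELECT_STATS, UDTF_STATS, UDTF_EXPR_MAP)
print(original, data_size(original, NUM_ROWS))
print(proposed, data_size(proposed, NUM_ROWS))
```

In this model the shared lookup lets the UDTF's `_col0` (56.0) overwrite SELECT's `_col0` (2.812), while the split lookup never consults the UDTF stats for SELECT-owned names. The EXPLAIN figures follow the same multiplication, using the 231 and 178 bytes-per-row totals reported there.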
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]