Dracylfrr commented on issue #5693:
URL: https://github.com/apache/texera/issues/5693#issuecomment-4793886505
Hi @chenlica and @Yicong-Huang, thank you for the feedback. That makes sense
to me.
I understand the concern that the current **Column Summary Statistics**
operator may be too close to existing functionality:
1. The result table statistics feature already provides a quick summary for
display and inspection.
2. The Aggregate operator is the standard workflow building block when users
want to manually compute statistics as an output dataset.
I agree that if this operator is viewed only as a preconfigured aggregation,
it is little more than syntactic sugar over Aggregate. I think repositioning it
as a **Describe** operator would make the purpose clearer.
The way I see it, the **Describe** operator would focus on describing the
input table as a whole, similar to `df.describe()`, `df.shape`, or SQL
`DESCRIBE`, instead of only acting as a shortcut for aggregation.
Here are two rough figures to show the difference I had in mind.
**Figure 1: Existing approaches**
```text
                 Input Table
                      |
       |------------------------------|
       |                              |
       v                              v
Result Table Statistics       Aggregate Operator
(UI quick summary)            (user-configured aggregation)
       |                              |
       v                              v
Displayed in result panel     Output dataset for downstream operators
```
In the existing approach, result table statistics are useful for quick UI
inspection, while the Aggregate operator is useful when users know which
aggregations they want to configure.
**Figure 2: Proposed Describe operator**
```text
       Input Table
            |
            v
    Describe Operator
            |
            v
Table Description Dataset

Example output:
columnName | dataType | rowCount | nullCount | nonNullCount | minValue | maxValue | meanValue
age        | integer  | 1000     | 20        | 980          | 18       | 85       | 42.5
name       | string   | 1000     | 5         | 995          | -        | -        | -
createdAt  | datetime | 1000     | 0         | 1000         | -        | -        | -
```
The intended difference is that **Describe** would generate a reusable
table/profile description as workflow output. This could then be connected to
downstream operators, exported, or reused in a repeatable workflow. So the
focus would be less on “preconfigured aggregation” and more on “describe this
table.”
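As a rough illustration of the "describe this table" idea, here is a minimal pure-Python sketch that turns a list of row dictionaries into one profile row per column, using the column names from Figure 2. It is not Texera operator code: the `describe` and `infer_type` functions, the type labels, and the choice to leave min/max/mean empty for non-numeric columns are all assumptions made for the example.

```python
from datetime import datetime
from statistics import fmean


def infer_type(non_null):
    """Map the Python types seen in a column to a coarse label (hypothetical labels)."""
    kinds = {type(v) for v in non_null}
    if not kinds:
        return "unknown"
    if kinds == {bool}:
        return "boolean"
    if kinds == {int}:
        return "integer"
    if kinds <= {int, float}:
        return "double"
    if kinds == {str}:
        return "string"
    if kinds == {datetime}:
        return "datetime"
    return "mixed"


def describe(rows, columns):
    """Return one profile dict per column; the schema mirrors Figure 2 (assumed)."""
    row_count = len(rows)
    profile = []
    for col in columns:
        non_null = [r[col] for r in rows if r.get(col) is not None]
        data_type = infer_type(non_null)
        # Only numeric columns get min/max/mean in this sketch.
        numeric = non_null if data_type in ("integer", "double") else []
        profile.append({
            "columnName": col,
            "dataType": data_type,
            "rowCount": row_count,
            "nullCount": row_count - len(non_null),
            "nonNullCount": len(non_null),
            "minValue": min(numeric) if numeric else None,
            "maxValue": max(numeric) if numeric else None,
            "meanValue": fmean(numeric) if numeric else None,
        })
    return profile


# Tiny made-up input table, only to exercise the sketch.
rows = [
    {"age": 18, "name": "Ada", "createdAt": datetime(2024, 1, 1)},
    {"age": None, "name": "Bob", "createdAt": datetime(2024, 1, 2)},
    {"age": 85, "name": None, "createdAt": datetime(2024, 1, 3)},
]
profile = describe(rows, ["age", "name", "createdAt"])
for entry in profile:
    print(entry)
```

Because the result is itself a small table (a list of rows with a fixed schema), it could be filtered, joined, or exported like any other dataset, which is the property that would distinguish it from a UI-only summary.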
I can update the PR direction toward this **Describe** operator framing if
this matches what you had in mind.