Re: [I] Add Column Summary Statistics workflow operator [texera]

via GitHub Wed, 24 Jun 2026 14:37:33 -0700


Dracylfrr commented on issue #5693:
URL: https://github.com/apache/texera/issues/5693#issuecomment-4793886505


   Hi @chenlica and @Yicong-Huang, thank you for the feedback. That makes sense 
to me.
   
   I understand the concern that the current **Column Summary Statistics** 
operator may be too close to existing functionality:
   
   1. The result table statistics feature already provides a quick summary for 
display and inspection.
   2. The Aggregate operator is the standard workflow building block when users 
want to manually compute statistics as an output dataset.
   
   I agree that if this operator is viewed only as a preconfigured aggregation, 
it becomes little more than syntactic sugar over Aggregate. I think 
repositioning it as a **Describe** operator would make the purpose clearer.
   
   The way I see it, the **Describe** operator would focus on describing the 
input table as a whole, similar to `df.describe()`, `df.shape`, or SQL 
`DESCRIBE`, instead of only acting as a shortcut for aggregation.
   
   Here are two rough figures to show the difference I had in mind.
   
   **Figure 1: Existing approaches**
   
   ```text
   Input Table
      |
      |------------------------------|
      |                              |
      v                              v
   Result Table Statistics        Aggregate Operator
   (UI quick summary)             (user-configured aggregation)
      |                              |
      v                              v
   Displayed in result panel       Output dataset for downstream operators
   ```
   
   In the existing approach, result table statistics are useful for quick UI 
inspection, while the Aggregate operator is useful when users know which 
aggregations they want to configure.
   
   **Figure 2: Proposed Describe operator**
   
   ```text
   Input Table
      |
      v
   Describe Operator
      |
      v
   Table Description Dataset
   
   Example output:
   
   columnName | dataType | rowCount | nullCount | nonNullCount | minValue | maxValue | meanValue
   age        | integer  | 1000     | 20        | 980          | 18       | 85       | 42.5
   name       | string   | 1000     | 5         | 995          | -        | -        | -
   createdAt  | datetime | 1000     | 0         | 1000         | -        | -        | -
   ```
   
   The intended difference is that **Describe** would generate a reusable 
table/profile description as workflow output. This could then be connected to 
downstream operators, exported, or reused in a repeatable workflow. So the 
focus would be less on “preconfigured aggregation” and more on “describe this 
table.”
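
To make the intended output shape more concrete, here is a rough, standalone Python sketch of the per-column profile. It is only an illustration under my own assumptions, not Texera operator code: the `describe` and `infer_type` helpers, the in-memory list-of-dicts input, and the type labels are all hypothetical, and min/max/mean are filled in only for numeric columns, matching the example output above.

```python
from datetime import datetime
from statistics import mean


def infer_type(non_null):
    """Map the Python types seen in a column to a label (labels are assumptions)."""
    if not non_null:
        return "unknown"
    kinds = {type(v) for v in non_null}
    if kinds == {bool}:
        return "boolean"
    if kinds == {int}:
        return "integer"
    if kinds <= {int, float}:
        return "double"
    if kinds == {str}:
        return "string"
    if kinds == {datetime}:
        return "datetime"
    return "mixed"


def describe(rows, columns):
    """Return one description row per input column (hypothetical output schema)."""
    description = []
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        data_type = infer_type(non_null)
        # Only numeric columns get min/max/mean in this sketch.
        is_numeric = data_type in ("integer", "double")
        description.append({
            "columnName": col,
            "dataType": data_type,
            "rowCount": len(values),
            "nullCount": len(values) - len(non_null),
            "nonNullCount": len(non_null),
            "minValue": min(non_null) if is_numeric else None,
            "maxValue": max(non_null) if is_numeric else None,
            "meanValue": mean(non_null) if is_numeric else None,
        })
    return description


# Tiny made-up input, just to exercise the sketch.
rows = [
    {"age": 18, "name": "Ann", "createdAt": datetime(2024, 1, 1)},
    {"age": 85, "name": None, "createdAt": datetime(2024, 1, 2)},
    {"age": None, "name": "Cy", "createdAt": datetime(2024, 1, 3)},
]
profile = describe(rows, ["age", "name", "createdAt"])
for entry in profile:
    print(entry)
```

The point of the sketch is that the result is itself a table with one row per input column, so it could be passed to downstream operators like any other dataset instead of only being shown in the result panel.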
   
   I can update the PR direction toward this **Describe** operator framing if 
this matches what you had in mind.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
