EmilySun621 opened a new pull request, #5114:
URL: https://github.com/apache/texera/pull/5114
Before building any workflow, she clicks "📊 Profile Data" and instantly
sees: 150 rows, 6 columns, zero missing values, Quality Score 100/100,
"Species" auto-detected as the target variable, "Id" flagged as an ID column to
drop. She hasn't written a single operator yet, but she already knows her data.
What We Built
Click any CSV File Scan operator → "📊 Profile Data" button in the properties
panel → a profiling modal opens with full dataset analysis. No need to run the
workflow. Real data, real statistics, computed on the fly.
Data Quality Score (0-100)
A single number that tells you how clean your data is, with sub-scores
explaining why:
Completeness (missing value percentage)
Duplicates
Outliers (beyond 3σ)
Constant columns (zero information)
High-cardinality categoricals (likely IDs)
Class imbalance
Score 90-100 = green "Excellent", 70-89 = orange "Good", 50-69 = orange
"Needs attention", 0-49 = red "Poor quality"
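The score bands above can be sketched as a small lookup. This is only an illustration: `QualityBand` and `qualityBand` are hypothetical names, not necessarily what the PR's `computeQualityScore` exposes, and how the sub-scores are weighted is not stated here, so the sketch covers only the score-to-label mapping. Note that both middle bands are orange, as listed above.

```typescript
// Hypothetical helper: maps a 0-100 quality score to the label and color
// bands listed in the PR description. Names are illustrative only.
interface QualityBand {
  label: "Excellent" | "Good" | "Needs attention" | "Poor quality";
  color: "green" | "orange" | "red";
}

function qualityBand(score: number): QualityBand {
  // Clamp so out-of-range inputs still land in a defined band.
  const s = Math.max(0, Math.min(100, score));
  if (s >= 90) return { label: "Excellent", color: "green" };
  if (s >= 70) return { label: "Good", color: "orange" };
  if (s >= 50) return { label: "Needs attention", color: "orange" };
  return { label: "Poor quality", color: "red" };
}
```

For example, `qualityBand(100)` yields the green "Excellent" band reported for Iris in the demo.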
Suggested Cleaning Actions
Rule-based suggestions derived from the profiling data — no LLM, no
hallucination:
"Impute HbA1c missing values — 12.3% missing, use median imputation" → [Add
to Workflow]
"Drop constant column smoker_flag — only 1 unique value" → [Add to Workflow]
"Remove 23 duplicate rows — 3.0% duplicates" → [Add to Workflow]
"patient_id looks like an ID column — 100% unique, drop before modeling" →
[Add to Workflow]
"Review outliers in income — 42 outliers (5.5%) beyond 3σ" → [Copy hint]
Each suggestion has a severity badge (critical/warning/info) and an action
button.
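A minimal sketch of how rule-based suggestions like these could be derived from per-column counts. The types, the function name `suggestForColumn`, the severity assignments, and the 10% cutoff for escalating missing values are all assumptions made for illustration; the actual `generateSuggestions` in this PR may use different rules and wording.

```typescript
// Hypothetical types; the PR's real Suggestion type may differ.
type Severity = "critical" | "warning" | "info";

interface ColumnSummary {
  name: string;
  rowCount: number;
  missingCount: number;
  uniqueCount: number; // distinct non-missing values
}

interface Suggestion {
  severity: Severity;
  message: string;
  action: "Add to Workflow" | "Copy hint";
}

function suggestForColumn(col: ColumnSummary): Suggestion[] {
  const out: Suggestion[] = [];
  if (col.rowCount === 0) return out;

  const missingPct = (col.missingCount / col.rowCount) * 100;
  if (col.missingCount > 0) {
    out.push({
      // Assumed rule: more than 10% missing is a warning, otherwise info.
      severity: missingPct > 10 ? "warning" : "info",
      message: `Impute ${col.name} missing values: ${missingPct.toFixed(1)}% missing`,
      action: "Add to Workflow",
    });
  }
  if (col.uniqueCount === 1) {
    out.push({
      severity: "warning",
      message: `Drop constant column ${col.name}: only 1 unique value`,
      action: "Add to Workflow",
    });
  }
  // Every row distinct and nothing missing: likely an identifier.
  if (col.rowCount > 1 && col.missingCount === 0 && col.uniqueCount === col.rowCount) {
    out.push({
      severity: "info",
      message: `${col.name} looks like an ID column: 100% unique, drop before modeling`,
      action: "Add to Workflow",
    });
  }
  return out;
}
```

Because every rule is a plain comparison on counts, the same input always produces the same suggestions. One caveat of this sketch is that a continuous numeric column with no repeated values would also trip the "100% unique" rule, so a real implementation would likely combine it with the column's type or name.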
Column Role Detection (auto-detected)
Heuristic pattern matching classifies each column's role in an ML pipeline:
🎯 Possible target: columns named "target", "label", "class", "outcome", or
low-cardinality categoricals
🏷️ ID: high-cardinality columns or names matching "id", "patient_id", "index"
📊 Feature: numeric and categorical columns suitable for modeling
📅 Datetime: date/time columns
⚪ Constant: single-value columns (flag for removal)
Summary at top: "1 possible target: Species, 1 ID: Id, 4 features:
SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
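The role heuristics above could look roughly like the following. This is a sketch under stated assumptions: `detectRole`, `ColumnInfo`, the rule ordering, and the numeric cutoffs (at most 10 distinct values for a possible target, more than 90% distinct for a high-cardinality categorical) are hypothetical, since the description does not give the thresholds used by `detectColumnRoles`.

```typescript
type ColumnRole = "target" | "id" | "feature" | "datetime" | "constant";

interface ColumnInfo {
  name: string;
  kind: "numeric" | "categorical" | "datetime";
  uniqueCount: number;
  rowCount: number;
}

// Names taken from the PR description.
const TARGET_NAMES = new Set(["target", "label", "class", "outcome"]);
// Matches "id", "index", and names ending in "_id" such as "patient_id".
const ID_NAME = /^(id|index|.*_id)$/;

function detectRole(col: ColumnInfo): ColumnRole {
  const name = col.name.trim().toLowerCase();
  if (col.uniqueCount <= 1) return "constant";
  if (col.kind === "datetime") return "datetime";
  if (ID_NAME.test(name)) return "id";
  if (TARGET_NAMES.has(name)) return "target";
  if (col.kind === "categorical") {
    const uniqueRatio = col.uniqueCount / col.rowCount;
    if (uniqueRatio > 0.9) return "id"; // assumed cutoff
    if (col.uniqueCount <= 10) return "target"; // assumed cutoff
  }
  return "feature";
}
```

With the Iris columns from the summary line, `Id` matches the name rule, `Species` (3 distinct values) falls into the low-cardinality categorical rule, and the four measurement columns fall through to `feature`.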
Per-Column Statistics
Each column card shows:
Name + data type badge (numeric/categorical) + role badge
Numeric: mean, median, std, min, max, range
Numeric: inline SVG histogram (10 bins)
Categorical: unique count, top values with counts
Missing value warning (red if > 10%)
Role-based suggestion ("Use as input feature", "Drop before modeling")
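For reference, the numeric statistics and the 10-bin histogram on each card can be computed in one pass over the sorted values. The sketch below is an assumption about the approach, not the PR's code: `numericStats` is a hypothetical name, the bins are equal-width between min and max, and the standard deviation is the population form (dividing by n), where the sample form (n - 1) would be an equally plausible choice.

```typescript
interface NumericStats {
  mean: number;
  median: number;
  std: number;
  min: number;
  max: number;
  range: number;
  histogram: number[]; // counts per equal-width bin
}

function numericStats(values: number[], bins = 10): NumericStats {
  if (values.length === 0) throw new Error("numericStats needs at least one value");
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const min = sorted[0];
  const max = sorted[n - 1];
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation (assumption; see lead-in).
  const variance = sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / n;
  const std = Math.sqrt(variance);

  const histogram = new Array<number>(bins).fill(0);
  const width = (max - min) / bins;
  for (const v of sorted) {
    // The maximum value would index one past the end, so clamp it into the last bin.
    const idx = width === 0 ? 0 : Math.min(bins - 1, Math.floor((v - min) / width));
    histogram[idx] += 1;
  }
  return { mean, median, std, min, max, range: max - min, histogram };
}
```

The bin counts can then be scaled to bar heights for the inline SVG. When every value is identical the bin width is zero, so the sketch places all values in the first bin.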
Overview Section
Three tabs:
Columns: all column cards with stats and histograms
Missing: missing value summary across columns
Correlations: correlation matrix for numeric columns
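The description does not say which correlation coefficient the Correlations tab uses. Assuming the common choice of Pearson's r, a matrix over numeric columns could be built as sketched below; `pearson` and `correlationMatrix` are hypothetical names, and the sketch assumes all columns have equal length with no missing values.

```typescript
// Pearson correlation coefficient between two equal-length series.
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length === 0) {
    throw new Error("pearson needs two non-empty series of equal length");
  }
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  const denom = Math.sqrt(sxx * syy);
  // Undefined when either series has zero variance (a constant column).
  return denom === 0 ? NaN : sxy / denom;
}

function correlationMatrix(
  columns: Record<string, number[]>
): { names: string[]; matrix: number[][] } {
  const names = Object.keys(columns);
  const matrix = names.map(a =>
    // The diagonal is set to 1 directly rather than computed.
    names.map(b => (a === b ? 1 : pearson(columns[a], columns[b])))
  );
  return { names, matrix };
}
```

A constant column yields `NaN` against every other column here, which a UI would need to render as "not available" rather than as a number.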
Real Data Analysis
Not mock data — the panel reads the actual CSV file through Texera's
file-service API, parses it, and computes all statistics in real time. Falls
back to mock data only if the API call fails.
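As an illustration of the parse-and-count step, the sketch below turns CSV text into rows and derives the row count, column count, and missing values per column. It is not the PR's parser: `parseCsv` and `summarizeCsv` are hypothetical names, a missing value is assumed to mean an empty or whitespace-only cell, and the call to the file-service API is left out entirely.

```typescript
// Minimal CSV parser handling quoted fields, escaped quotes (""), and CRLF.
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') { field += '"'; i++; } // escaped quote
        else inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }
  // Flush the last record when the text does not end with a line break.
  if (field !== "" || row.length > 0) { row.push(field); rows.push(row); }
  return rows;
}

interface CsvSummary {
  rowCount: number; // data rows, excluding the header
  columnCount: number;
  missingByColumn: Record<string, number>;
}

function summarizeCsv(text: string): CsvSummary {
  const [header = [], ...data] = parseCsv(text);
  const missingByColumn: Record<string, number> = {};
  header.forEach((name, c) => {
    // Assumption: empty or whitespace-only cells count as missing.
    missingByColumn[name] = data.filter(r => (r[c] ?? "").trim() === "").length;
  });
  return { rowCount: data.length, columnCount: header.length, missingByColumn };
}
```

For the Iris file in the demo, a summary of this shape is what would back the "150 rows, 6 columns, zero missing values" line.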
Demo
Open Iris workflow → click CSV File Scan operator
Properties panel shows "📊 Profile Data" button
Click → Data Profile modal opens
Quality Score: 100/100 "Excellent" (Iris is a clean dataset)
Column Roles: Species = possible target, Id = ID column, 4 features
Scroll columns: see histograms for SepalLength, PetalWidth, etc.
No suggested cleaning actions (Iris is clean) — switch to diabetes dataset
to show suggestions
<img width="1327" height="895" alt="Screenshot 2026-05-16 at 12 38 25 PM"
src="https://github.com/user-attachments/assets/c7accb42-51db-46f8-a5eb-3a01b7514949"
/>
Files Changed
New files:
workspace/component/data-profiling-panel/data-profiling.types.ts — Profile,
Column, Suggestion, Role types
workspace/component/data-profiling-panel/data-profiling.utils.ts —
computeQualityScore, generateSuggestions, detectColumnRoles
workspace/component/data-profiling-panel/data-profiling.service.ts — Fetches
CSV via file-service, parses, computes stats
workspace/component/data-profiling-panel/data-profiling-panel.component.* —
Main panel UI
workspace/component/data-profiling-panel/data-profiling-modal.component.ts —
Modal wrapper
Modified (additive only):
operator-property-edit-frame component — "📊 Profile Data" button for
CSV/scan operators
Testing
Angular typecheck: clean
Profile button appears on CSV File Scan operators
Real Iris.csv data: 150 rows, 6 columns, Quality Score 100
Column role detection: Species = target, Id = ID, 4 features
Histograms render correctly for numeric columns
Suggestions are generated for datasets with issues (mock diabetes data)