[PR] [Hackathon] feat: Data Profiling Panel — Instant Dataset Analysis Before You Run [texera]

via GitHub Sat, 16 May 2026 12:47:13 -0700


EmilySun621 opened a new pull request, #5114:
URL: https://github.com/apache/texera/pull/5114


   Before building any workflow, she clicks "📊 Profile Data" and instantly 
sees: 150 rows, 6 columns, zero missing values, Quality Score 100/100, 
"Species" auto-detected as the target variable, "Id" flagged as an ID column to 
drop. She hasn't written a single operator yet, but she already knows her data.
   What We Built
   Click any CSV File Scan operator → "📊 Profile Data" button in the properties 
panel → a profiling modal opens with full dataset analysis. No need to run the 
workflow. Real data, real statistics, computed on the fly.
   Data Quality Score (0-100)
   A single number that tells you how clean your data is, with sub-scores 
explaining why:
   
   Completeness (missing value percentage)
   Duplicates
   Outliers (beyond 3σ)
   Constant columns (zero information)
   High-cardinality categoricals (likely IDs)
   Class imbalance
   
   Score 90-100 = green "Excellent", 70-89 = orange "Good", 50-69 = orange 
"Needs attention", 0-49 = red "Poor quality"
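
   The score bands above can be sketched as a small lookup. This is only an illustration: `QualityBand` and `qualityBand` are hypothetical names, not necessarily what `data-profiling.utils.ts` uses, and the clamping of out-of-range input is an assumption.

```typescript
// Hypothetical type; the PR's actual type names may differ.
type QualityBand = { label: string; color: "green" | "orange" | "red" };

// Maps a 0-100 quality score to the label and colour listed in the description.
function qualityBand(score: number): QualityBand {
  // Assumption: clamp so out-of-range inputs still land in a defined band.
  const s = Math.max(0, Math.min(100, score));
  if (s >= 90) return { label: "Excellent", color: "green" };
  if (s >= 70) return { label: "Good", color: "orange" };
  if (s >= 50) return { label: "Needs attention", color: "orange" };
  return { label: "Poor quality", color: "red" };
}
```

   For example, `qualityBand(100)` yields the green "Excellent" band reported for Iris.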
   Suggested Cleaning Actions
   Rule-based suggestions derived from the profiling data — no LLM, no 
hallucination:
   
   "Impute HbA1c missing values — 12.3% missing, use median imputation" → [Add 
to Workflow]
   "Drop constant column smoker_flag — only 1 unique value" → [Add to Workflow]
   "Remove 23 duplicate rows — 3.0% duplicates" → [Add to Workflow]
   "patient_id looks like an ID column — 100% unique, drop before modeling" → 
[Add to Workflow]
   "Review outliers in income — 42 outliers (5.5%) beyond 3σ" → [Copy hint]
   
   Each suggestion has a severity badge (critical/warning/info) and an action 
button.
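
   A minimal sketch of how rule-based suggestions like these could be derived from per-column summaries. The `ColumnSummary` and `Suggestion` shapes, the function name, and the severity assigned to each rule are assumptions made for illustration; the PR's `generateSuggestions` may use different rules and thresholds.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical per-column summary; field names are illustrative.
interface ColumnSummary {
  name: string;
  rowCount: number;
  missingCount: number;
  uniqueCount: number; // distinct non-missing values
}

interface Suggestion {
  severity: Severity;
  message: string;
  action: "Add to Workflow" | "Copy hint";
}

function suggestForColumn(col: ColumnSummary): Suggestion[] {
  const out: Suggestion[] = [];
  const missingPct = col.rowCount === 0 ? 0 : (100 * col.missingCount) / col.rowCount;

  if (missingPct > 0) {
    out.push({
      // Assumption: reuse the 10% threshold the column cards use for the red warning.
      severity: missingPct > 10 ? "critical" : "warning",
      message: `Impute ${col.name} missing values: ${missingPct.toFixed(1)}% missing`,
      action: "Add to Workflow",
    });
  }

  if (col.uniqueCount === 1) {
    out.push({
      severity: "warning",
      message: `Drop constant column ${col.name}: only 1 unique value`,
      action: "Add to Workflow",
    });
  } else if (col.rowCount > 1 && col.missingCount === 0 && col.uniqueCount === col.rowCount) {
    out.push({
      severity: "info",
      message: `${col.name} looks like an ID column: 100% unique, drop before modeling`,
      action: "Add to Workflow",
    });
  }
  return out;
}
```

   Because every rule is a plain comparison on computed counts, the same input always produces the same suggestions. Note that a "100% unique" rule alone would also match continuous numeric features, so a real implementation would likely combine it with type or name checks.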
   Column Role Detection (auto-detected)
   Heuristic pattern matching classifies each column's role in an ML pipeline:
   
   🎯 Possible target: columns named "target", "label", "class", "outcome", or 
low-cardinality categoricals
   🏷️ ID: high-cardinality columns or names matching "id", "patient_id", "index"
   📊 Feature: numeric and categorical columns suitable for modeling
   📅 Datetime: date/time columns
   ⚪ Constant: single-value columns (flag for removal)
   
   Summary at top: "1 possible target: Species, 1 ID: Id, 4 features: 
SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm"
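
   The role heuristics above could look roughly like the following. This is a sketch under stated assumptions: `ColumnInfo`, `detectRole`, the rule ordering, the exact name patterns, and the `maxTargetClasses` cutoff of 10 are all hypothetical, and the PR's `detectColumnRoles` may differ.

```typescript
type ColumnRole = "target" | "id" | "feature" | "datetime" | "constant";

// Hypothetical column description; field names are illustrative.
interface ColumnInfo {
  name: string;
  kind: "numeric" | "categorical" | "datetime";
  uniqueCount: number;
  rowCount: number;
}

const TARGET_NAMES = ["target", "label", "class", "outcome"];
// Matches "id", "index", and names ending in "_id" such as "patient_id".
const ID_NAME = /^(id|index)$|_id$/;

function detectRole(col: ColumnInfo, maxTargetClasses = 10): ColumnRole {
  const name = col.name.trim().toLowerCase();
  if (col.uniqueCount <= 1) return "constant";
  if (col.kind === "datetime") return "datetime";
  if (ID_NAME.test(name)) return "id";
  if (TARGET_NAMES.indexOf(name) !== -1) return "target";
  // A categorical column where every row is distinct is treated as an ID.
  if (col.kind === "categorical" && col.uniqueCount === col.rowCount) return "id";
  // Low-cardinality categoricals are only *possible* targets.
  if (col.kind === "categorical" && col.uniqueCount <= maxTargetClasses) return "target";
  return "feature";
}
```

   With the Iris columns, this ordering reproduces the summary quoted above: "Species" (categorical, 3 values) becomes a possible target, "Id" matches the ID name pattern, and the four measurements fall through to "feature".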
   Per-Column Statistics
   Each column card shows:
   
   Name + data type badge (numeric/categorical) + role badge
   Numeric: mean, median, std, min, max, range
   Numeric: inline SVG histogram (10 bins)
   Categorical: unique count, top values with counts
   Missing value warning (red if > 10%)
   Role-based suggestion ("Use as input feature", "Drop before modeling")
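
   As an illustration of the numeric statistics listed above, the sketch below computes the summary values, an equal-width histogram, and a 3σ outlier count. The names `describeNumeric` and `countOutliers3Sigma` are hypothetical, and the use of population standard deviation and equal-width bins are assumptions; the description does not say which variants the panel uses.

```typescript
interface NumericStats {
  mean: number;
  median: number;
  std: number;
  min: number;
  max: number;
  range: number;
  histogram: number[]; // counts per equal-width bin
}

function describeNumeric(values: number[], bins = 10): NumericStats {
  if (values.length === 0) throw new Error("describeNumeric needs at least one value");
  const sorted = values.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation (divides by n); a sample estimate would divide by n - 1.
  const std = Math.sqrt(sorted.reduce((s, v) => s + (v - mean) * (v - mean), 0) / n);
  const min = sorted[0];
  const max = sorted[n - 1];

  const histogram: number[] = [];
  for (let i = 0; i < bins; i++) histogram.push(0);
  const width = (max - min) / bins;
  for (const v of sorted) {
    // The maximum value is folded into the last bin; a constant column uses bin 0.
    const i = width === 0 ? 0 : Math.min(bins - 1, Math.floor((v - min) / width));
    histogram[i] += 1;
  }
  return { mean, median, std, min, max, range: max - min, histogram };
}

// Counts values more than three standard deviations away from the mean.
function countOutliers3Sigma(values: number[]): number {
  const { mean, std } = describeNumeric(values);
  if (std === 0) return 0;
  return values.filter(v => Math.abs(v - mean) > 3 * std).length;
}
```

   The histogram counts can be fed directly to an inline SVG by scaling each count to a bar height.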
   
   Overview Section
   Three tabs:
   
   Columns: all column cards with stats and histograms
   Missing: missing value summary across columns
   Correlations: correlation matrix for numeric columns
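
   The description does not state which correlation coefficient the Correlations tab uses. Assuming Pearson correlation, a matrix over the numeric columns could be sketched as follows; `pearson` and `correlationMatrix` are hypothetical names, and the columns are assumed to be equal-length arrays with missing rows already removed.

```typescript
// Pearson correlation coefficient of two equal-length numeric arrays.
function pearson(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("pearson needs two arrays of equal length >= 2");
  }
  const n = x.length;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i];
    sumY += y[i];
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  const denom = Math.sqrt(sxx * syy);
  // Undefined when either column is constant; report NaN rather than dividing by zero.
  return denom === 0 ? NaN : sxy / denom;
}

// Builds a symmetric matrix with 1 on the diagonal.
function correlationMatrix(columns: Record<string, number[]>): { names: string[]; matrix: number[][] } {
  const names = Object.keys(columns);
  const matrix = names.map(a =>
    names.map(b => (a === b ? 1 : pearson(columns[a], columns[b])))
  );
  return { names, matrix };
}
```

   Constant columns produce `NaN` off the diagonal, which a UI could render as an empty cell.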
   
   Real Data Analysis
   Not mock data — the panel reads the actual CSV file through Texera's 
file-service API, parses it, and computes all statistics in real time. Falls 
back to mock data only if the API call fails.
   Demo
   
   Open Iris workflow → click CSV File Scan operator
   Properties panel shows "📊 Profile Data" button
   Click → Data Profile modal opens
   Quality Score: 100/100 "Excellent" (Iris is a clean dataset)
   Column Roles: Species = possible target, Id = ID column, 4 features
   Scroll columns: see histograms for SepalLength, PetalWidth, etc.
   No suggested cleaning actions (Iris is clean) — switch to diabetes dataset 
to show suggestions
   
   <img width="1327" height="895" alt="Screenshot 2026-05-16 at 12 38 25 PM" src="https://github.com/user-attachments/assets/c7accb42-51db-46f8-a5eb-3a01b7514949" />
   
   
   Files Changed
   New files:
   
   workspace/component/data-profiling-panel/data-profiling.types.ts — Profile, 
Column, Suggestion, Role types
   workspace/component/data-profiling-panel/data-profiling.utils.ts — 
computeQualityScore, generateSuggestions, detectColumnRoles
   workspace/component/data-profiling-panel/data-profiling.service.ts — Fetches 
CSV via file-service, parses, computes stats
   workspace/component/data-profiling-panel/data-profiling-panel.component.* — 
Main panel UI
   workspace/component/data-profiling-panel/data-profiling-modal.component.ts — 
Modal wrapper
   
   Modified (additive only):
   
   operator-property-edit-frame component — "📊 Profile Data" button for 
CSV/scan operators
   
   Testing
   
   Angular typecheck: clean
   Profile button appears on CSV File Scan operators
   Real Iris.csv data: 150 rows, 6 columns, Quality Score 100
   Column role detection: Species = target, Id = ID, 4 features
   Histograms render correctly for numeric columns
   Suggestions generate for datasets with issues (mock diabetes data)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
