tanishqgandhi1908 opened a new pull request, #5099:
URL: https://github.com/apache/texera/pull/5099

   ## Motivation
   
   Texera's result pane has historically been a static, page-by-page table 
viewer with a default page size of five rows. Users could glance at operator 
outputs and search by column **name**, but they could not interact with the 
data the way they would in a modern spreadsheet tool — no row-level filtering, 
no sorting, no full-data search, and no way to see, at a glance, *what an 
operator actually did to its input*.
   
   That meant every debugging or exploration session looked roughly the same:
   
   1. Click an operator.
   2. Click through paginated pages.
   3. Switch to the upstream operator.
   4. Click through *its* pages.
   5. Mentally diff the two.
   
   It worked, but it was slow, fiddly, and easy to get wrong on wide or large 
tables.
   
   This PR rethinks the result pane around two ideas:
   
   1. **Treat the result pane like a spreadsheet, not a static table.** Sort, 
filter, search, reorder columns, hide columns, pin columns — all without 
leaving the operator. Make it work at scale by pushing the heavy lifting down 
to Iceberg.
   2. **Make every operator self-explanatory.** Right above the data, show the 
user what changed compared to the upstream operator: row delta, column delta, 
schema diff. So instead of mentally diffing two tables, you *see* the diff 
inline.
   
   ## What changed (story version)
   
   ### Phase 1 — From `nz-table` to ag-grid Community
   
   The old `nz-table` view rendered every column to the DOM and capped at five 
rows per page. That worked for toy data but felt cramped, didn't sort or 
filter, and couldn't survive a 200-column table.
   
   We swapped it for **ag-grid Community** (MIT-licensed, Apache-compatible) 
using the **Infinite Row Model** wired into Texera's existing WebSocket 
pagination protocol via a custom `IDatasource`. Out of the box, the user now 
gets:
   
   - Sort + per-column filter menus
   - Column reorder via drag
   - Column hide/show via a toggle dropdown
   - Column pin (left/right) via header context menu
   - DOM column virtualization so 200-column tables render smoothly
   - Pagination with **auto-fit page size** — resize the dock and the page size 
adjusts to the visible space
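
   The datasource wiring described above could look roughly like the sketch 
below. It is not the PR's code: `GetRowsParams` and `Datasource` are local 
stand-ins that mirror the members of ag-grid's `IGetRowsParams` and 
`IDatasource` used here, while `FetchPage`, `PageResponse`, and 
`makePagedDatasource` are hypothetical names. It also assumes the grid's block 
size equals the page size and that page indices are 1-based.

```typescript
// Local stand-ins for ag-grid's IDatasource / IGetRowsParams, trimmed to the
// members used here so the sketch compiles without the ag-grid package.
interface GetRowsParams {
  startRow: number; // first row index requested (inclusive)
  endRow: number; // last row index requested (exclusive)
  successCallback(rows: unknown[], lastRow?: number): void;
  failCallback(): void;
}
interface Datasource {
  getRows(params: GetRowsParams): void;
}

// Hypothetical transport: sends one pagination request over the WebSocket and
// calls onDone with the response, or with null if the request failed.
interface PageResponse {
  rows: unknown[];
  totalRows: number;
}
type FetchPage = (
  pageIndex: number,
  pageSize: number,
  onDone: (response: PageResponse | null) => void
) => void;

// Adapts page-based fetching to the block-based getRows contract.
// Assumption: the grid's block size equals pageSize and pages are 1-based.
function makePagedDatasource(fetchPage: FetchPage, pageSize: number): Datasource {
  return {
    getRows(params: GetRowsParams): void {
      const pageIndex = Math.floor(params.startRow / pageSize) + 1;
      fetchPage(pageIndex, pageSize, response => {
        if (response === null) {
          params.failCallback();
        } else {
          // Passing the total row count tells the grid where the data ends.
          params.successCallback(response.rows, response.totalRows);
        }
      });
    },
  };
}
```

   Keeping the transport behind a callback type means the same adapter can be 
exercised with an in-memory fake instead of a live WebSocket.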
   
   The grid is themed against Texera's existing Ant Design palette (no garish 
ag-grid defaults), and the per-column stats (Min / Max / Non-Null / category %) 
that lived in the old header are restored via a custom header component — same 
data, better layout.
   
   ### Phase 2 — Backend pushdown
   
   Spreadsheet UX is only useful if it scales. Texera stores operator results 
as **Iceberg / Parquet**, which can prune entire data files by partition + 
min/max stats and push predicates into the Parquet reader. We extended the 
protocol and the storage layer to take advantage of that:
   
   - `ResultPaginationRequest` now carries optional `filters`, `sorts`, 
and `rowSearch` fields.
   - `VirtualDocument` gains `getRangeWithQuery` + `countWithQuery` 
(defaulted to safe fallbacks so non-Iceberg documents keep working).
   - A new `IcebergPredicateBuilder` translates the wire-format 
`ColumnFilter` into Iceberg `Expressions` with **type-aware value parsing** 
per column type (no silent string-coercion bugs).
   - `IcebergDocument` implements both methods: predicate pushdown for ops 
Iceberg supports natively, residual evaluation in memory for `contains` / 
`endsWith` / `rowSearch`, and an in-memory sort capped at 
`storage.result.sort.max-rows` (default 100k).
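
   To illustrate what "type-aware value parsing" means in practice, here is a 
minimal sketch. The real builder is Scala and works on Iceberg types; this 
TypeScript version only shows the idea, and `ColumnType`, `FilterValue`, and 
`parseFilterValue` are hypothetical names with a deliberately small set of 
types.

```typescript
// Hypothetical subset of column types; the real builder works on Iceberg types.
type ColumnType = "integer" | "double" | "boolean" | "string";
type FilterValue = number | boolean | string;

// Parses a raw filter string according to the column type and throws on
// malformed input instead of silently coercing it.
function parseFilterValue(raw: string, type: ColumnType): FilterValue {
  const text = raw.trim();
  switch (type) {
    case "integer": {
      if (!/^[+-]?\d+$/.test(text)) {
        throw new Error(`not an integer: "${raw}"`);
      }
      const parsed = Number(text);
      if (!Number.isSafeInteger(parsed)) {
        throw new Error(`integer out of range: "${raw}"`);
      }
      return parsed;
    }
    case "double": {
      if (!/^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$/.test(text)) {
        throw new Error(`not a number: "${raw}"`);
      }
      return Number(text);
    }
    case "boolean": {
      if (text === "true") return true;
      if (text === "false") return false;
      throw new Error(`not a boolean: "${raw}"`);
    }
    default:
      // String columns keep the value verbatim, including whitespace.
      return raw;
  }
}
```

   Rejecting `"abc"` for an integer column up front is what prevents a filter 
from quietly comparing a number against a string.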
   
   When sort is requested but the matched count exceeds the cap, the backend 
returns rows in scan order with a `sortSkipped` flag, and the UI shows a 
friendly banner explaining how to narrow the filter. (Iceberg cannot push ORDER 
BY into the Parquet reader, so sort is the one place we have to spend JVM heap.)
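
   The split between pushed-down and residual filters, and the capped sort, 
can be sketched as follows. This is a TypeScript illustration of the logic 
only (the backend is Scala), and the filter operator names, the field layout of 
`ColumnFilter`, and the helpers `splitFilters` and `sortWithinCap` are 
assumptions made for the example.

```typescript
// Hypothetical wire-level filter; operator names and fields are assumptions.
type FilterOp =
  | "equals"
  | "lessThan"
  | "greaterThan"
  | "startsWith"
  | "contains"
  | "endsWith";
interface ColumnFilter {
  column: string;
  op: FilterOp;
  value: string;
}

// Per the description above, contains / endsWith are evaluated in memory.
function isResidual(op: FilterOp): boolean {
  return op === "contains" || op === "endsWith";
}

// Separates filters the storage layer can evaluate from those applied after
// the scan.
function splitFilters(filters: ColumnFilter[]): {
  pushdown: ColumnFilter[];
  residual: ColumnFilter[];
} {
  return {
    pushdown: filters.filter(f => !isResidual(f.op)),
    residual: filters.filter(f => isResidual(f.op)),
  };
}

interface SortOutcome<T> {
  rows: T[];
  sortSkipped: boolean;
}

// Sorts a copy of the matched rows unless there are more than maxRows, in
// which case the rows are returned in scan order and flagged.
function sortWithinCap<T>(
  matched: T[],
  compare: (a: T, b: T) => number,
  maxRows: number
): SortOutcome<T> {
  if (matched.length > maxRows) {
    return { rows: matched, sortSkipped: true };
  }
  return { rows: matched.slice().sort(compare), sortSkipped: false };
}
```

   Returning the flag alongside the rows lets the UI decide how to explain 
the unsorted result without a second request.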
   
   ### Phase 3 — Full-data row search
   
   A debounced `Search rows...` input above the grid sends a `rowSearch` 
string down to the backend, which compiles it into a multi-column `contains` 
predicate over all string columns. This is the **first** real "search inside 
the data" experience in the result pane; the existing column-name search 
continues to work alongside it.
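
   The row-search predicate can be pictured as "any string column contains 
the search text". The sketch below shows that semantics in TypeScript; the 
backend evaluates it in Scala, `Row` and `matchesRowSearch` are hypothetical 
names, and case-insensitive matching is an assumption rather than something 
the PR states.

```typescript
type Row = Record<string, unknown>;

// Returns true when any of the given string columns contains the search text.
// Assumption: matching is case-insensitive and an empty search matches all rows.
function matchesRowSearch(row: Row, stringColumns: string[], search: string): boolean {
  const needle = search.trim().toLowerCase();
  if (needle === "") {
    return true;
  }
  return stringColumns.some(column => {
    const value = row[column];
    return typeof value === "string" && value.toLowerCase().includes(needle);
  });
}
```

   For example, searching for `versi` over a `Species` column keeps rows whose 
species is `Iris-versicolor` and drops the rest, while numeric columns are 
never consulted.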
   
   ### Phase 4 — The transformation diff
   
   This is the most ambitious idea: every operator, at a glance, tells you what 
it did.
   
   A compact strip above the grid renders:
   
   - **Left pill**: upstream operator name with its row count and column count 
(taken from the frontend's per-operator cache — no extra backend calls).
   - **Middle**: row delta (e.g. `↓ -149 rows (-99.3%)`, color-coded 
green/red/neutral) and column delta (e.g. `+2 -1 ⇄1 cols` or `5 cols · 
unchanged`).
   - **Right pill**: current operator.
   
   Click the strip and it expands inline (no popup) into a detail drawer with:
   
   - A two-row Before / After bar visualisation of row counts (scaled relative 
to the larger side, with the actual numbers right-aligned for clarity).
   - Coloured tag lists for **Removed**, **Added**, **Type-changed**, and 
**Kept** columns.
   
   For source operators with no input, the strip shows a friendly `▶ Source 
operator` chip. For multi-input operators (joins, unions), it collapses to `⛙ 
Combined from N inputs` and defers the pairwise diff for a future iteration. 
All of this is computed from the data the frontend already maintains in 
`WorkflowResultService`, with **zero new backend round trips**.
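
   The single-input diff behind the strip can be sketched as below. This is an 
illustration under assumptions, not the PR's implementation: `Column`, 
`SchemaDiff`, `diffSchemas`, and `formatRowDelta` are hypothetical names, and 
the wording for a zero delta and for growth is guessed from the one example 
shown above.

```typescript
interface Column {
  name: string;
  type: string;
}
interface SchemaDiff {
  added: string[];
  removed: string[];
  typeChanged: string[];
  kept: string[];
}

// Classifies columns by comparing the upstream schema with the current one,
// matching columns by name.
function diffSchemas(upstream: Column[], current: Column[]): SchemaDiff {
  const upstreamTypes = new Map<string, string>();
  for (const column of upstream) {
    upstreamTypes.set(column.name, column.type);
  }
  const currentNames = new Set<string>(current.map(column => column.name));
  const diff: SchemaDiff = { added: [], removed: [], typeChanged: [], kept: [] };
  for (const column of current) {
    const before = upstreamTypes.get(column.name);
    if (before === undefined) {
      diff.added.push(column.name);
    } else if (before !== column.type) {
      diff.typeChanged.push(column.name);
    } else {
      diff.kept.push(column.name);
    }
  }
  for (const column of upstream) {
    if (!currentNames.has(column.name)) {
      diff.removed.push(column.name);
    }
  }
  return diff;
}

// Formats the row delta in the style of the example above.
function formatRowDelta(upstreamRows: number, currentRows: number): string {
  const delta = currentRows - upstreamRows;
  if (delta === 0) {
    return "rows unchanged"; // wording is an assumption
  }
  const arrow = delta < 0 ? "↓" : "↑";
  const sign = delta > 0 ? "+" : ""; // negative numbers already carry "-"
  if (upstreamRows === 0) {
    return `${arrow} ${sign}${delta} rows`; // no base for a percentage
  }
  const percent = ((delta / upstreamRows) * 100).toFixed(1);
  return `${arrow} ${sign}${delta} rows (${sign}${percent}%)`;
}
```

   With 150 upstream rows and 1 output row, `formatRowDelta(150, 1)` yields 
`↓ -149 rows (-99.3%)`, matching the example in the list above.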
   
   ### Layout — bottom dock instead of floating modal
   
   The result panel itself was a draggable floating popup. We turned it into a 
**fixed bottom dock**: full viewport width, top-edge resize handle for height, 
no drag-to-move, no "return to corner" widget. Clicking a row no longer opens 
a modal; instead, an inline row inspector slides in below the grid with a JSON 
tree view, prev/next/close, and visual selection on the corresponding row in 
the grid.
   
   ## Architectural notes
   
   - **Frontend memory is bounded** regardless of dataset size: ag-grid's row 
and column virtualization keeps the DOM at ~20–30 row nodes, and the page cache 
evicts least-recently-used pages once it holds ~2 000 rows.
   - **The frontend page cache is populated on response** for the unfiltered 
fast path, so paging back and forth costs zero WS round-trips after the first 
visit.
   - **Wire format stays backward-compatible**: `columnOffset` / 
`columnLimit` / `columnSearch` are kept on `ResultPaginationRequest` with 
their defaults for the Python SDK and any external callers. The new frontend 
simply stops setting them; the bare-minimum payload also avoids a Jackson edge 
case where JS `Number.MAX_SAFE_INTEGER` overflows Scala's `Int`.
   - **Filter / sort / rowSearch fields are elided** from the wire when empty, 
so the no-query path is byte-identical to the pre-PR shape.
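
   Eliding empty query fields can be done when the request object is built, 
as in the sketch below. Only `filters`, `sorts`, and `rowSearch` come from this 
description; the other field names, the `ColumnFilter` and `ColumnSort` 
layouts, `GridQuery`, and `buildPaginationRequest` are assumptions for 
illustration.

```typescript
// Field layouts here are assumptions, not the PR's wire format.
interface ColumnFilter {
  column: string;
  op: string;
  value: string;
}
interface ColumnSort {
  column: string;
  direction: "asc" | "desc";
}
interface PaginationRequest {
  operatorID: string;
  pageIndex: number;
  pageSize: number;
  filters?: ColumnFilter[];
  sorts?: ColumnSort[];
  rowSearch?: string;
}
interface GridQuery {
  filters: ColumnFilter[];
  sorts: ColumnSort[];
  rowSearch: string;
}

// Adds query fields only when they carry information, so a request without
// filters, sorts, or search text serializes exactly like the base request.
function buildPaginationRequest(
  operatorID: string,
  pageIndex: number,
  pageSize: number,
  query: GridQuery
): PaginationRequest {
  const request: PaginationRequest = { operatorID, pageIndex, pageSize };
  if (query.filters.length > 0) {
    request.filters = query.filters;
  }
  if (query.sorts.length > 0) {
    request.sorts = query.sorts;
  }
  if (query.rowSearch.trim() !== "") {
    request.rowSearch = query.rowSearch;
  }
  return request;
}
```

   Because absent properties are skipped by `JSON.stringify`, the serialized 
no-query request contains no trace of the new fields.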
   
   ## Risks and mitigations (also covered in the plan doc)
   
   - **Sort beyond 100 k rows**: returned unsorted with a banner; the user 
narrows the filter to get sort back. Spill-to-disk sort is a follow-up.
   - **Filter value typing**: centralized in `IcebergPredicateBuilder` with 
per-Iceberg-type parsers; ag-grid picks the right filter component per column 
type so bad input is rare at the UI level.
   - **Streaming results**: the existing `dirtyPageIndices` hook maps to 
`gridApi.purgeInfiniteCache()` so scroll position stays put while new rows 
land.
   - **Bundle size**: ag-grid adds ~300 KB gzipped. We register only the 
Community modules we use; the result pane is a good candidate for future 
lazy-loading.
   - **License**: ag-grid Community is MIT, which is [Category A under Apache 
policy](https://www.apache.org/legal/resolved.html#category-a). No commercial 
license is used or required.
   
   ## Future ideas this PR enables
   
   The same data plane (per-operator schema + row count cache + WebSocket 
pagination) makes these reasonable follow-ups:
   
   - Sort spill-to-disk via a temp Iceberg sort transform, which would 
eliminate the 100 k cap.
   - Filtered-count caching keyed by `hash(filters, rowSearch)` so count 
doesn't recompute per page.
   - Cross-operator comparison ("diff this op's output against the same op 
from a previous run"), which reuses the schema-diff machinery.
   - Bloom filters or inverted indices for fast row-search on huge string 
columns.
   
   ## Test plan
   
   - [ ] Frontend builds clean (`yarn build`) and lints (`yarn lint`).
   - [ ] Backend Scala compiles (`sbt WorkflowExecutionService/Compile/compile`).
   - [ ] Run the **Iris CSV** sample workflow:
     - [ ] Sort any column → rows reorder across pages.
     - [ ] Filter `SepalLengthCm > 5` via the column header menu → grid + row 
count update; banner stays hidden.
     - [ ] Type into the "Search rows..." box → debounced backend round trip; 
matching rows appear.
     - [ ] Click a row → bottom inspector slides in; prev/next walks rows; × 
closes.
     - [ ] Resize the dock from the top edge → page size auto-adjusts; data 
unchanged.
     - [ ] Click the transformation strip → drawer expands showing schema diff 
with column tags.
   - [ ] On a multi-million-row table, apply a narrowing filter on a 
partitioned column → confirm the backend logs show Iceberg pruning data files.
   - [ ] Force a sort over more than 100 k matched rows → confirm the yellow 
"Too many rows to sort" banner appears and the grid shows scan order.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

