kunwp1 opened a new pull request, #5112:
URL: https://github.com/apache/texera/pull/5112

   ## Summary
   
   The dataset file previewer (`UserDatasetFileRendererComponent`) previously 
identified files purely by extension and showed *"Preview of the file type is 
currently not supported"* for anything outside a small allow-list. With this 
PR, the previewer identifies and describes a much wider set of formats and 
surfaces rich per-format metadata.
   
   ### What changed
   
   - **Magic-byte detection**: replaces extension-only guessing. Uses the 
[`file-type`](https://www.npmjs.com/package/file-type) library (MIT) for ~100 
common formats, plus hand-rolled signatures for Parquet (`PAR1`), Arrow 
(`ARROW1`), HDF5 (`\x89HDF\r\n\x1a\n`), NumPy `.npy` (`\x93NUMPY`), GGUF 
(`GGUF`), and Python pickle (`\x80` followed by a protocol byte `\x02` to `\x05`). Extension-based refinement 
disambiguates ZIP containers (PyTorch `.pt`/`.pth`, Keras `.keras`, NumPy 
`.npz`) and gzipped R `.rds`. Text sniffing adds FASTA, FASTQ, VCF on top of 
the existing JSON / CSV / Markdown heuristics.
   
   - **Lightweight header parsing for ML formats**:
     - NumPy `.npy` → dtype, shape, byte-order, Fortran/C order
     - Safetensors → tensor count, total parameters, dtype breakdown, largest 
tensor, `__metadata__`
     - GGUF → version, tensor count, metadata KV count
   
   - **Rich metadata per type** displayed as a metadata strip above the preview:
     - **CSV / XLSX**: inferred column types (`integer` / `double` / `boolean` 
/ `date` / `string`) and null counts shown directly under each column header in 
the data table; row & column counts; sheet count for XLSX
     - **JSON**: top-level type, item/key count, max nesting depth, per-key 
types
     - **PDF**: version, page count, `/Info` dictionary (Title, Author, 
Creator, Producer), encryption flag — rendered in `<iframe>`
     - **Images**: dimensions, aspect ratio (async via `<img>.onload`)
     - **Video / audio**: duration + resolution (async via `loadedmetadata`)
     - **FASTA**: total bases, GC content (skipped for proteins), min/max/avg 
sequence length
     - **VCF**: sample count parsed from `#CHROM` header, distinct chromosomes
     - **Single-cell / R**: AnnData (`.h5ad`), Seurat (`.h5seurat`, `.rds`), 
Loom — identification + "how to load" hint
   
   - **Memory-safe rendering**: text/CSV/JSON parsing is bounded at 10 MB 
(`getPreviewSlice`) to avoid browser OOM on large files. A warning banner 
appears when truncation occurs; truncation-affected stats 
(`sequenceCountIsExact`, `variantCountIsExact`) flip accordingly. 
`turnOffAllDisplay` now clears `textContent` / `tableContent` / `currentFile` 
so switching files reclaims memory. Per-MIME size cap raised to 1 GB from the 
prior 1–50 MB.
   
   - **Async safety**: `ChangeDetectorRef` injected and `markForCheck()` called 
from media `loadedmetadata` / `<img>.onload` callbacks, preserving the existing 
default change-detection strategy while supporting an eventual OnPush migration.
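
   The hand-rolled signatures and the extension-based refinement listed under magic-byte detection could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `sniffFormat`, `refineHdf5`, and the `SniffedFormat` type are hypothetical names, and the real component may order or combine its checks differently.

```typescript
// Hypothetical names throughout; the PR's actual implementation is not shown in this description.
type SniffedFormat = "parquet" | "arrow" | "hdf5" | "npy" | "gguf" | "pickle";

// Converts an ASCII string to its byte values.
function ascii(s: string): number[] {
  return Array.from(s, c => c.charCodeAt(0));
}

// Returns true when `bytes` begins with exactly the bytes in `signature`.
function startsWith(bytes: Uint8Array, signature: number[]): boolean {
  if (bytes.length < signature.length) return false;
  return signature.every((b, i) => bytes[i] === b);
}

// Leading-byte signatures named in the PR description.
const SIGNATURES: Array<[SniffedFormat, number[]]> = [
  ["parquet", ascii("PAR1")],
  ["arrow", ascii("ARROW1")],
  ["hdf5", [0x89, ...ascii("HDF\r\n"), 0x1a, 0x0a]],
  ["npy", [0x93, ...ascii("NUMPY")]],
  ["gguf", ascii("GGUF")],
];

function sniffFormat(bytes: Uint8Array): SniffedFormat | undefined {
  for (const [format, signature] of SIGNATURES) {
    if (startsWith(bytes, signature)) return format;
  }
  // Pickle protocols 2 through 5 start with the PROTO opcode 0x80 followed by the protocol number.
  if (bytes.length >= 2 && bytes[0] === 0x80 && bytes[1] >= 2 && bytes[1] <= 5) {
    return "pickle";
  }
  return undefined;
}

// HDF5-based single-cell formats share the HDF5 signature, so the extension picks the sub-type.
function refineHdf5(fileName: string): "anndata" | "h5seurat" | "loom" | "hdf5" {
  const lower = fileName.toLowerCase();
  if (lower.endsWith(".h5ad")) return "anndata";
  if (lower.endsWith(".h5seurat")) return "h5seurat";
  if (lower.endsWith(".loom")) return "loom";
  return "hdf5";
}
```

   For example, `sniffFormat(new Uint8Array([0x50, 0x41, 0x52, 0x31]))` yields `"parquet"`, and a buffer that matches no signature yields `undefined`, at which point a caller would presumably fall back to `file-type` or text sniffing.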
   
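   The `.npy` header fields named in the list above (dtype, shape, byte order, Fortran/C order) can be read without loading the array data. The sketch below follows the published NPY layout: magic string, two version bytes, a little-endian header length, then a header written as a Python dict literal. `parseNpyHeader` and `NpyHeader` are hypothetical names, and the regex-based extraction assumes an ASCII header with a simple, non-structured dtype.

```typescript
// Hypothetical shape for the parsed result; not taken from the PR.
interface NpyHeader {
  version: string;
  dtype: string;
  byteOrder: "little" | "big" | "native" | "not applicable" | "unknown";
  fortranOrder: boolean;
  shape: number[];
}

const NPY_MAGIC = [0x93, 0x4e, 0x55, 0x4d, 0x50, 0x59]; // "\x93NUMPY"

function parseNpyHeader(bytes: Uint8Array): NpyHeader {
  if (bytes.length < 12 || !NPY_MAGIC.every((b, i) => bytes[i] === b)) {
    throw new Error("not an .npy file");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const major = bytes[6];
  const minor = bytes[7];
  // Version 1.x stores the header length as a uint16, later versions as a uint32 (both little-endian).
  const headerStart = major === 1 ? 10 : 12;
  const headerLength = major === 1 ? view.getUint16(8, true) : view.getUint32(8, true);
  const headerEnd = Math.min(headerStart + headerLength, bytes.length);

  // The header is a Python dict literal; decode it byte by byte (assumes ASCII).
  let text = "";
  for (let i = headerStart; i < headerEnd; i++) {
    text += String.fromCharCode(bytes[i]);
  }

  const descr = /'descr':\s*'([^']*)'/.exec(text);
  const fortran = /'fortran_order':\s*(True|False)/.exec(text);
  const shape = /'shape':\s*\(([^)]*)\)/.exec(text);
  if (!descr || !fortran || !shape) {
    throw new Error("unsupported or malformed .npy header");
  }

  // The first character of a simple dtype string encodes the byte order.
  const orderByPrefix: Record<string, NpyHeader["byteOrder"]> = {
    "<": "little",
    ">": "big",
    "=": "native",
    "|": "not applicable",
  };
  return {
    version: `${major}.${minor}`,
    dtype: descr[1],
    byteOrder: orderByPrefix[descr[1].charAt(0)] ?? "unknown",
    fortranOrder: fortran[1] === "True",
    // "(3, 4)" -> [3, 4]; "(3,)" -> [3]; "()" -> [] for a scalar.
    shape: shape[1].split(",").map(s => s.trim()).filter(s => s.length > 0).map(Number),
  };
}
```

   Because only the first few hundred bytes are inspected, a helper like this works on a bounded slice of the file and never needs the tensor payload.
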
   ### Files changed
   
   - `frontend/src/app/dashboard/component/user/user-dataset/user-dataset-explorer/user-dataset-file-renderer/user-dataset-file-renderer.component.ts`: detection logic, parsers, render dispatch, metadata getter
   - `…/user-dataset-file-renderer.component.html`: metadata strip, PDF 
iframe, truncation banner, column-type tags on table headers
   - `…/user-dataset-file-renderer.component.scss`: metadata pill / column tag 
styles
   - `…/user-dataset-file-renderer.component.spec.ts`: 28 new tests (30 total)
   - `frontend/package.json`, `frontend/yarn.lock`: add the `file-type` dependency (MIT)
   
   ## Test plan
   
   - [x] `yarn ng test 
--include="**/user-dataset-file-renderer.component.spec.ts" --watch=false` — 
**30 / 30 passing** (existing 2 retained, 28 new covering magic-byte detection, 
extension refinement, NumPy/Safetensors/GGUF header parsing, and column type 
inference)
   - [ ] Frontend visual review: open various file types in the dataset 
previewer and verify the metadata strip + column type tags render
   - [ ] Before/after screenshots / GIFs *(not included in this draft; per 
AGENTS.md these should be added before merge)*
   
   ## Notes for reviewers
   
   - This is exploratory hackathon work; **a tracking issue should be filed 
before merge** per AGENTS.md.
   - The 1 GB preview limit still triggers a full file download from the 
dataset service. A follow-up could add HTTP Range request support so 
identify-only formats (Parquet, HDF5, pickle, model containers) fetch only the 
first 64 KB.
   - HDF5 sub-types (`.h5ad` / `.h5seurat` / `.loom`) are distinguished by 
extension because they share identical magic bytes; deep parsing would need an 
HDF5 reader (e.g. h5wasm) which is intentionally not included.
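
   The Range-request follow-up mentioned above might look like the sketch below. It assumes the dataset service honors the standard HTTP `Range` header, which this PR does not establish. `fetchHead`, `rangeHeader`, and the `MinimalFetch` type are hypothetical names, and the fetch implementation is passed in so the helper does not depend on a particular HTTP client.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings.
interface MinimalResponse {
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}
type MinimalFetch = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<MinimalResponse>;

// Builds a Range header value covering the first `byteCount` bytes (byte ranges are inclusive).
function rangeHeader(byteCount: number): string {
  return `bytes=0-${byteCount - 1}`;
}

// Returns at most `byteCount` leading bytes of the resource at `url`.
async function fetchHead(
  url: string,
  byteCount: number,
  fetchImpl: MinimalFetch
): Promise<Uint8Array> {
  const response = await fetchImpl(url, { headers: { Range: rangeHeader(byteCount) } });
  // 206 means the server honored the range; 200 means it ignored it and sent the whole body.
  if (response.status !== 206 && response.status !== 200) {
    throw new Error(`unexpected HTTP status ${response.status}`);
  }
  const body = new Uint8Array(await response.arrayBuffer());
  return body.subarray(0, byteCount);
}
```

   In a browser, the global `fetch` should be usable as `fetchImpl`, for example `fetchHead(url, 64 * 1024, fetch)` for the 64 KB prefix suggested above. Note that a `200` response still transfers the full file, so the saving only materializes when the server answers with `206`.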
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
