Young-Leo opened a new pull request, #816:
URL: https://github.com/apache/tsfile/pull/816
### Summary
This PR teaches the Python `TsFileDataFrame` to read **tree-model** TsFiles
(in addition to the existing table-model support) and, in the same series of
commits, slims the underlying dataset index so it scales to wide / sparse
schemas without paying for phantom `(device, field)` cells.
The user-facing dataset surface (`__len__`, `list_timeseries`, `__getitem__`,
`.loc`, aligned reads) is **unchanged** for both models — tree-model files
just become loadable through the same API.
### What changed
**Tree-model support**
- Detect the model kind at reader open. An empty table-schema map ⇒ tree
model; otherwise table model.
- For tree files, synthesize one virtual `TableEntry`:
  - table name = the shared root segment of every device path
  - tag columns = `_col_1 .. _col_N` (one positional column per remaining
    path segment, padded with `None` for shorter devices)
  - fields = union of measurements across all devices
- Per-device measurement ownership is preserved by registering only the
`(device_id, field_idx)` pairs that are actually written on disk in
`series_stats_by_ref`, so the dataset never advertises phantom series.
- Tree-model reads route through `query_table_on_tree` with client-side
device filtering. This works around two cwrapper limitations on the
current native build (`query_tree_by_row` rejects multi-segment device
paths, and successive `query_table_on_tree` calls leak duplicate `col_*`
columns); both are documented inline at the cwrapper boundary.
- Tree-mode rendering: drop the leading table column, use `_col_i`
headers, print `None` tag cells as `"None"`, and surface the model
marker in the repr header.
- Mixing table-model and tree-model files in one load set is rejected
with a clear error.
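The virtual-table synthesis described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `VirtualTable` and `synthesize_virtual_table` are hypothetical names, and the sketch assumes device paths are plain dot-separated strings (quoted segments containing dots are not handled).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class VirtualTable:
    """Hypothetical stand-in for the single virtual TableEntry of a tree file."""
    table_name: str
    tag_columns: List[str]
    fields: List[str]
    device_tags: Dict[str, Tuple[Optional[str], ...]]
    written: Set[Tuple[str, int]]  # (device path, field index) pairs seen on disk


def synthesize_virtual_table(devices: Dict[str, List[str]]) -> VirtualTable:
    """`devices` maps a dotted device path to the measurements written under it."""
    if not devices:
        raise ValueError("no devices to synthesize a table from")
    split = {path: path.split(".") for path in devices}
    roots = {segments[0] for segments in split.values()}
    if len(roots) != 1:
        raise ValueError(f"devices do not share one root segment: {sorted(roots)}")

    # One positional tag column per path segment after the root.
    width = max(len(segments) - 1 for segments in split.values())
    tag_columns = [f"_col_{i}" for i in range(1, width + 1)]

    # Fields are the union of measurements, kept in first-seen order.
    fields: List[str] = []
    for measurements in devices.values():
        for name in measurements:
            if name not in fields:
                fields.append(name)
    field_index = {name: i for i, name in enumerate(fields)}

    device_tags: Dict[str, Tuple[Optional[str], ...]] = {}
    written: Set[Tuple[str, int]] = set()
    for path, segments in split.items():
        tags: List[Optional[str]] = list(segments[1:])
        # Shorter devices are padded with None up to the widest path.
        device_tags[path] = tuple(tags + [None] * (width - len(tags)))
        # Register only the pairs that were actually written.
        for name in devices[path]:
            written.add((path, field_index[name]))

    return VirtualTable(roots.pop(), tag_columns, fields, device_tags, written)


vt = synthesize_virtual_table({"root.a.b": ["s1", "s2"], "root.c": ["s2"]})
print(vt.tag_columns, vt.device_tags["root.c"])
```

In this sketch `root.c` never wrote `s1`, so the pair `("root.c", 0)` is absent from `written` even though `s1` appears in the unioned field list. That is the sense in which phantom series are never advertised.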
**Dataset index slim-down**
- `SeriesStats` becomes a `NamedTuple` (~120 B vs. the previous ~360 B
per-series dict).
- `_DerivedCache` removed; lookups computed lazily on top of existing
state.
- Per-reader `device_refs: List[List[DeviceRef]]` collapsed into a
pre-aggregated `device_time_bounds: List[Tuple[Optional[int],
Optional[int]]]`, so `_query_aligned` reads bounds in O(1).
- Drop the redundant `series_ref_set` (use `series_ref_map` keys).
- **Phase 6**: unify table/tree semantics so the table-model branch no
longer pads `series_stats_by_ref` with empty placeholders for
schema-declared but never-written cells. The dataset view is now
strictly “real devices × real fields” in both models.
- Cleanup: rename `_LogicalIndex` → `_DataFrameCatalog`, shorten the five
internal field names (`devices` / `device_index` / `device_time_bounds`
/ `series` / `series_shards`), inline the now-trivial
`iter_owned_series_refs` wrapper.
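As a rough illustration of two of the items above, the sketch below shows a `NamedTuple` record next to its dict form and a pre-aggregation of per-device time bounds. The field names on `SeriesStats` and the `aggregate_bounds` helper are assumptions made for this example; the PR's actual definitions may differ.

```python
import sys
from typing import List, NamedTuple, Optional, Tuple


class SeriesStats(NamedTuple):
    # Field names are assumed for illustration, not taken from the PR.
    start_time: int
    end_time: int
    count: int


def aggregate_bounds(
    per_device_ranges: List[List[Tuple[int, int]]],
) -> List[Tuple[Optional[int], Optional[int]]]:
    """Collapse each device's list of (start, end) ranges into one (min, max) pair."""
    bounds: List[Tuple[Optional[int], Optional[int]]] = []
    for ranges in per_device_ranges:
        if not ranges:
            # A device with no data keeps an explicit "unknown" bound.
            bounds.append((None, None))
            continue
        start = min(s for s, _ in ranges)
        end = max(e for _, e in ranges)
        bounds.append((start, end))
    return bounds


stats = SeriesStats(start_time=0, end_time=999, count=1000)
as_dict = stats._asdict()  # the dict shape stays available to callers that expect it

# Shallow sizes only; exact byte counts vary by Python version.
print(sys.getsizeof(stats), sys.getsizeof(as_dict))
print(aggregate_bounds([[(5, 10), (1, 7)], []]))
```

Once the bounds are aggregated, a lookup is a single list index per device, which matches the O(1) read that `_query_aligned` is described as doing.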
### New public API
- `TsFileDataFrame.model` — read-only model marker (`"table"` or `"tree"`).
- `TsFileDataFrame.list_timeseries_metadata()` — per-series metadata as a
flat tabular view (works identically for both models).
### Compatibility
- No changes to the existing dataset surface. Existing user code that
loads table-model TsFiles continues to work without modification.
- No changes to the on-disk format, the cwrapper, or the C++/Java sides.
- `SeriesStats` integer fields tighten from `Optional[int]` to `int`. The
surrounding `get_series_info_by_ref` still exposes them as the existing
dict shape, so callers do not see an API change.
### Memory impact
Two benchmarks were used because the savings show up in differently
shaped schemas.
**Bench A — 30 k devices × 1 field, full density**
| Step | Tracked size | Δ |
|------|---:|---:|
| baseline | 81.20 MB | — |
| `SeriesStats` NamedTuple | 70.67 MB | −10.53 MB |
| `_DerivedCache` removal | 59.82 MB | −10.85 MB |
| `device_time_bounds` aggregation | 56.40 MB | −3.42 MB |
| `series_ref_set` removal | 54.40 MB | −2.00 MB |
| drop phantom `(device, field)` cells | 54.40 MB | 0 |
End-to-end: **81.20 → 54.40 MB (−33 %)**. Dropping phantom cells saves
nothing here because every device writes the single declared field, so
there are no skipped cells to prune.
**Bench B — 5 k devices × 5 fields × 60 % density (sparse / wide)**
| Component | Before | After |
|------------------------|------:|------:|
| `series_ref_map` | 15.93 MB | 9.55 MB |
| `series_stats_by_ref` | 7.53 MB | 4.51 MB |
| **tracked total** | **26.50 MB** | **16.30 MB** |
Dropping phantom cells alone cuts the tracked total by **38 %** on this
fixture; the sparser and wider the schema, the larger the win.
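The Bench B figures can be sanity-checked with a little arithmetic. The sketch below assumes that "60 % density" means 60 % of the declared `(device, field)` cells are actually written; under that reading, the per-component ratios should land close to 0.60.

```python
devices, fields, density = 5_000, 5, 0.60

declared = devices * fields          # cells the schema declares
written = round(declared * density)  # cells actually written (assumed meaning of density)
phantom = declared - written         # cells the old index padded with placeholders
print(declared, written, phantom)    # 25000 15000 10000

# Reported sizes in MB, copied from the Bench B table: (before, after).
reported = {
    "series_ref_map": (15.93, 9.55),
    "series_stats_by_ref": (7.53, 4.51),
    "tracked total": (26.50, 16.30),
}
ratios = {name: after / before for name, (before, after) in reported.items()}
for name, ratio in ratios.items():
    print(f"{name}: after/before = {ratio:.3f}")

reduction = 1 - ratios["tracked total"]
print(f"tracked total reduction: {reduction:.1%}")  # 38.5%
```

Both per-reference structures shrink to roughly 60 % of their former size, which is what pruning the 10,000 never-written cells would predict.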
### Testing
- `python -m pytest python/tests/test_tsfile_dataset.py` → 41 / 41 pass.
- Four new tree-model tests cover: metadata + repr layout, single-series
read, multi-field aligned read, `list_timeseries_metadata` column
shape, and the mixed-model load rejection.
- One new sparse-schema test
(`test_dataset_omits_table_model_phantom_series_for_skipped_cells`)
proves that Tablet-skipped cells stay out of `list_timeseries`,
`__len__`, and `series_ref_map`, and that `tsdf[skipped_path]` raises
`KeyError`.