Young-Leo opened a new pull request, #816:
URL: https://github.com/apache/tsfile/pull/816

   ### Summary
   
   This PR teaches the Python `TsFileDataFrame` to read **tree-model** TsFiles
   (in addition to the existing table-model support) and, in the same series of
   commits, slims the underlying dataset index so it scales to wide / sparse
   schemas without paying for phantom `(device, field)` cells.
   
   The user-facing dataset surface (`__len__`, `list_timeseries`, `__getitem__`,
   `.loc`, aligned reads) is **unchanged** for both models — tree-model files
   just become loadable through the same API.
   
   ### What changed
   
   **Tree-model support** 
   - Detect the model kind at reader open. An empty table-schema map ⇒ tree
     model; otherwise table model.
   - For tree files, synthesize one virtual `TableEntry`:
     - table name = the shared root segment of every device path
     - tag columns = `_col_1 .. _col_N` (one positional column per remaining
       path segment, padded with `None` for shorter devices)
     - fields     = union of measurements across all devices
   - Per-device measurement ownership is preserved by registering only the
     `(device_id, field_idx)` pairs that are actually written on disk in
     `series_stats_by_ref`, so the dataset never advertises phantom series.
   - Tree-model reads route through `query_table_on_tree` with client-side
     device filtering. This works around two cwrapper limitations on the
     current native build (`query_tree_by_row` rejects multi-segment device
     paths, and successive `query_table_on_tree` calls leak duplicate `col_*`
     columns); both are documented inline at the cwrapper boundary.
   - Tree-mode rendering: drop the leading table column, use `_col_i`
     headers, print `None` tag cells as `"None"`, and surface the model
     marker in the repr header.
   - Mixing table-model and tree-model files in one load set is rejected
     with a clear error.
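
   The synthesis rules above can be sketched in a few lines. This is a minimal
   illustration, not the code in this PR: `VirtualTable`, `detect_model`,
   `synthesize_table`, and `check_single_model` are hypothetical names, device
   paths are assumed to be plain dot-separated strings, and `ValueError` is an
   assumed error type for the mixed-model rejection.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class VirtualTable(NamedTuple):
    """Hypothetical stand-in for the synthesized TableEntry."""
    table_name: str
    tag_columns: List[str]
    fields: List[str]
    # One tag row per device, padded with None up to the longest path.
    device_tags: Dict[str, Tuple[Optional[str], ...]]


def detect_model(table_schemas: Dict[str, object]) -> str:
    # An empty table-schema map is taken to mean a tree-model file.
    return "tree" if not table_schemas else "table"


def check_single_model(models: List[str]) -> str:
    # A load set must contain exactly one model kind.
    kinds = set(models)
    if len(kinds) != 1:
        raise ValueError(f"expected one model kind, got {sorted(kinds)}")
    return kinds.pop()


def synthesize_table(device_fields: Dict[str, List[str]]) -> VirtualTable:
    """Build one virtual table from {device_path: [measurement, ...]}."""
    # Assumption: segments are separated by '.' with no quoting or escaping.
    split = {path: path.split(".") for path in device_fields}
    roots = {segments[0] for segments in split.values()}
    if len(roots) != 1:
        raise ValueError(f"devices do not share one root segment: {sorted(roots)}")
    # Number of tag columns = longest path minus the root segment.
    width = max(len(segments) - 1 for segments in split.values())
    tag_columns = [f"_col_{i}" for i in range(1, width + 1)]
    device_tags = {
        path: tuple(segments[1:] + [None] * (width - len(segments) + 1))
        for path, segments in split.items()
    }
    # Fields are the union of measurements across all devices.
    fields = sorted({name for names in device_fields.values() for name in names})
    return VirtualTable(roots.pop(), tag_columns, fields, device_tags)


table = synthesize_table({
    "root.plant.line1.d1": ["temp"],
    "root.plant.d2": ["temp", "speed"],
})
print(table.table_name, table.tag_columns, table.fields)
print(table.device_tags["root.plant.d2"])
```

   In this toy input the shorter device `root.plant.d2` gets a trailing `None`
   in `_col_3`, which is the cell the tree-mode renderer prints as `"None"`.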
   
   **Dataset index slim-down** 
   - `SeriesStats` becomes a `NamedTuple` (~120 B vs. the previous ~360 B
     per-series dict).
   - `_DerivedCache` removed; lookups computed lazily on top of existing
     state.
   - Per-reader `device_refs: List[List[DeviceRef]]` collapsed into a
     pre-aggregated `device_time_bounds: List[Tuple[Optional[int],
     Optional[int]]]`, so `_query_aligned` reads bounds in O(1).
   - Drop the redundant `series_ref_set` (use `series_ref_map` keys).
   - **Phase 6**: unify table/tree semantics so the table-model branch no
     longer pads `series_stats_by_ref` with empty placeholders for
     schema-declared but never-written cells. The dataset view is now
     strictly “real devices × real fields” in both models.
   - Cleanup: rename `_LogicalIndex` → `_DataFrameCatalog`, shorten the five
     internal field names (`devices` / `device_index` / `device_time_bounds`
     / `series` / `series_shards`), inline the now-trivial
     `iter_owned_series_refs` wrapper.
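
   Two of the items above, the `NamedTuple` record and the pre-aggregated
   bounds, can be illustrated as follows. This is a sketch under assumptions:
   the `SeriesStats` field names are invented for illustration (the PR text
   does not list them), and `aggregate_bounds` with its per-shard
   `(start, end)` input is a hypothetical simplification of how
   `device_time_bounds` might be built.

```python
import sys
from typing import List, NamedTuple, Optional, Tuple


class SeriesStats(NamedTuple):
    # Illustrative field names only; the real record may differ.
    start_time: int
    end_time: int
    count: int


Bounds = Tuple[Optional[int], Optional[int]]


def aggregate_bounds(per_device_ranges: List[List[Tuple[int, int]]]) -> List[Bounds]:
    """Collapse each device's (start, end) ranges into one (min, max) pair."""
    bounds: List[Bounds] = []
    for ranges in per_device_ranges:
        if not ranges:
            bounds.append((None, None))  # device with no recorded range
            continue
        bounds.append((min(start for start, _ in ranges),
                       max(end for _, end in ranges)))
    return bounds


stats = SeriesStats(start_time=0, end_time=99, count=100)
as_dict = stats._asdict()
# Container overhead only; exact byte counts depend on the interpreter build.
print(sys.getsizeof(stats), sys.getsizeof(as_dict))

device_time_bounds = aggregate_bounds([[(0, 10), (5, 20)], []])
print(device_time_bounds[0])  # a single list index, no per-query scan
```

   Aggregating once at load time trades a small amount of detail (the
   individual ranges) for a constant-time lookup on every aligned query.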
   
   ### New public API
   
   - `TsFileDataFrame.model` — read-only model marker (`"table"` or `"tree"`).
   - `TsFileDataFrame.list_timeseries_metadata()` — per-series metadata as a
     flat tabular view (works identically for both models).
   
   ### Compatibility
   
   - No changes to the existing dataset surface. Existing user code that
     loads table-model TsFiles continues to work without modification.
   - No changes to the on-disk format, the cwrapper, or the C++/Java sides.
   - `SeriesStats` integer fields tighten from `Optional[int]` to `int`. The
     surrounding `get_series_info_by_ref` still exposes them as the existing
     dict shape, so callers do not see an API change.
   
   ### Memory impact
   
   Two benchmarks were used because the savings show up in different
   schema shapes.
   
   **Bench A — 30 k devices × 1 field, full density**
   
   | Step | Tracked size | Δ |
   |------|---:|---:|
   | baseline                                    | 81.20 MB | — |
   | `SeriesStats` NamedTuple                    | 70.67 MB | −10.53 MB |
   | `_DerivedCache` removal                     | 59.82 MB | −10.85 MB |
   | `device_time_bounds` aggregation            | 56.40 MB |  −3.42 MB |
   | `series_ref_set` removal                    | 54.40 MB |  −2.00 MB |
   | drop phantom `(device, field)` cells        | 54.40 MB |   0       |
   
   End-to-end: **81.20 → 54.40 MB (−33 %)**. Dropping phantom cells saves
   nothing here because every device writes the single declared field, so
   there are no skipped cells to prune.
   
   **Bench B — 5 k devices × 5 fields × 60 % density (sparse / wide)**
   
   | Component              | Before | After  |
   |------------------------|------:|------:|
   | `series_ref_map`       | 15.93 MB | 9.55 MB |
   | `series_stats_by_ref`  |  7.53 MB | 4.51 MB |
   | **tracked total**      | **26.50 MB** | **16.30 MB** |
   
   Dropping phantom cells alone cuts the tracked total by **38 %** on this
   fixture (26.50 → 16.30 MB); the sparser and wider the schema, the larger
   the win.
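
   The figures in both tables can be cross-checked with a little arithmetic.
   The snippet below uses only numbers quoted above; `relative_change` is a
   helper introduced here for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


# Bench B fixture: 5 k devices x 5 fields at 60 % density.
declared_cells = 5_000 * 5
written_cells = round(declared_cells * 0.60)
phantom_cells = declared_cells - written_cells
print(declared_cells, written_cells, phantom_cells)

# End-to-end change for each bench, in tracked MB.
bench_a = relative_change(81.20, 54.40)
bench_b = relative_change(26.50, 16.30)
print(f"{bench_a:.1%} {bench_b:.1%}")

# Per-component shrink in Bench B, as an after/before ratio.
ref_map_ratio = 9.55 / 15.93
stats_ratio = 4.51 / 7.53
print(f"{ref_map_ratio:.2f} {stats_ratio:.2f}")
```

   Both per-component ratios come out at roughly 0.60, matching the fixture's
   density: the two maps shrink in proportion to the share of cells that were
   actually written.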
   
   ### Testing
   
   - `python -m pytest python/tests/test_tsfile_dataset.py` → 41 / 41 pass.
   - Four new tree-model tests cover: metadata + repr layout, single-series
     read, multi-field aligned read, `list_timeseries_metadata` column
     shape, and the mixed-model load rejection.
   - One new sparse-schema test
     (`test_dataset_omits_table_model_phantom_series_for_skipped_cells`)
     proves Tablet-skipped cells stay out of `list_timeseries`, `__len__`,
     `series_ref_map`, and that `tsdf[skipped_path]` raises `KeyError`.
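
   The behaviour that the sparse-schema test checks can be mimicked on a toy
   index. `ToyCatalog` below is a hypothetical stand-in, not the real
   `_DataFrameCatalog`, and the `device.field` path format is an assumption
   made for the example.

```python
from typing import Dict, Iterable, List, Tuple


class ToyCatalog:
    """Toy index that registers only written (device, field) pairs."""

    def __init__(self, written: Iterable[Tuple[str, str]]):
        # Schema-declared but never-written cells are simply never added.
        self.series_ref_map: Dict[str, Tuple[str, str]] = {
            f"{device}.{field}": (device, field) for device, field in written
        }

    def __len__(self) -> int:
        return len(self.series_ref_map)

    def list_timeseries(self) -> List[str]:
        return sorted(self.series_ref_map)

    def __getitem__(self, path: str) -> Tuple[str, str]:
        # A skipped cell has no entry, so the lookup raises KeyError.
        return self.series_ref_map[path]


# The schema declares fields s1 and s2, but device d2 never wrote s2.
catalog = ToyCatalog([("d1", "s1"), ("d1", "s2"), ("d2", "s1")])
print(len(catalog), catalog.list_timeseries())

try:
    catalog["d2.s2"]
    skipped_lookup_raised = False
except KeyError:
    skipped_lookup_raised = True
print(skipped_lookup_raised)
```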
   

