Hi all,

I'm happy to share some news that connects TsFile with the broader AI/ML
world: Hugging Face's `datasets` library now has built-in support for the
TsFile format. The pull request was merged on June 1st:

  https://github.com/huggingface/datasets/pull/8160

For those less familiar with that ecosystem, a quick introduction.

About Hugging Face
------------------
Hugging Face is the most widely used open hub for the AI/ML community. Its
Hub hosts hundreds of thousands of openly shared models and datasets, and
the companion `datasets` Python library is the standard tool practitioners
use to load training data. A single load_dataset(...) call handles
downloading, caching, streaming of larger-than-memory data, and conversion
to PyTorch / TensorFlow / JAX / NumPy / Arrow. Anything published to the Hub
gets free hosting, versioning, and an automatic online preview. In short:
once data is on the Hub in a format `datasets` understands, the whole
community can load it in one line -- and TsFile is now one of those formats.

What the integration does
-------------------------
.tsfile files are auto-detected, so loading is simply:

  from datasets import load_dataset
  ds = load_dataset("tsfile", data_files="my_data.tsfile")

The loader is time-series-aware and follows the TsFile table model rather
than a generic tabular layout:

  - It emits one row per device. TAG columns become scalar strings, while
    the time column and each FIELD become Arrow list<...> columns holding
    that device's full series.
  - start_time / end_time filters are pushed down to TsFile's internal time
    index, so only the matching blocks are read from disk.
  - It also handles schema evolution across files (column union + IoTDB
    numeric widening), table/column selection, timestamp unit & timezone,
    and configurable batching for memory control.
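
For anyone who prefers a flat, one-record-per-timestamp view, here is a
minimal sketch of unpacking a row shaped as described above. It is plain
Python and not part of the loader. The column names (device_id, time,
temperature, pressure) and the helper explode_device_row are made up for
illustration, and the sketch assumes each FIELD list is aligned
index-for-index with the time list.

```python
from typing import Any, Dict, Iterator, Sequence


def explode_device_row(
    row: Dict[str, Any],
    time_col: str = "time",
    tag_cols: Sequence[str] = ("device_id",),
) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per timestamp from a one-row-per-device record.

    Assumption: every non-tag, non-time column is a list with the same
    length as the time list.
    """
    times = row[time_col]
    field_cols = [c for c in row if c != time_col and c not in tag_cols]
    for i, ts in enumerate(times):
        # TAG values are scalars, so they are copied onto every record.
        record = {tag: row[tag] for tag in tag_cols}
        record[time_col] = ts
        for col in field_cols:
            record[col] = row[col][i]
        yield record


# Hypothetical row: one device, scalar TAG, list-valued time and FIELDs.
example_row = {
    "device_id": "pump-01",
    "time": [1000, 2000, 3000],
    "temperature": [20.5, 20.7, None],
    "pressure": [1.01, 1.02, 1.03],
}

records = list(explode_device_row(example_row))
```

Keeping the series nested, as the loader does, is usually the better fit
for model training; flattening like this is mainly handy for inspection
or for tools that expect long-format tables.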

It relies on the `tsfile` PyPI package (pip install "tsfile>=2.3.0"), lazily
imported so users who don't touch TsFile data pay no cost.
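
For readers unfamiliar with the lazy-import pattern, the general idea can
be sketched as below. This is a generic illustration, not the code in
`datasets`: the helper name lazy_import and its error message are
hypothetical, and the standard-library json module stands in for the
optional dependency so the snippet runs anywhere.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_import(module_name: str, install_hint: str = "") -> ModuleType:
    """Import a module on first use and cache the result.

    Nothing is imported until this function is called, so code paths
    that never need the module never pay for loading it.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        message = f"'{module_name}' is required for this feature."
        if install_hint:
            message += f" Try: {install_hint}"
        raise ImportError(message) from err


def encode(values: list) -> str:
    # The import happens here, at call time, not at module load time.
    json_mod = lazy_import("json")
    return json_mod.dumps(values)


encoded = encode([1, 2, 3])
```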

Documentation
-------------
The change is already merged into main. The official docs at
huggingface.co/docs/datasets are expected to reflect it in about 30 days,
but the rendered guide is viewable right now:

  https://moon-ci-docs.huggingface.co/docs/datasets/pr_8160/en/tsfile_load

A call to action
----------------
There is already a working example dataset on the Hub -- tsfile/lotsa_data
-- which you can load directly with load_dataset("tsfile/lotsa_data").

If you have time-series datasets, please consider publishing them to the
Hugging Face Hub as .tsfile files. No conversion is required: they are
auto-detected and become usable by anyone with a single load_dataset call.
This is a great opportunity to make TsFile a first-class format for
time-series data in the AI community and to grow the public collection of
openly available time-series datasets.

Happy to answer any questions.

Best regards,
Yuan Tian
