Hi all, I'm happy to share some news that connects TsFile with the broader AI/ML world: Hugging Face's `datasets` library now has native, built-in support for the TsFile format. The pull request was merged on June 1st:
https://github.com/huggingface/datasets/pull/8160

For those less familiar with that ecosystem, a quick introduction.

About Hugging Face
------------------
Hugging Face is the most widely used open hub for the AI/ML community. Its Hub hosts hundreds of thousands of openly shared models and datasets, and the companion `datasets` Python library is the standard tool practitioners use to load training data. A single load_dataset(...) call handles downloading, caching, streaming of larger-than-memory data, and conversion to PyTorch / TensorFlow / JAX / NumPy / Arrow. Anything published to the Hub gets free hosting, versioning, and an automatic online preview.

In short: once data is on the Hub in a format `datasets` understands, the whole community can load it in one line -- and TsFile is now one of those formats.

What the integration does
-------------------------
.tsfile files are auto-detected, so loading is simply:

    from datasets import load_dataset
    ds = load_dataset("tsfile", data_files="my_data.tsfile")

The loader is time-series-aware and follows the TsFile table model rather than a generic tabular layout:

- It emits one row per device. TAG columns become scalar strings, while the time column and each FIELD become Arrow list<...> columns holding that device's full series.
- start_time / end_time filters are pushed down to TsFile's internal time index, so only the matching blocks are read from disk.
- It also handles schema evolution across files (column union + IoTDB numeric widening), table/column selection, timestamp unit & timezone, and configurable batching for memory control.

It relies on the `tsfile` PyPI package (pip install "tsfile>=2.3.0"), lazily imported so users who don't touch TsFile data pay no cost.

Documentation
-------------
The change is already merged into main.
The official docs at huggingface.co/docs/datasets are expected to reflect it in about 30 days, but the rendered guide is viewable right now:
https://moon-ci-docs.huggingface.co/docs/datasets/pr_8160/en/tsfile_load

A call to action
----------------
There is already a working example dataset on the Hub -- tsfile/lotsa_data -- which you can load directly with load_dataset("tsfile/lotsa_data").

If you have time-series datasets, please consider publishing them to the Hugging Face Hub as .tsfile files. No conversion is required: they are auto-detected and become usable by anyone with a single load_dataset call.

This is a great opportunity to make TsFile a first-class format for time-series data in the AI community and to grow the public collection of openly available time-series datasets.

Happy to answer any questions.

Best regards,
----------------
Yuan Tian
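
P.S. For anyone who wants a concrete picture of the "one row per device" layout described above, here is a small pure-Python sketch. It does not call `datasets` or `tsfile` at all and is not how the loader is implemented. It only mimics the shape of the rows: the TAG column stays a scalar string, while time and each FIELD become lists. The column names (device_id, temperature), the sample readings, the helper to_device_rows, and the choice of inclusive time bounds are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical long-format readings: (device tag, timestamp in ms, temperature).
# These values are invented for illustration and do not come from a real file.
readings = [
    ("dev_a", 1000, 20.5),
    ("dev_b", 1000, 18.0),
    ("dev_a", 2000, 20.7),
    ("dev_b", 2000, 18.2),
    ("dev_a", 3000, 21.0),
]


def to_device_rows(readings, start_time=None, end_time=None):
    """Group readings into one row per device, with list-valued columns.

    This only illustrates the row shape. It scans every reading in memory,
    whereas the real loader is described as pushing the time filter down
    to TsFile's internal time index. Bounds are treated as inclusive here,
    which is an assumption of this sketch.
    """
    grouped = defaultdict(lambda: {"time": [], "temperature": []})
    for device, ts, temperature in readings:
        if start_time is not None and ts < start_time:
            continue
        if end_time is not None and ts > end_time:
            continue
        grouped[device]["time"].append(ts)
        grouped[device]["temperature"].append(temperature)
    # TAG column stays a scalar string; time and FIELD columns are lists.
    return [
        {"device_id": device, **columns}
        for device, columns in sorted(grouped.items())
    ]


rows = to_device_rows(readings, start_time=2000)
for row in rows:
    print(row)
```

Each printed row holds one device's whole (filtered) series, which is the shape you would then hand to a model as a single sequence rather than reassembling it from many flat rows.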
