How to incrementally store timeseries in Parquet files for efficient retrieval?

Yash Ganthe Sat, 18 Jul 2020 08:22:01 -0700

I would like to store the daily stock prices of a large number of companies
in Parquet as a timeseries.
If I gather the data at the end of 1 Jul, I would be writing a file such as:
1 Jul 2020, Company1,35
1 Jul 2020, Company2,46
....


On 2 Jul, I would receive the new prices and write them in "append" mode
as:
2 Jul 2020, Company1,37
2 Jul 2020, Company2,43
...

This will result in two part files being created under the same Parquet
dataset directory:
stocks.parquet/
    part0_stocks.parquet    (written on 1 Jul)
    part1_stocks.parquet    (written on 2 Jul)

If this continues for years, I will have a large number of partition files
created, one per day.
If a client application wants to fetch the timeseries for 6 months, it will
have to open one file per day in that range to gather the data, which may be
inefficient.

Is there a better way to store timeseries data in parquet?
