I would like to store the stock prices of a large number of companies in a Parquet file as a time series. If I gather the data at the end of 1 Jul, I would write a file such as:

1 Jul 2020, Company1, 35
1 Jul 2020, Company2, 46
....

On 2 Jul, I would receive the new prices and write them in "append" mode as:

2 Jul 2020, Company1, 37
2 Jul 2020, Company2, 43
...

This will result in 2 partition files being created for the same Parquet file:

stocks.parquet/
    part0_stocks.parquet    written on 1 Jul
    part1_stocks.parquet    written on 2 Jul

If this continues for years, I will end up with a large number of partition files, one per day. A client application that wants to fetch the time series for 6 months would have to read many small files to gather the data, which may be inefficient.

Is there a better way to store time series data in Parquet?
