The usual solution is to partition the data on the columns you want to filter by. For Hive tables, for example, you would partition by date and have a separate directory per date.
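To make the directory-per-date idea concrete, here is a minimal sketch using only the Python standard library. It assumes a Hive-style naming convention of `date=YYYY-MM-DD` under a hypothetical `stocks` root; the helper names (`partition_dir`, `prune_partitions`, `build_demo_layout`) and the `part-0.parquet` file name are made up for illustration, and the files are empty placeholders rather than real Parquet data. Query engines such as Hive or Spark do this kind of pruning for you; the point is only that a date-range query can pick its directories from their names without opening any file.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import List
import tempfile


def partition_dir(root: Path, day: date) -> Path:
    """Hive-style partition directory for one day, e.g. root/date=2020-07-01."""
    return root / f"date={day.isoformat()}"


def prune_partitions(root: Path, start: date, end: date) -> List[Path]:
    """Return partition directories whose date lies in [start, end].

    Only directory names are inspected; no data file is opened.
    """
    selected = []
    for child in sorted(root.iterdir()):
        if not child.is_dir() or not child.name.startswith("date="):
            continue
        day = date.fromisoformat(child.name.split("=", 1)[1])
        if start <= day <= end:
            selected.append(child)
    return selected


def build_demo_layout(root: Path, first_day: date, days: int) -> None:
    """Create one partition directory per day, each with an empty placeholder file."""
    for offset in range(days):
        directory = partition_dir(root, first_day + timedelta(days=offset))
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "part-0.parquet").touch()  # placeholder, not real Parquet data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "stocks"
        build_demo_layout(root, date(2020, 7, 1), days=10)
        # Ten daily partitions exist, but only three match the requested range.
        for directory in prune_partitions(root, date(2020, 7, 3), date(2020, 7, 5)):
            print(directory.name)
```

With one directory per day, a six-month query still touches roughly 180 directories, so for long ranges a coarser partition key (for example month) may be a better trade-off; that choice depends on your query patterns.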
If you have a relatively modern version of Parquet, stats and page indices will allow the reader to filter out files based on ranges of values in the file after reading the file footers. Reading the footer takes longer than not reading the file at all, but is much faster than reading the whole file.

On Sat, Jul 18, 2020 at 8:21 AM Yash Ganthe <[email protected]> wrote:

> I would like to store the stock price of a large number of companies in a
> parquet file in the form of a timeseries.
> If I gather the data at the end of 1 Jul, I would be writing a file such
> as:
> 1 Jul 2020, Company1,35
> 1 Jul 2020, Company2,46
> ....
>
> On 2 Jul, I would receive the new prices and would write it in "append"
> mode as:
> 2 Jul 2020, Company1,37
> 2 Jul 2020, Company2,43
> ...
>
> This will result in 2 partition files being created for the same parquet
> file:
> stocks.parquet/
>     part0_stocks.parquet written on 1 Jul
>     part1_stocks.parquet written on 2 Jul
>
> If this continues for years, I will have a large number of partition files
> created, one per day.
> If a client application wants to fetch the timeseries for 6 months, it will
> be reading several files to gather the data and may be inefficient.
>
> Is there a better way to store timeseries data in parquet?
>
