Re: How to incrementally store timeseries in Parquet files for efficient retrieval?

Tim Armstrong Mon, 20 Jul 2020 08:36:03 -0700

The usual solution is to partition the data based on the criteria you want
to filter by. E.g. for Hive tables, you would partition by date and have a
separate directory per date.
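
As a rough sketch of the layout described above (not Hive itself), the snippet below writes one directory per date using Hive-style `date=YYYY-MM-DD` names, then selects only the directories that fall inside a requested date range. To stay dependency-free it writes CSV files as a stand-in for Parquet files. The helper names `write_day` and `dirs_in_range` and the file name `part-0.csv` are hypothetical, chosen only for illustration.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def write_day(root: Path, day: date, rows: list[tuple[str, int]]) -> Path:
    """Write one day's prices into its own Hive-style partition directory."""
    part_dir = root / f"date={day.isoformat()}"
    part_dir.mkdir(parents=True, exist_ok=True)
    out = part_dir / "part-0.csv"  # assumption: CSV stands in for a Parquet file
    with out.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return out


def dirs_in_range(root: Path, start: date, end: date) -> list[Path]:
    """Return partition directories whose date lies in [start, end].

    Only directory names are inspected; no data file is opened.
    """
    selected = []
    for p in sorted(root.glob("date=*")):
        day = date.fromisoformat(p.name.split("=", 1)[1])
        if start <= day <= end:
            selected.append(p)
    return selected


root = Path(tempfile.mkdtemp()) / "stocks"
write_day(root, date(2020, 7, 1), [("Company1", 35), ("Company2", 46)])
write_day(root, date(2020, 7, 2), [("Company1", 37), ("Company2", 43)])

# A query for 2 Jul onwards never touches the 1 Jul directory.
picked = dirs_in_range(root, date(2020, 7, 2), date(2020, 7, 31))
print([p.name for p in picked])
```

The filter is answered from the directory names alone, which is why partitioning on the column you filter by is cheap for the reader.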


If you have a relatively modern version of Parquet, column statistics and page
indices will allow the reader to skip files, or parts of files, whose min/max
value ranges cannot match the filter, after reading only the file footers.
Reading the footer takes longer than not reading the file at all, but is much
faster than reading the whole file.
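
The idea behind statistics-based skipping can be sketched as a simple range-overlap test. This is a simplified model, not the Parquet API: `FooterStats` is a hypothetical stand-in for the min/max values a reader would obtain from each file's footer, and `may_contain` is an illustrative helper name.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FooterStats:
    """Hypothetical stand-in for the min/max statistics kept in a file footer."""
    path: str
    min_date: date
    max_date: date


def may_contain(stats: FooterStats, start: date, end: date) -> bool:
    """True if the file's [min, max] range overlaps the query range [start, end]."""
    return stats.min_date <= end and stats.max_date >= start


# Assumed footer contents for the two daily files from the question.
footers = [
    FooterStats("part0_stocks.parquet", date(2020, 7, 1), date(2020, 7, 1)),
    FooterStats("part1_stocks.parquet", date(2020, 7, 2), date(2020, 7, 2)),
]

# Only files whose range overlaps the query need to be read in full.
to_read = [f.path for f in footers
           if may_contain(f, date(2020, 7, 2), date(2020, 7, 31))]
print(to_read)
```

Every footer is still consulted, which is the cost mentioned above, but the data pages of non-matching files are never read.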

On Sat, Jul 18, 2020 at 8:21 AM Yash Ganthe <[email protected]> wrote:

> I would like to store the stock price of a large number of companies in a
> parquet file in the form of a timeseries.
> If I gather the data at the end of 1 Jul, I would be writing a file such
> as:
> 1 Jul 2020, Company1,35
> 1 Jul 2020, Company2,46
> ....
>
> On 2 Jul, I would receive the new prices and would write it in "append"
> mode as:
> 2 Jul 2020, Company1,37
> 2 Jul 2020, Company2,43
> ...
>
> This will result in 2 partition files being created for the same parquet
> file:
> stocks.parquet/
> part0_stocks.parquet written on 1 Jul
> part1_stocks.parquet written on 2 Jul
>
> If this continues for years, I will have a large number of partition files
> created, one per day.
> If a client application wants to fetch the timeseries for 6 months, it will
> be reading several files to gather the data and may be inefficient.
>
> Is there a better way to store timeseries data in parquet?
>
