Hi Iceberg Dev,

We are looking into Iceberg as a data lake solution to replace a legacy
system that has been in place for many years. Our data (~10+ PB in
total) is time-series tabular data. We built a proof of concept earlier
that ended up with a design very similar to Iceberg's, especially the
table spec.


However, our use case has a few special requirements (supported by our
legacy system) that are missing in Iceberg today:

   - Our applications always expect sorted rows (by timestamp) when reading
   the time-series data from the data lake.
   - Our users do not want to deal with table partitioning. They expect
   the storage layer (or the data-lake middle layer) to optimize
   partitioning for them.

Our legacy system supports both by enforcing row order at write time and
by running a background service that consolidates small data files into
larger ones, improving storage usage and query performance. (The system
does merge-on-read to resolve the overlapping time ranges that have not
been consolidated yet.) After we switch to Iceberg, it looks like we
would have to do the following to keep supporting these features:

   1. use a special partition spec that always creates a single partition
   for any table (sketched below),
   2. build a background consolidation service on top of Iceberg's
   compaction API (also sketched below), and
   3. build a new writer (we use Arrow) that enforces the write order.
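
To make point 1 (and the sort order behind point 3) concrete, here is a
rough sketch of the table definition we have in mind, using the Java
table-builder API with an unpartitioned spec plus a timestamp sort
order. Since the SortOrder work is still in flight, please treat the
catalog choice, the table/column names, and the exact builder calls as
illustrative assumptions rather than something we have running:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.SortOrder;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.types.Types;

    public class CreateSortedUnpartitionedTable {
      public static void main(String[] args) {
        Schema schema = new Schema(
            Types.NestedField.required(1, "ts", Types.TimestampType.withZone()),
            Types.NestedField.optional(2, "value", Types.DoubleType.get()));

        // Point 1: a "single partition for any table" is just the
        // unpartitioned spec.
        PartitionSpec spec = PartitionSpec.unpartitioned();

        // Point 3: declare the timestamp order our writer would enforce.
        SortOrder order = SortOrder.builderFor(schema).asc("ts").build();

        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "/tmp/warehouse");
        Table table = catalog
            .buildTable(TableIdentifier.of("db", "timeseries"), schema)
            .withPartitionSpec(spec)
            .withSortOrder(order)
            .create();
      }
    }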
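
For point 2, we were imagining a thin scheduler that periodically runs
the existing rewrite action on each table, roughly like this (assuming
the Spark-based actions API; the 512 MB target file size is only an
illustrative value):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class ConsolidateSmallFiles {
      // Bin-pack small files into larger ones; our service would call
      // this on a schedule for each table that accumulated small files.
      public static void run(SparkSession spark, Table table) {
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .option("target-file-size-bytes",
                String.valueOf(512L * 1024 * 1024))
            .execute();
      }
    }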

Would that be too much customization on top of what Iceberg offers
today? Or would you consider this a legitimate use case for Iceberg in
the future?


We noticed many ongoing efforts around topics like SortOrder,
merge-on-read, row-level deletes, etc. that seem very relevant. We are
happy to contribute to the community if our use case makes sense for
Iceberg.


Thanks,

Yi
