Hi ZC C,

Adding partitions to Iceberg tables is easy, and changing them, later on,
is easy as well. The existing data will continue to exist with the
partition that it was initially written with, new data will be written
according to the active partitioning. When you rewrite the data (for
example using the Spark procedure rewrite_data_files
<https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_data_files>)
it will use the new partitioning strategy. Getting back to your question;
the harm is that you need to rewrite the data to benefit from the new
partitioning strategy.

Choosing the right partitioning strategy is something that often evolves
over time. You don't want to be too granular because that will create a lot
of files, which will cause more IO overhead. But also not too coarse since
that will read in a lot of data that you're not interested in. It helps to
look at your queries and see which fields are filtered on, and which are
your candidate partitions (one or a combination of).

Hope this helps,

Kind regards,
Fokko



Op ma 24 apr 2023 om 19:49 schreef ZC C <[email protected]>:

> We now are create a row data table, and my colleague want to add org_id as
> the partition, What is the harm of adding partition to iceberg table?

Reply via email to