C-Loftus opened a new issue, #1080: URL: https://github.com/apache/iceberg-go/issues/1080
### Question

Thank you for the great work on iceberg-go!

## Question

Are there any best practices for writing columns of string data that have high cardinality but a finite set of values (i.e. to improve Parquet row group metadata and scan performance)?

## Context

I want to write a column like `project_identifier` via iceberg-go that has many string values such as `foo_123`, `bar_123`, `foo_baz_123`, `foo_bar_123`, `foo_test_123`, and so on. This is a finite set, but it is high cardinality (say 1000+ values). I don't want to transform the data and separate it into more tables if possible.

However, since these are all string values, scans can be rather slow (as I saw with `DELETE`s in https://github.com/apache/iceberg-go/issues/1077) because, to my understanding, strings provide less useful row group statistics.

I was thinking that if the underlying Parquet files were partitioned by the value of `project_identifier` (for instance, Parquet file 1 contains all rows with `foo_*` and Parquet file 2 contains all rows with `bar_*`), then the row group statistics would be much more useful. However, I am unclear on how to accomplish this (i.e. is it possible to partition on a substring?) and on how https://github.com/apache/iceberg-go/pull/931 might affect this once dictionary encoding is added.

Thank you very much
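
The grouping described above can be sketched in plain Go. This is only an illustration of the idea, not iceberg-go API: `truncatePrefix`, `groupRanges`, and `valueRange` are hypothetical helpers, and the prefix width of 3 is an assumption chosen to match the `foo_*` / `bar_*` example. It groups identifiers by a fixed-length prefix and computes the min/max of each group, which is the kind of range a per-file or per-row-group statistic would hold if rows were laid out this way.

```go
package main

import (
	"fmt"
	"sort"
)

// truncatePrefix returns the first width Unicode code points of s.
// Hypothetical helper: it stands in for "partition on a substring".
func truncatePrefix(s string, width int) string {
	r := []rune(s)
	if len(r) <= width {
		return s
	}
	return string(r[:width])
}

// valueRange holds the lexicographic min and max of one group,
// analogous to the min/max statistics kept for a string column.
type valueRange struct {
	Min, Max string
}

// groupRanges buckets ids by their prefix and tracks min/max per bucket.
func groupRanges(ids []string, width int) map[string]valueRange {
	out := make(map[string]valueRange)
	for _, id := range ids {
		key := truncatePrefix(id, width)
		r, ok := out[key]
		if !ok {
			out[key] = valueRange{Min: id, Max: id}
			continue
		}
		if id < r.Min {
			r.Min = id
		}
		if id > r.Max {
			r.Max = id
		}
		out[key] = r
	}
	return out
}

func main() {
	// Sample values taken from the issue text.
	ids := []string{"foo_123", "bar_123", "foo_baz_123", "foo_bar_123", "foo_test_123"}
	ranges := groupRanges(ids, 3) // width 3 is an assumed prefix length

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(ranges))
	for k := range ranges {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: min=%s max=%s\n", k, ranges[k].Min, ranges[k].Max)
	}
}
```

With rows grouped this way, a lookup for a `bar_*` value falls outside the min/max range of the `foo` group, so that group could be skipped. Without grouping, a single file's range would span all prefixes and nothing could be pruned.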
