tanmayrauth commented on issue #1080:
URL: https://github.com/apache/iceberg-go/issues/1080#issuecomment-4453374782
Yes, you can partition on a substring: the truncate[N] partition transform
does exactly this for strings. It takes the first N characters as the partition
value, so all rows sharing that prefix land in the same partition.
For your pattern (foo_123, foo_baz_123, bar_123, etc.), something like
truncate[4] would group all foo_ rows into one partition and bar_ into another.
When scanning with a predicate on project_identifier, the engine prunes
partitions whose truncated prefix can't match - eliminating files without
reading them.
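
To make the grouping and pruning above concrete, here is a minimal standalone sketch in plain Go. It does not use iceberg-go; truncatePrefix, groupByPrefix and canMatch are hypothetical helper names that only mimic the behaviour described above, and the pruning check covers equality predicates only.

```go
package main

import "fmt"

// truncatePrefix returns the first width Unicode code points of s,
// imitating what truncate[width] does to a string value.
func truncatePrefix(s string, width int) string {
	r := []rune(s)
	if len(r) <= width {
		return s
	}
	return string(r[:width])
}

// groupByPrefix maps each truncated prefix to the values that share it.
func groupByPrefix(values []string, width int) map[string][]string {
	groups := make(map[string][]string)
	for _, v := range values {
		p := truncatePrefix(v, width)
		groups[p] = append(groups[p], v)
	}
	return groups
}

// canMatch reports whether a partition with the given truncated value could
// hold rows equal to want. Partitions where it returns false can be skipped.
func canMatch(partitionValue, want string, width int) bool {
	return partitionValue == truncatePrefix(want, width)
}

func main() {
	values := []string{"foo_123", "foo_baz_123", "bar_123"}
	groups := groupByPrefix(values, 4)
	for _, prefix := range []string{"foo_", "bar_"} {
		fmt.Printf("%q holds %v, read for bar_123: %v\n",
			prefix, groups[prefix], canMatch(prefix, "bar_123", 4))
	}
}
```

A scan for project_identifier = 'bar_123' only has to open the bar_ partition; the foo_ partition is skipped without reading any of its files.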
-- In Spark SQL equivalent terms:
PARTITIONED BY (truncate(4, project_identifier))
In iceberg-go, this is iceberg.TruncateTransform{Width: N} on the
partition spec. Pick a width that captures the meaningful prefix structure of
your values.
Why this helps row group stats: every value in a truncate partition shares the
same N-character prefix, so the Parquet min/max statistics in each file cover a
much narrower range. That keeps row-group skipping effective even within a
partition's files.
Complementary strategy - sort order: Setting a sort order on
project_identifier tells writers that honor it to keep rows physically sorted
within each file. This makes min/max column chunk stats as tight as possible.
Combining truncate partitioning with sorted writes is a common approach for
this scenario.
Alternative - bucket[N]: If your prefixes aren't of uniform length or you
just want even distribution, bucket[N] hashes values into N buckets. Trade-off:
only equality/IN predicates benefit from pruning (no range scan benefit, since
bucket doesn't preserve order). With ~1000 distinct values, 16–64 buckets work
well.
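
The shape of bucketing can be sketched as follows. Note the swap: the Iceberg spec defines bucket with a 32-bit Murmur3 hash, while this example substitutes FNV-1a from the Go standard library purely to stay dependency-free, so the bucket numbers it prints will not match real Iceberg buckets. bucketOf is a hypothetical helper, not an iceberg-go function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucketOf hashes value and maps it to a bucket in [0, numBuckets).
// Assumption: FNV-1a stands in for the Murmur3 hash that Iceberg specifies.
func bucketOf(value string, numBuckets int) int {
	h := fnv.New32a()
	h.Write([]byte(value))
	// Clear the sign bit so the result is non-negative before the modulo.
	return int(h.Sum32()&0x7fffffff) % numBuckets
}

func main() {
	const numBuckets = 16
	for _, v := range []string{"foo_123", "foo_124", "foo_baz_123", "bar_123"} {
		fmt.Printf("%-12s -> bucket %d\n", v, bucketOf(v, numBuckets))
	}
	// An equality predicate hashes to exactly one bucket, so the other
	// buckets can be pruned. Neighbouring values such as foo_123 and
	// foo_124 may land in unrelated buckets, which is why range
	// predicates get no pruning benefit.
}
```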
Regarding #931 (dictionary encoding): Dictionary encoding is orthogonal -
it operates within a single Parquet file, replacing repeated strings with
integer codes. It helps most when cardinality within a row group is low, which
sorting naturally achieves. So these strategies stack: partition → sort →
dictionary encode.
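
For readers unfamiliar with the idea, dictionary encoding can be sketched in a few lines of plain Go. This is a conceptual illustration only, not how any Parquet writer is implemented; dictEncode is a hypothetical helper.

```go
package main

import "fmt"

// dictEncode replaces each string with an integer code that indexes into a
// dictionary of the distinct values, in order of first appearance.
func dictEncode(values []string) (dict []string, codes []int) {
	index := make(map[string]int)
	for _, v := range values {
		code, ok := index[v]
		if !ok {
			code = len(dict)
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return dict, codes
}

func main() {
	// A row group taken from sorted data holds few distinct values.
	rowGroup := []string{"foo_1", "foo_1", "foo_1", "foo_2"}
	dict, codes := dictEncode(rowGroup)
	fmt.Println(dict, codes) // [foo_1 foo_2] [0 0 0 1]
}
```

The fewer distinct values a row group contains, the smaller the dictionary and the more the repeated strings collapse into small integers.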
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]