tanmayrauth commented on issue #1080:
URL: https://github.com/apache/iceberg-go/issues/1080#issuecomment-4453374782

     Yes, you can partition on a substring. The truncate[N] partition transform
does exactly this for strings: it takes the first N characters as the partition
value, so all rows sharing that prefix land in the same partition's data files.
                                                                               
     For your pattern (foo_123, foo_baz_123, bar_123, etc.), something like
truncate[4] would group all foo_ rows into one partition and all bar_ rows into
another. When scanning with a predicate on project_identifier, the engine prunes
partitions whose truncated prefix can't match, eliminating their files without
reading them.
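
     To make the grouping and pruning concrete, here is a minimal Go sketch.
It is an illustration only, not iceberg-go code: truncate, groupByPrefix, and
mayContain are hypothetical helper names, and the sketch assumes the transform
counts Unicode code points and that the predicate is a simple equality.

```go
package main

// truncate mimics the string case of a truncate[N] transform:
// it keeps at most width Unicode code points of s.
func truncate(s string, width int) string {
	r := []rune(s)
	if len(r) <= width {
		return s
	}
	return string(r[:width])
}

// groupByPrefix assigns each value to the partition named by its
// truncated prefix (hypothetical helper, for illustration only).
func groupByPrefix(values []string, width int) map[string][]string {
	parts := make(map[string][]string)
	for _, v := range values {
		key := truncate(v, width)
		parts[key] = append(parts[key], v)
	}
	return parts
}

// mayContain reports whether a partition could hold rows equal to literal.
// A false result means that partition's files can be skipped unread.
func mayContain(partition, literal string, width int) bool {
	return truncate(literal, width) == partition
}
```

     With width 4, the three example values fall into two partitions (foo_ and
bar_), and an equality predicate on foo_baz_123 only needs the foo_ partition.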
                                                                               
     -- In Spark SQL equivalent terms:
     PARTITIONED BY (truncate(4, project_identifier))
                                                                               
     In iceberg-go, this is iceberg.TruncateTransform{Width: N} on the 
partition spec. Pick a width that captures the meaningful prefix structure of 
your values.                                                                    
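
     One way to pick a width is to count how many distinct partitions each
candidate would produce on a sample of your values. The sketch below assumes
you can pull such a sample into a slice; distinctPrefixes is a hypothetical
helper, not part of iceberg-go.

```go
package main

// distinctPrefixes counts how many partitions a truncate transform of the
// given width would create for the sampled values (illustrative helper).
func distinctPrefixes(values []string, width int) int {
	seen := make(map[string]struct{})
	for _, v := range values {
		r := []rune(v)
		if len(r) > width {
			r = r[:width]
		}
		seen[string(r)] = struct{}{}
	}
	return len(seen)
}
```

     A width that yields very few partitions prunes little; one that yields
nearly as many partitions as distinct values tends to produce many small files.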
                                                                               
     Why this helps row group stats: Since truncate preserves order and every
value in a partition's files shares a common prefix, Parquet min/max statistics
become much tighter. You'll get effective row-group skipping even within a
partition's files.
                     
     Complementary strategy - sort order: Setting a sort order on 
project_identifier ensures rows are physically sorted within each file. This 
makes min/max column chunk stats as tight as possible. The combination of 
truncate partitioning + sorted writes is the standard approach for this 
scenario.                     
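
     The effect of sorting on min/max stats can be simulated in a few lines of
Go. This is a toy model, not how a Parquet writer is implemented: row groups
are fixed-size slices of strings, and stats, rowGroupStats, groupsToRead, and
sortedCopy are hypothetical names.

```go
package main

import "sort"

// stats holds the min/max of one simulated row group.
type stats struct{ min, max string }

// rowGroupStats splits values into groups of n rows (n must be > 0)
// and records each group's min and max.
func rowGroupStats(values []string, n int) []stats {
	var out []stats
	for i := 0; i < len(values); i += n {
		end := i + n
		if end > len(values) {
			end = len(values)
		}
		s := stats{min: values[i], max: values[i]}
		for _, v := range values[i+1 : end] {
			if v < s.min {
				s.min = v
			}
			if v > s.max {
				s.max = v
			}
		}
		out = append(out, s)
	}
	return out
}

// groupsToRead counts row groups whose [min, max] range could contain target.
func groupsToRead(groups []stats, target string) int {
	n := 0
	for _, g := range groups {
		if g.min <= target && target <= g.max {
			n++
		}
	}
	return n
}

// sortedCopy returns a sorted copy of values, leaving the input untouched.
func sortedCopy(values []string) []string {
	out := append([]string(nil), values...)
	sort.Strings(out)
	return out
}
```

     Comparing groupsToRead on the raw slice and on sortedCopy of the same
slice shows how overlapping min/max ranges force extra row groups to be read
when the data is unsorted.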
                     
     Alternative - bucket[N]: If your prefixes aren't uniform in length or you
just want even distribution, bucket[N] hashes values into N buckets. Trade-off:
only equality/IN predicates benefit from pruning (range scans get no benefit,
since bucket doesn't preserve order). With ~1000 distinct values, 16–64 buckets
work well.
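
     For intuition about how bucket[N] assigns values, the sketch below hashes
the UTF-8 bytes of a string with 32-bit Murmur3 (x86 variant, seed 0), which
is my understanding of what the Iceberg spec prescribes, then takes the
non-negative hash modulo N. The function names are illustrative; in practice
iceberg-go's own transform should be used rather than this hand-rolled copy.

```go
package main

import (
	"encoding/binary"
	"math/bits"
)

// murmur3x86_32 is a plain implementation of the 32-bit Murmur3 hash.
func murmur3x86_32(data []byte, seed uint32) uint32 {
	const (
		c1 = 0xcc9e2d51
		c2 = 0x1b873593
	)
	h := seed
	n := len(data)
	i := 0
	// Body: process 4-byte little-endian blocks.
	for ; i+4 <= n; i += 4 {
		k := binary.LittleEndian.Uint32(data[i:])
		k *= c1
		k = bits.RotateLeft32(k, 15)
		k *= c2
		h ^= k
		h = bits.RotateLeft32(h, 13)
		h = h*5 + 0xe6546b64
	}
	// Tail: mix in the remaining 1 to 3 bytes, if any.
	var k uint32
	switch n - i {
	case 3:
		k ^= uint32(data[i+2]) << 16
		fallthrough
	case 2:
		k ^= uint32(data[i+1]) << 8
		fallthrough
	case 1:
		k ^= uint32(data[i])
		k *= c1
		k = bits.RotateLeft32(k, 15)
		k *= c2
		h ^= k
	}
	// Finalization: avalanche the bits.
	h ^= uint32(n)
	h ^= h >> 16
	h *= 0x85ebca6b
	h ^= h >> 13
	h *= 0xc2b2ae35
	h ^= h >> 16
	return h
}

// bucket maps a string to a bucket in [0, numBuckets); numBuckets must be > 0.
func bucket(value string, numBuckets int) int {
	h := murmur3x86_32([]byte(value), 0)
	return int(h&0x7fffffff) % numBuckets
}
```

     Because similar strings hash to unrelated buckets, an equality predicate
maps to exactly one bucket, while a range or prefix predicate can touch all of
them.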
                     
     Regarding #931 (dictionary encoding): Dictionary encoding is orthogonal - 
it operates within a single Parquet file, replacing repeated strings with 
integer codes. It helps most when cardinality within a row group is low, which 
sorting naturally achieves. So these strategies stack: partition → sort →
dictionary encode.
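
     As a rough picture of what dictionary encoding does, the Go sketch below
replaces each string with an integer code into a list of distinct values. It
is conceptual only: dictEncode is a hypothetical helper, and Parquet's actual
on-disk layout of dictionary pages and encoded indices is more involved.

```go
package main

// dictEncode returns the distinct values in first-seen order, plus one
// integer code per input value pointing into that dictionary.
func dictEncode(values []string) (dict []string, codes []int) {
	index := make(map[string]int)
	codes = make([]int, 0, len(values))
	for _, v := range values {
		code, ok := index[v]
		if !ok {
			code = len(dict)
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return dict, codes
}
```

     The fewer distinct values a row group holds, the smaller the dictionary
and the cheaper each code, which is why sorted input helps.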


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

