My thought is that, just as Iceberg has to define partitioning and bucketing, it has to define a canonical sort order. In particular, we can’t afford to have Spark, Presto, and Hive writing files in different orders. I believe the right approach is to define a sort order as a series of columns, where each column is either ascending or descending, and to define the natural sort order for each type.
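To make that concrete, here is a rough sketch of a sort order as a list of columns with a direction each. The names (`SortField`, `make_comparator`, `sort_rows`) are made up for illustration and are not Iceberg API. It also assumes Python's built-in comparison is the "natural" order for every value and that there are no nulls, which a real spec would have to pin down.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class SortField:
    """One entry in the sort order (hypothetical name, not an Iceberg class)."""
    column: str
    ascending: bool = True


def natural_compare(a: Any, b: Any) -> int:
    # Assumption: Python's own ordering stands in for the per-type natural order.
    return (a > b) - (a < b)


def make_comparator(sort_order: List[SortField]) -> Callable[[Row, Row], int]:
    def compare(left: Row, right: Row) -> int:
        # Compare column by column; the first non-equal column decides.
        for field in sort_order:
            result = natural_compare(left[field.column], right[field.column])
            if result != 0:
                return result if field.ascending else -result
        return 0

    return compare


def sort_rows(rows: List[Row], sort_order: List[SortField]) -> List[Row]:
    return sorted(rows, key=cmp_to_key(make_comparator(sort_order)))


if __name__ == "__main__":
    rows = [
        {"day": 2, "name": "a"},
        {"day": 1, "name": "b"},
        {"day": 2, "name": "c"},
    ]
    order = [SortField("day", ascending=True), SortField("name", ascending=False)]
    for row in sort_rows(rows, order):
        print(row)
```

The point is only that if every engine builds its comparator from the same stored list of (column, direction) pairs and the same per-type natural order, they will all write rows in the same order.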
The hard bit will be if we need to support non-natural sorts of strings, for example case-insensitive sorts or the different collations that databases support. I’d hope that we could start with the default of UTF-8 byte ordering and expand as needed.

If you are curious what the different collations look like:
https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database

.. Owen

> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> wrote:
>
> Hey folks,
>
> Iceberg users are advised not only to partition their data but also to sort
> within partitions by columns in predicates in order to get the best
> performance. Right now, this process is mostly manual and performed by users
> before writing.
>
> I am wondering if we should extend Iceberg metadata so that query engines can
> do this automatically in the future. We already have `sortColumns` in
> DataFile but they are not used.
>
> Do we need a notion of sort columns in TableMetadata?
>
> Spark’s sort spec is tightly coupled with bucketing and cannot be used alone.
> However, it seems reasonable to have partitioned and sorted tables without
> bucketing. How do we see this in Iceberg?
>
> If we decide to have sort spec in the metadata, do we want to make it part of
> PartitionSpec or have it separately?
>
> Thanks,
> Anton