My thought is that, just as Iceberg has to define partitioning and bucketing, 
it has to define a canonical sort order. In particular, we can’t afford to have 
Spark, Presto, and Hive writing files in different orders. I believe the right 
approach is to define a sort order as a series of columns, where each column is 
either ascending or descending, and to define the natural sort order for each 
type.
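
To make that concrete, here is a rough sketch in Python of a sort order as a 
list of (column, direction) pairs applied on top of a per-type natural order. 
This is illustration only, not Iceberg's API: `SortField`, `natural_compare`, 
and `sort_rows` are made-up names, rows are plain dicts, and null handling is 
left out.

```python
from dataclasses import dataclass
from functools import cmp_to_key
from typing import Any, Dict, List


@dataclass(frozen=True)
class SortField:
    """One entry in a sort order (hypothetical name, not an Iceberg class)."""
    column: str
    ascending: bool = True


def natural_compare(a: Any, b: Any) -> int:
    """Compare two non-null values of the same type in their natural order.

    Assumption for this sketch: strings are compared by their UTF-8 bytes,
    everything else by Python's built-in ordering.
    """
    if isinstance(a, str) and isinstance(b, str):
        a, b = a.encode("utf-8"), b.encode("utf-8")
    return (a > b) - (a < b)


def sort_rows(rows: List[Dict[str, Any]], order: List[SortField]) -> List[Dict[str, Any]]:
    """Sort rows by each field of the sort order in turn."""
    def compare(r1: Dict[str, Any], r2: Dict[str, Any]) -> int:
        for field in order:
            c = natural_compare(r1[field.column], r2[field.column])
            if c != 0:
                # Flip the result for descending columns.
                return c if field.ascending else -c
        return 0

    return sorted(rows, key=cmp_to_key(compare))


rows = [
    {"region": "eu", "ts": 1},
    {"region": "us", "ts": 5},
    {"region": "eu", "ts": 3},
]
order = [SortField("region"), SortField("ts", ascending=False)]
sorted_rows = sort_rows(rows, order)
```

If every engine derives its comparator from the same list of fields and the 
same per-type rules, they should all agree on the order of the rows in a file.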

The hard bit will be if we need to support non-natural sorts of strings, for 
example case-insensitive sorts or the different collations that databases 
support. I’d hope that we could start with the default of UTF-8 byte ordering 
and expand as needed. If you are curious what the different collations look 
like, see 
https://dba.stackexchange.com/questions/94887/what-is-the-impact-of-lc-ctype-on-a-postgresql-database
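
A small Python illustration of why the choice of string ordering matters: the 
same values come out in a different order under UTF-8 byte ordering and under 
a case-insensitive ordering. The function names are made up for this example, 
and `str.casefold` only stands in for a real database collation.

```python
from typing import List


def utf8_byte_order(values: List[str]) -> List[str]:
    """Sort strings by their encoded UTF-8 bytes."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


def case_insensitive_order(values: List[str]) -> List[str]:
    """Sort strings ignoring case (a stand-in for a real collation)."""
    return sorted(values, key=lambda s: s.casefold())


names = ["banana", "Apple", "cherry", "apple", "Zebra"]

by_bytes = utf8_byte_order(names)
by_casefold = case_insensitive_order(names)

print(by_bytes)     # uppercase letters sort before all lowercase ones
print(by_casefold)  # "Zebra" moves to the end
```

A file written under one of these orderings would not look sorted to a reader 
that assumes the other.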

.. Owen

> On Jul 1, 2019, at 4:18 AM, Anton Okolnychyi <aokolnyc...@apple.com.INVALID> 
> wrote:
> 
> Hey folks,
> 
> Iceberg users are advised not only to partition their data but also to sort 
> within partitions by columns in predicates in order to get the best 
> performance. Right now, this process is mostly manual and performed by users 
> before writing.
> I am wondering if we should extend Iceberg metadata so that query engines can 
> do this automatically in the future. We already have `sortColumns` in 
> DataFile but they are not used.
> Do we need a notion of sort columns in TableMetadata?
> Spark’s sort spec is tightly coupled with bucketing and cannot be used alone. 
> However, it seems reasonable to have partitioned and sorted tables without 
> bucketing. How do we see this in Iceberg?
> If we decide to have sort spec in the metadata, do we want to make it part of 
> PartitionSpec or have it separately?
> Thanks,
> Anton
> 
