drtconway opened a new issue, #15819: URL: https://github.com/apache/datafusion/issues/15819
### Is your feature request related to a problem or challenge?

Tables containing (human) genomic locations often have a column for the "chromosome" (usually abbreviated to "chrom"), "start" position, and "end" position. (See for example the [BED](https://en.wikipedia.org/wiki/BED_(file_format)) format.)

The (human) chromosome names are usually written `chr1`, `chr2`, ..., `chr9`, `chr10`, ..., `chr22`, `chrX`, `chrY`, `chrM`. These are often followed by a couple of thousand alternative chromosomes with labels like `GL000220.1`.

The annoying thing is that these labels have an intended ordering as shown (with all the alternative chromosomes coming after the primary ones), which differs from the lexicographical ordering. (So much grief would have been saved if the primary chromosome names had been 0-padded!) First, the lexicographic ordering puts the alternative chromosomes (which have names beginning with capital letters) before the primary chromosomes (which all have names beginning with `chr`). Second, even if no alternative chromosomes are mentioned in the data, the lexicographic ordering produces `chr1`, `chr10`, `chr11`, ..., `chr19`, `chr2`, `chr20`, and so on.

To ease interoperability with other bioinformatics tooling, it would be nice to produce output with the intended (non-lexicographical) ordering.

In my schemas, I have been using `DictionaryArray<UInt16Type>` as the array type for my `chrom` columns. My sorting usually looks like:

```rust
// Arguments to `sort` are (asc, nulls_first): ascending, nulls last.
.sort(vec![
    col("chrom").sort(true, false),
    col("start").sort(true, false),
    col("end").sort(true, false),
])?
```

### Describe the solution you'd like

Probably the most elegant solution for `DictionaryArray` columns would be to allow the `key` values to be used as the sort key, rather than mapping to the corresponding `values` for comparison. This would allow the column to be constructed with `StringDictionaryBuilder::<UInt16Type>::new_with_dictionary`, with the primary chromosomes arranged in the desired order.
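
To make the difference between the two orderings concrete, here is a minimal sketch in plain std Rust. It deliberately does not use the arrow types: `DictColumn`, `sorted_by_values`, `sorted_by_keys`, and `example_column` are hypothetical names that only model a dictionary-encoded column as a list of distinct values plus one key per row, with the dictionary assumed to be laid out in the intended chromosome order.

```rust
// Simplified stand-in for a dictionary-encoded string column.
// `values` holds the distinct labels, `keys` holds one index per row.
// This only models the idea; it is NOT the arrow `DictionaryArray` API.
struct DictColumn {
    values: Vec<&'static str>,
    keys: Vec<u16>,
}

impl DictColumn {
    /// Row labels ordered by comparing the decoded strings
    /// (lexicographic, i.e. the current behaviour described above).
    fn sorted_by_values(&self) -> Vec<&'static str> {
        let mut rows: Vec<&'static str> = self
            .keys
            .iter()
            .map(|&k| self.values[k as usize])
            .collect();
        rows.sort();
        rows
    }

    /// Row labels ordered by comparing the keys
    /// (the behaviour proposed in this issue).
    fn sorted_by_keys(&self) -> Vec<&'static str> {
        let mut keys = self.keys.clone();
        keys.sort();
        keys.iter().map(|&k| self.values[k as usize]).collect()
    }
}

/// A small example column whose dictionary is in the intended order.
fn example_column() -> DictColumn {
    DictColumn {
        values: vec!["chr1", "chr2", "chr10", "chrX", "GL000220.1"],
        // Rows, in input order: chr10, GL000220.1, chr2, chrX, chr1
        keys: vec![2, 4, 1, 3, 0],
    }
}

fn main() {
    let col = example_column();
    println!("by values: {:?}", col.sorted_by_values());
    println!("by keys:   {:?}", col.sorted_by_keys());
}
```

Sorting by values puts `GL000220.1` first and `chr10` before `chr2`, while sorting by keys reproduces whatever order the dictionary was built in.
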

It seems undesirable to add a parameter to the `col("chrom").sort()` invocation just to handle this case, but it seems to me that this would be reasonable as a property of the `DictionaryArray` itself: whether it should sort by the `values` (current behaviour) or by the `keys` (new behaviour, which would address this problem). This property would then act orthogonally to the `asc` and `nulls_first` arguments to `sort()`.

A second solution, which would perhaps have broader applicability, would be to allow a `sort_with` or `sort_by` method on `Expr` which takes a comparator closure/function. I haven't spent enough time working with datafusion yet to see exactly how that would look, since the `Expr` level is dynamically checked for type consistency with the columns. A simpler alternative might be to allow a comparator to be provided on the Array type, so that a custom ordering can be defined at the column level.

### Describe alternatives you've considered

The only in-program workaround I have at the moment is to add a `chrom_num` column which explicitly numbers the chromosomes in the desired order, sort on that, and then drop the column. It's not very elegant.

The other alternative is to simply write out the table in the "wrong" order and fix it with other tooling after the fact. This is quite unwieldy in practice.

### Additional context

_No response_