drtconway opened a new issue, #15819:
URL: https://github.com/apache/datafusion/issues/15819

   ### Is your feature request related to a problem or challenge?
   
   Tables containing (human) genomic locations often have a column for the 
"chromosome" (usually abbreviated to "chrom"), "start" position, and "end" 
position. (See for example 
[BED](https://en.wikipedia.org/wiki/BED_(file_format)) format.)
   
   The (human) chromosome names are usually written `chr1`, `chr2`, ..., 
`chr9`, `chr10`, ..., `chr22`, `chrX`, `chrY`, `chrM`. These are often followed 
by a couple of thousand alternative chromosomes with labels like `GL000220.1`.
   
   The annoying thing is these labels have an intended ordering as shown (with 
all the alternate chromosomes coming after the primary ones), which is 
different to the lexicographical ordering. (So much grief would have been saved 
if the primary chromosome names had been 0-padded!).
   
   Firstly, the lexicographic ordering puts the alternative chromosomes (which 
have names beginning with capital letters) before the primary chromosomes 
(which all have names beginning with `chr`).
   
   Second, even if no alternative chromosomes are mentioned in the data, the 
lexicographic ordering produces `chr1`, `chr10`, `chr11`, ..., `chr19`, `chr2`, 
`chr20`, and so on.
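   
   To make the two problems concrete, here is a minimal sketch using only the 
   Rust standard library. `chrom_rank`, `chrom_cmp` and `sorted_chroms` are 
   hypothetical helpers written for this illustration, not part of DataFusion 
   or arrow. The sketch assumes that any label of the form `chr<number>` is a 
   primary chromosome and that every other label is an alternative one.
   
   ```rust
   use std::cmp::Ordering;
   
   /// Hypothetical helper: rank a label in the intended order.
   /// The first tuple field is 0 for primary chromosomes and 1 for everything
   /// else, so alternative chromosomes always sort after the primary ones.
   fn chrom_rank(name: &str) -> (u8, u32) {
       match name.strip_prefix("chr") {
           Some("X") => (0, 23),
           Some("Y") => (0, 24),
           Some("M") => (0, 25),
           // Assumption: "chr" followed by digits is a numbered primary chromosome.
           Some(rest) => match rest.parse::<u32>() {
               Ok(n) => (0, n),
               Err(_) => (1, 0),
           },
           None => (1, 0),
       }
   }
   
   /// Compare two labels by rank, falling back to the plain string order
   /// so that alternative chromosomes still have a deterministic order.
   fn chrom_cmp(a: &str, b: &str) -> Ordering {
       chrom_rank(a).cmp(&chrom_rank(b)).then_with(|| a.cmp(b))
   }
   
   /// Sort a list of labels into the intended order.
   fn sorted_chroms(mut names: Vec<&str>) -> Vec<&str> {
       names.sort_by(|a, b| chrom_cmp(a, b));
       names
   }
   ```
   
   With this comparator, `chr2` sorts before `chr10` and `GL000220.1` sorts 
   after `chrM`, whereas a plain `sort()` on the strings puts `GL000220.1` 
   first and `chr10` before `chr2`.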
   
   To ease interoperability with other bioinformatics tooling, it would be nice 
to produce output with the intended (non-lexicographical) ordering.
   
   In my schemas, I have been using `DictionaryArray<UInt16Type>` as the array 
type for my `chrom` columns.
   
   My sorting usually looks like:
   
   ```rust
           .sort(vec![
               col("chrom").sort(true, false),
               col("start").sort(true, false),
               col("end").sort(true, false),
           ])?
   ```
   
   ### Describe the solution you'd like
   
   Probably, the most elegant solution for `DictionaryArray` columns would be 
to allow the use of the `key` values as the sort key, rather than mapping 
to the corresponding `values` for comparison.
   
   This would allow the construction of the column to use 
`StringDictionaryBuilder::<UInt16Type>::new_with_dictionary` with the primary 
chromosomes arranged in the desired order.
   
   It seems undesirable to add a parameter to the `col("chrom").sort()` 
invocation, just to handle this case, but it seems to me that this would be 
reasonable as a property of the `DictionaryArray` itself - whether it should 
sort by the `values` (current behaviour), or the `keys` (new behaviour which 
would address this problem). This property would then act orthogonally to the 
`asc` and `nulls_first` arguments to `sort()`.
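   
   The difference between the two behaviours can be sketched with a small 
   stand-in for a dictionary-encoded column. `DictColumn` and its methods are 
   hypothetical and use only the standard library; they are not the arrow 
   `DictionaryArray` API, and they only model the idea of pre-seeding the 
   dictionary so that key order matches the intended order.
   
   ```rust
   /// Hypothetical model of a dictionary-encoded string column:
   /// `keys[i]` is an index into `values`.
   struct DictColumn {
       values: Vec<String>,
       keys: Vec<u16>,
   }
   
   impl DictColumn {
       /// Pre-seed the dictionary so that key order equals the intended order.
       fn with_dictionary(dictionary: &[&str]) -> Self {
           DictColumn {
               values: dictionary.iter().map(|s| s.to_string()).collect(),
               keys: Vec::new(),
           }
       }
   
       /// Append a row, reusing an existing key or adding a new entry at the end.
       fn append(&mut self, label: &str) {
           let key = match self.values.iter().position(|v| v == label) {
               Some(i) => i,
               None => {
                   self.values.push(label.to_string());
                   self.values.len() - 1
               }
           };
           self.keys.push(key as u16);
       }
   
       /// Label stored at a given row.
       fn label(&self, row: usize) -> &str {
           &self.values[self.keys[row] as usize]
       }
   
       /// Row order when comparing the dictionary values (lexicographic).
       fn sort_indices_by_value(&self) -> Vec<usize> {
           let mut idx: Vec<usize> = (0..self.keys.len()).collect();
           idx.sort_by(|&a, &b| self.label(a).cmp(self.label(b)));
           idx
       }
   
       /// Row order when comparing the keys only (the proposed behaviour).
       fn sort_indices_by_key(&self) -> Vec<usize> {
           let mut idx: Vec<usize> = (0..self.keys.len()).collect();
           idx.sort_by_key(|&i| self.keys[i]);
           idx
       }
   }
   ```
   
   In this model, labels missing from the seeded dictionary receive the next 
   free key, so they sort after all of the primary chromosomes when sorting 
   by key.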
   
   A second solution which would perhaps have somewhat more broad applicability 
would be to allow a `sort_with` or `sort_by` method on `Expr` which takes a 
comparator closure/function. I haven't spent enough time working with 
datafusion yet to see exactly how that would look, since the `Expr` level is 
dynamically checked for type consistency with the columns. A simpler 
alternative might be to allow a comparator to be provided on the Array type to 
allow a custom ordering to be defined at the column level.
   
   
   ### Describe alternatives you've considered
   
   The only in-program work-around I have at the moment is to add a 
`chrom_num` column which explicitly numbers the chromosomes in the desired 
order, sort on that, and then drop the column. It's not very elegant.
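   
   For illustration, the work-around amounts to the following three steps, 
   shown here on plain tuples rather than on a DataFusion `DataFrame`. The 
   `Row` alias and `sort_with_chrom_num` are hypothetical names, and ranking 
   unlisted labels after all listed ones is an assumption of this sketch.
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical row type: (chrom, start, end).
   type Row = (String, u64, u64);
   
   /// Sort rows by a temporary `chrom_num` derived from `order`, then drop it.
   fn sort_with_chrom_num(rows: Vec<Row>, order: &[&str]) -> Vec<Row> {
       // Map each listed label to its position in the desired order.
       let rank: HashMap<&str, usize> = order
           .iter()
           .enumerate()
           .map(|(i, name)| (*name, i))
           .collect();
   
       // Step 1: add the temporary `chrom_num` column.
       // Assumption: labels not in `order` rank after all listed labels.
       let mut decorated: Vec<(usize, Row)> = rows
           .into_iter()
           .map(|row| {
               let chrom_num = rank.get(row.0.as_str()).copied().unwrap_or(order.len());
               (chrom_num, row)
           })
           .collect();
   
       // Step 2: sort on chrom_num, then chrom (to group unlisted labels),
       // then start and end.
       decorated.sort_by(|(na, (ca, sa, ea)), (nb, (cb, sb, eb))| {
           (na, ca, sa, ea).cmp(&(nb, cb, sb, eb))
       });
   
       // Step 3: drop the temporary column again.
       decorated.into_iter().map(|(_, row)| row).collect()
   }
   ```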
   
   The other alternative is to simply write out the table in the "wrong" order, 
and fix it with other tooling after the fact. This is quite unwieldy in 
practice.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org