kosiew commented on issue #15819: URL: https://github.com/apache/datafusion/issues/15819#issuecomment-2826690238
Thanks for raising this! This is a very relevant use case—particularly in genomics, where the desired sort order of chromosome labels (e.g., `chr1`, `chr2`, `chrX`, `chrY`, `chrM`) doesn’t follow lexicographic rules. ### ✅ Proposed Solution: Use a Sort Key via `MemTable` > add a chrom_num column which explicitly numbers the chromosomes in the desired order, and then sort on that, then drop the column. It's not very elegant. While DataFusion doesn’t yet support custom comparators or sorting dictionary arrays by keys, a clean and efficient workaround is to introduce a numeric **sort key** column using a `MemTable` (or a CSV/Parquet file) as a lookup. #### 🧬 Example Create a `chrom_order` mapping table: ```rust let chrom_map = vec![ ("chr1", 1), ("chr2", 2), ("chrX", 23), ("chrY", 24), ("chrM", 25), ("GL000220.1", 1001), ]; let schema = Arc::new(Schema::new(vec![ Field::new("chrom", DataType::Utf8, false), Field::new("sort_key", DataType::Int32, false), ])); let chrom_array = StringArray::from_iter_values(chrom_map.iter().map(|(c, _)| *c)); let key_array = Int32Array::from_iter_values(chrom_map.iter().map(|(_, k)| *k)); let batch = RecordBatch::try_new( schema.clone(), vec![Arc::new(chrom_array), Arc::new(key_array)], )?; let mem_table = MemTable::try_new(schema, vec![vec![batch]])?; ctx.register_table("chrom_order", Arc::new(mem_table))?; ``` ```sql SELECT v.* FROM variants v JOIN chrom_order c ON v.chrom = c.chrom ORDER BY c.sort_key, v.pos; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org