kosiew commented on issue #15819:
URL: https://github.com/apache/datafusion/issues/15819#issuecomment-2826690238

   Thanks for raising this! This is a very relevant use case—particularly in 
genomics, where the desired sort order of chromosome labels (e.g., `chr1`, 
`chr2`, `chrX`, `chrY`, `chrM`) doesn’t follow lexicographic rules.
   
   ### ✅ Proposed Solution: Use a Sort Key via `MemTable`
   
   > add a chrom_num column which explicitly numbers the chromosomes in the 
desired order, and then sort on that, then drop the column. It's not very 
elegant.
   
   While DataFusion doesn’t yet support custom comparators or sorting 
dictionary arrays by keys, a clean and efficient workaround is to introduce a 
numeric **sort key** column using a `MemTable` (or a CSV/Parquet file) as a 
lookup.
   
   #### 🧬 Example
   
   Create a `chrom_order` mapping table:
   
   ```rust
   let chrom_map = vec![
       ("chr1", 1),
       ("chr2", 2),
       ("chrX", 23),
       ("chrY", 24),
       ("chrM", 25),
       ("GL000220.1", 1001),
   ];
   
   let schema = Arc::new(Schema::new(vec![
       Field::new("chrom", DataType::Utf8, false),
       Field::new("sort_key", DataType::Int32, false),
   ]));
   
   let chrom_array = StringArray::from_iter_values(chrom_map.iter().map(|(c, 
_)| *c));
   let key_array = Int32Array::from_iter_values(chrom_map.iter().map(|(_, k)| 
*k));
   
   let batch = RecordBatch::try_new(
       schema.clone(),
       vec![Arc::new(chrom_array), Arc::new(key_array)],
   )?;
   
   let mem_table = MemTable::try_new(schema, vec![vec![batch]])?;
   ctx.register_table("chrom_order", Arc::new(mem_table))?;
   ```
   
   ```sql
   SELECT v.*
   FROM variants v
   JOIN chrom_order c ON v.chrom = c.chrom
   ORDER BY c.sort_key, v.pos;
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to