jorisvandenbossche commented on issue #14946:
URL: https://github.com/apache/arrow/issues/14946#issuecomment-1353170947

   I initially only looked for `FieldPath::Get`, but I am now also taking a 
look at `FieldRef::GetOne`, and that is something we actually _do_ use in the 
compute module. We use it to get the array to sort by in the kernel that sorts 
a RecordBatch:
   
   
https://github.com/apache/arrow/blob/5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d/cpp/src/arrow/compute/kernels/vector_sort.cc#L1236-L1246
   
   Because `GetOne` does not have the "flatten" semantics (it does not take 
the validity of the parent struct into account), this actually causes a bug 
here. 
   
   To see this bug in action, I tweaked the cython SortKey bindings a little 
bit to allow constructing a sort key with a FieldRef (currently pyarrow only 
allows specifying the column to sort by as a string; from C++ you can of 
course already pass a FieldRef). With that change, we can sort a RecordBatch 
that has a struct column with a top-level null:
   
   ```
   import pyarrow as pa
   import pyarrow.compute as pc
   arr = pa.StructArray.from_arrays(
       [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])], names=['a', 'b'],
       mask=pa.array([False, True, False, False, False])
   )
   batch = pa.table({"col": arr}).to_batches()[0]
   
   In [4]: batch.sort_by([(pc.field("col", "a"), "ascending")]).to_pandas()
   Out[4]: 
                   col
   0  {'a': 1, 'b': 5}
   1  {'a': 2, 'b': 4}
   2              None
   3  {'a': 4, 'b': 3}
   4  {'a': 5, 'b': 1}
   ```
   
   Compare that to sorting the struct array directly (support for this was 
just merged in https://github.com/apache/arrow/pull/14781, and it does 
correctly "flatten" the field when you specify a field to sort by):
   
   ```
   In [5]: pa.table({'col': arr.sort(by="a")}).to_pandas()
   Out[5]: 
                   col
   0  {'a': 1, 'b': 5}
   1  {'a': 2, 'b': 4}
   2  {'a': 4, 'b': 3}
   3  {'a': 5, 'b': 1}
   4              None
   ```
   
   In this case the null value is correctly sorted at the end.
   

