jorisvandenbossche commented on issue #14946: URL: https://github.com/apache/arrow/issues/14946#issuecomment-1353170947
I actually only looked for `FieldPath::Get`, but taking a look at `FieldRef::GetOne` now, that's something we actually _do_ use in the compute module. We use it to get the array to sort by in the kernel that sorts a RecordBatch:

https://github.com/apache/arrow/blob/5ce8d79d7ae4b3864226cc3c5480fa8eba2e571d/cpp/src/arrow/compute/kernels/vector_sort.cc#L1236-L1246

Because `GetOne` doesn't have the "flatten" semantics, this actually causes a bug here. After tweaking the cython SortKey bindings a little bit to allow constructing a sort key with a FieldRef (currently pyarrow only allows specifying the column to sort by as a string; from C++ you can of course already pass a FieldRef), we can see this bug in action by sorting a RecordBatch with a struct column that has a top-level null:

```
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.StructArray.from_arrays(
    [pa.array([5, 3, 4, 2, 1]), pa.array([1, 2, 3, 4, 5])],
    names=['a', 'b'],
    mask=pa.array([False, True, False, False, False])
)
batch = pa.table({"col": arr}).to_batches()[0]

In [4]: batch.sort_by([(pc.field("col", "a"), "ascending")]).to_pandas()
Out[4]:
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2              None
3  {'a': 4, 'b': 3}
4  {'a': 5, 'b': 1}
```

Here the null row lands in the middle of the result, because it is ordered by the value 3 stored in the child array underneath the null. Compare that to sorting the struct array directly (which was just merged in https://github.com/apache/arrow/pull/14781 and does correctly "flatten" the field when you specify a field to sort by):

```
In [5]: pa.table({'col': arr.sort(by="a")}).to_pandas()
Out[5]:
                col
0  {'a': 1, 'b': 5}
1  {'a': 2, 'b': 4}
2  {'a': 4, 'b': 3}
3  {'a': 5, 'b': 1}
4              None
```

In this case the null value is correctly sorted at the end.
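To make the difference between the two results concrete, here is a minimal plain-Python sketch of the same data. It is a model of the semantics only, not pyarrow or Arrow C++ code: the names `sort_indices`, `take_rows`, `raw_keys` and `flat_keys` are hypothetical, and placing nulls last is assumed to match the default shown in the outputs above. "Flattening" here means replacing the child value with a null wherever the parent struct is null before using it as the sort key.

```python
# Plain-Python model of the struct column from the example (not pyarrow code).
child_a = [5, 3, 4, 2, 1]
child_b = [1, 2, 3, 4, 5]
parent_is_null = [False, True, False, False, False]  # True marks a null struct


def sort_indices(keys):
    """Ascending, stable sort of row indices; None keys are placed last."""
    return sorted(
        range(len(keys)),
        key=lambda i: (keys[i] is None, 0 if keys[i] is None else keys[i]),
    )


def take_rows(indices):
    """Materialise struct rows in the given order, honouring parent nulls."""
    return [
        None if parent_is_null[i] else {"a": child_a[i], "b": child_b[i]}
        for i in indices
    ]


# Without flattening: the raw child values are used as sort keys, so the
# value 3 hidden under the null struct still takes part in the ordering.
raw_keys = child_a

# With flattening: the parent's null is propagated into the key first.
flat_keys = [None if null else a for a, null in zip(child_a, parent_is_null)]

unflattened = take_rows(sort_indices(raw_keys))
flattened = take_rows(sort_indices(flat_keys))

print(unflattened)  # null row ends up in the middle
print(flattened)    # null row ends up last
```

With the raw keys the null row is placed third, as in `Out[4]`; with the flattened keys it moves to the end, as in `Out[5]`.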
