viirya opened a new pull request, #50327:
URL: https://github.com/apache/arrow/pull/50327

   ### Rationale for this change
   
   `Array.to_pylist()` on list-typed arrays is 2.5–10x slower than converting 
the same array via `to_pandas()` and rebuilding Python lists from the resulting 
numpy arrays, even though `to_pylist` does strictly less work. The cause is the 
per-element conversion loop (`[x.as_py() for x in self]`): every row allocates 
a C++ Scalar (`Array::GetScalar`), a Python Scalar wrapper and, for list types, 
a Python Array wrapper for the row's values slice plus a fresh generator before 
recursing per element. Besides the allocation cost, these GC-tracked wrappers 
repeatedly trigger CPython collections that traverse the ever-growing result 
list (~20% of runtime in a `sample` profile; details in #50326).
   
   This hit Apache Spark when it enabled Arrow-serialized Python UDFs by 
default (apache/spark#56940, apache/spark#56943); working around it via 
`to_pandas()` was rejected there because the pandas detour coerces 
`list<int32>` with nulls to numpy `float64` (`[1., nan, 3.]` instead of `[1, 
None, 3]`).
   
   Benchmarks (macOS arm64, Python 3.11; 2M rows of 2-element lists / 1M rows 
of nested lists):
   
   | benchmark | before | after | speedup |
   |---|---|---|---|
   | `list<string>` to_pylist | 1.93 s | 0.34 s | 5.7x |
   | `list<list<int32>>` to_pylist | 2.10 s | 0.65 s | 3.2x |
   | flat `string` to_pylist (4M) | 0.83 s | 0.05 s | 16x |
   
   For reference, the pandas detour (`to_pandas()` + per-row `tolist()`) takes 
0.75 s on the `list<string>` case, so `to_pylist` goes from 2.5x slower to 
~2.2x faster.
   
   ### What changes are included in this PR?
   
   Bulk `to_pylist` overrides in `array.pxi`:
   
   - `ListArray` / `LargeListArray` / `FixedSizeListArray`: convert the 
referenced range of child values with a single recursive `to_pylist` call, then 
slice the resulting Python list per row using the raw offsets and the validity 
bitmap. No per-row Scalar, Python Array wrapper or generator. `MapArray` 
explicitly keeps the generic scalar-based path (association-tuple / 
`maps_as_pydicts` duplicate-key semantics), as do the list-view types 
(overlapping views must not share sublist objects).
   - `StringArray` / `LargeStringArray`: decode values directly from the data 
buffer (`GetValue` + `PyUnicode_DecodeUTF8`), which matches 
`StringScalar.as_py` (= `str(buf, 'utf8')`) exactly.
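
The two bulk paths above can be modeled in pure Python. This is only a sketch of the idea, not the Cython code in `array.pxi`: the helper names `slice_rows` and `decode_strings` are hypothetical, buffers are represented as plain Python sequences, and `child` is assumed to hold only the referenced range of child values (so offsets are rebased by the first offset).

```python
from typing import Any, List, Optional, Sequence


def slice_rows(child: List[Any], offsets: Sequence[int],
               valid: Sequence[bool]) -> List[Optional[List[Any]]]:
    """Cut an already-converted child list into one Python list per row.

    Hypothetical helper. `offsets` has n + 1 entries for n rows and `child`
    covers offsets[0]..offsets[-1] only, as for a sliced array.
    """
    if len(offsets) < 2:
        return []  # empty array: nothing to slice
    base = offsets[0]
    rows: List[Optional[List[Any]]] = []
    for i in range(len(offsets) - 1):
        if not valid[i]:
            rows.append(None)  # null row, regardless of its offsets
        else:
            rows.append(child[offsets[i] - base:offsets[i + 1] - base])
    return rows


def decode_strings(data: bytes, offsets: Sequence[int],
                   valid: Sequence[bool]) -> List[Optional[str]]:
    """Decode each valid row straight from the UTF-8 data buffer."""
    out: List[Optional[str]] = []
    for i in range(len(offsets) - 1):
        if not valid[i]:
            out.append(None)
        else:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
    return out


# One child conversion, then plain list slicing per row.
rows = slice_rows([1, None, 3, 4], [0, 3, 3, 4], [True, False, True])
# rows == [[1, None, 3], None, [4]]

# "héllo" occupies 6 bytes in UTF-8, so offsets count bytes, not characters.
strings = decode_strings("héllo".encode("utf-8") + b"ok",
                         [0, 6, 6, 8], [True, False, True])
# strings == ["héllo", None, "ok"]
```

Because each row is a fresh slice of the child list, rows never share sublist objects, and integers next to nulls stay `int` rather than being coerced to floats.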
   
   Output is unchanged, including exact element types: `None` stays `None`, 
values inside numeric lists stay Python ints (never floats/NaN), strings/bytes 
are unchanged. `ChunkedArray.to_pylist`, `Table.to_pylist` and 
`ListScalar.as_py` delegate to `Array.to_pylist` and pick up the speedup 
automatically.
   
   Follow-up candidates (not in this PR): leaf fast paths for 
primitive/binary/view types, a bulk path for maps and structs, or a general C++ 
`ToPyList` visitor covering all types.
   
   ### Are these changes tested?
   
   - New `test_to_pylist_bulk_paths` compares the bulk paths against the 
per-scalar conversion (`[x.as_py() for x in arr]`) for 
list/large_list/fixed_size_list/nested/map/string/large_string arrays, 
including sliced, empty and all-null arrays, and asserts exact element types 
for `list<int32>` with nulls.
   - Existing suites pass: `test_array.py`, `test_scalars.py`, 
`test_convert_builtin.py`, `test_table.py` (1209 passed locally).
   - Additionally verified with a randomized differential test (8 leaf types x 
list/large_list/fixed_size_list/map, nested lists, list\<struct\>, list\<map\>, 
slices, both `maps_as_pydicts` modes, multibyte strings) with exact-type 
comparison: no differences.
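
For illustration, the shape of such a randomized differential test with exact-type comparison can be sketched as follows. This is a stand-in model, not the test in the PR: the real comparison is between `to_pylist()` and `[x.as_py() for x in arr]` on Arrow arrays, whereas here both paths (`per_row`, `bulk`) and the generator (`random_case`) are hypothetical pure-Python helpers operating on plain lists.

```python
import random


def per_row(child, offsets, valid):
    """Reference path: build every row element by element."""
    rows = []
    for i in range(len(offsets) - 1):
        if not valid[i]:
            rows.append(None)
            continue
        rows.append([child[j] for j in range(offsets[i], offsets[i + 1])])
    return rows


def bulk(child, offsets, valid):
    """Bulk path: one slice of the child list per valid row."""
    return [child[offsets[i]:offsets[i + 1]] if valid[i] else None
            for i in range(len(offsets) - 1)]


def same_exact(a, b):
    """Equality that also requires identical types at every nesting level.

    Plain `==` is not enough: 1 == 1.0 is True in Python, so an
    int-to-float coercion would go unnoticed.
    """
    if type(a) is not type(b):
        return False
    if isinstance(a, list):
        return len(a) == len(b) and all(same_exact(x, y) for x, y in zip(a, b))
    return a == b


def random_case(rng, max_rows=20, max_len=5):
    """Random offsets, nullable int child values and row validity."""
    n = rng.randint(0, max_rows)
    offsets = [0]
    for _ in range(n):
        offsets.append(offsets[-1] + rng.randint(0, max_len))
    child = [rng.choice([None, rng.randint(-5, 5)]) for _ in range(offsets[-1])]
    valid = [rng.random() > 0.2 for _ in range(n)]
    return child, offsets, valid


def run(trials=200, seed=0):
    """Compare both paths on `trials` random cases; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        child, offsets, valid = random_case(rng)
        assert same_exact(per_row(child, offsets, valid),
                          bulk(child, offsets, valid))
    return trials
```

A fixed seed keeps failures reproducible; the type check in `same_exact` is what would catch a regression such as `[1, None, 3]` turning into `[1.0, None, 3.0]`.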
   
   ### Are there any user-facing changes?
   
   No behavior changes, only performance: in the benchmarks above, `to_pylist()` 
is 3.2x to 5.7x faster on list arrays and about 16x faster on flat string arrays.
   
   * GitHub Issue: #50326
   
   This pull request and its description were written by Isaac.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
