viirya opened a new pull request, #50327: URL: https://github.com/apache/arrow/pull/50327
### Rationale for this change

`Array.to_pylist()` on list-typed arrays is 2.5–10x slower than converting the same array via `to_pandas()` and rebuilding Python lists from the resulting numpy arrays, even though `to_pylist` does strictly less work.

The cause is the per-element conversion loop (`[x.as_py() for x in self]`): every row allocates a C++ Scalar (`Array::GetScalar`), a Python Scalar wrapper and, for list types, a Python Array wrapper for the row's values slice plus a fresh generator before recursing per element. Besides the allocation cost, these GC-tracked wrappers repeatedly trigger CPython collections that traverse the ever-growing result list (~20% of runtime in a `sample` profile; details in #50326).

This hit Apache Spark when it enabled Arrow-serialized Python UDFs by default (apache/spark#56940, apache/spark#56943); working around it via `to_pandas()` was rejected there because the pandas detour coerces `list<int32>` with nulls to numpy `float64` (`[1., nan, 3.]` instead of `[1, None, 3]`).

Benchmarks (macOS arm64, Python 3.11; 2M rows of 2-element lists / 1M rows of nested lists):

| benchmark | before | after | speedup |
|---|---|---|---|
| `list<string>` to_pylist | 1.93 s | 0.34 s | 5.7x |
| `list<list<int32>>` to_pylist | 2.10 s | 0.65 s | 3.2x |
| flat `string` to_pylist (4M) | 0.83 s | 0.05 s | 16x |

For reference, the pandas detour (`to_pandas()` + per-row `tolist()`) takes 0.75 s on the `list<string>` case, so `to_pylist` goes from 2.5x slower to ~2.2x faster.

### What changes are included in this PR?

Bulk `to_pylist` overrides in `array.pxi`:

- `ListArray` / `LargeListArray` / `FixedSizeListArray`: convert the referenced range of child values with a single recursive `to_pylist` call, then slice the resulting Python list per row using the raw offsets and the validity bitmap. No per-row Scalar, Python Array wrapper or generator. `MapArray` explicitly keeps the generic scalar-based path (association-tuple / `maps_as_pydicts` duplicate-key semantics), as do the list-view types (overlapping views must not share sublist objects).
- `StringArray` / `LargeStringArray`: decode values directly from the data buffer (`GetValue` + `PyUnicode_DecodeUTF8`), which matches `StringScalar.as_py` (= `str(buf, 'utf8')`) exactly.

Output is unchanged, including exact element types: `None` stays `None`, values inside numeric lists stay Python ints (never floats/NaN), strings/bytes are unchanged.

`ChunkedArray.to_pylist`, `Table.to_pylist` and `ListScalar.as_py` delegate to `Array.to_pylist` and pick up the speedup automatically.

Follow-up candidates (not in this PR): leaf fast paths for primitive/binary/view types, a bulk path for maps and structs, or a general C++ `ToPyList` visitor covering all types.

### Are these changes tested?

- New `test_to_pylist_bulk_paths` compares the bulk paths against the per-scalar conversion (`[x.as_py() for x in arr]`) for list/large_list/fixed_size_list/nested/map/string/large_string arrays, including sliced, empty and all-null arrays, and asserts exact element types for `list<int32>` with nulls.
- Existing suites pass: `test_array.py`, `test_scalars.py`, `test_convert_builtin.py`, `test_table.py` (1209 passed locally).
- Additionally verified with a randomized differential test (8 leaf types x list/large_list/fixed_size_list/map, nested lists, list\<struct\>, list\<map\>, slices, both `maps_as_pydicts` modes, multibyte strings) with exact-type comparison: no differences.

### Are there any user-facing changes?

No behavior changes, only performance: `to_pylist()` on list-like and string arrays is several times faster.

* GitHub Issue: #50326

This pull request and its description were written by Isaac.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
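
For readers who want a concrete picture of the bulk list path described under "What changes are included in this PR?", here is a minimal pure-Python sketch. It is not the Cython code from `array.pxi`: the function names (`bulk_list_to_pylist`, `per_row_to_pylist`) are hypothetical, and the array is modelled as a plain offsets sequence, a per-row validity sequence and an already-materialized child list, which is an assumption made only so the example runs without pyarrow. It shows the two steps the description names: convert the referenced child range once, then slice that flat list per row using the offsets and validity.

```python
from typing import Any, List, Optional, Sequence


def bulk_list_to_pylist(
    offsets: Sequence[int],
    validity: Sequence[bool],
    child_values: Sequence[Any],
) -> List[Optional[list]]:
    """Bulk conversion: one pass over the child values, then per-row slicing.

    Hypothetical model: `offsets` has n_rows + 1 entries, `validity[i]` is
    False for a null row, `child_values` stands in for the child array.
    """
    n_rows = len(offsets) - 1
    if n_rows <= 0:
        return []
    start, end = offsets[0], offsets[-1]
    # Step 1: convert only the child range this (possibly sliced) array
    # references. In the PR this is a single recursive to_pylist call.
    flat = list(child_values[start:end])
    # Step 2: cut the flat list into rows; null rows become None.
    out: List[Optional[list]] = []
    for i in range(n_rows):
        if not validity[i]:
            out.append(None)
        else:
            out.append(flat[offsets[i] - start:offsets[i + 1] - start])
    return out


def per_row_to_pylist(offsets, validity, child_values):
    """Reference path: build every row element by element."""
    return [
        [child_values[j] for j in range(offsets[i], offsets[i + 1])]
        if validity[i] else None
        for i in range(len(offsets) - 1)
    ]


# A sliced array: the first two child values are not referenced.
offsets = [2, 4, 4, 5]
validity = [True, False, True]
child = [9, 9, 1, None, 3]

bulk = bulk_list_to_pylist(offsets, validity, child)
reference = per_row_to_pylist(offsets, validity, child)
print(bulk)  # [[1, None], None, [3]]
```

Comparing `bulk` against `reference` mirrors, in miniature, the differential check the PR describes: both paths must agree on values and on exact element types (the `1` stays an `int`, the null stays `None`).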
