dougbrn opened a new issue, #48636:
URL: https://github.com/apache/arrow/issues/48636
### Describe the enhancement requested
We recently discovered that nested data structures within parquet files,
such as a struct of lists, do not benefit from the multi-threading that is
enabled by default in pyarrow's parquet reader. However, if the same data is
instead represented at the top level as a set of list fields, then the
multi-threading works as expected. It would be nice, if possible, to enable
multi-threading within nested structures that contain multiple fields. Here are
a few code snippets/screenshots for context and reproducibility:
File Generation:
```
# Code block to generate needed parquet files
from nested_pandas.datasets import generate_data
# Generate a parquet dataset with struct-list format
nf = generate_data(100,2000, seed=1)[["nested"]]
nf.to_parquet("nested_parquet.parquet")
# Generate a parquet dataset with list-array format
nf["nested"].to_lists().to_parquet("list_parquet.parquet")
```
Versioning & Storage Context
```
import pyarrow as pa
import pyarrow.parquet  # explicit import so that pa.parquet is available
pa.__version__
> '22.0.0'
# struct of lists storage as read by pyarrow
pa.parquet.read_table("nested_parquet.parquet").field("nested")
> pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>
# list storage as read by pyarrow
pa.parquet.read_table("list_parquet.parquet").field("t")
> pyarrow.Field<t: list<element: double>>
```
Single-Thread Timings:
<img width="616" height="165" alt="Image"
src="https://github.com/user-attachments/assets/42cf6292-26d5-4c9e-8bb2-255af3f98b08"
/>
Multi-Thread Timings:
<img width="618" height="166" alt="Image"
src="https://github.com/user-attachments/assets/4c3229df-2870-4550-bfcb-f661682bc50c"
/>
We see that multi-threading improves the read speed for list-arrays, but not
for struct-list formatted data.
### Component(s)
Parquet