dougbrn opened a new issue, #48636:
URL: https://github.com/apache/arrow/issues/48636

   ### Describe the enhancement requested
   
   We recently discovered that nested data structures within Parquet files, 
such as a struct of lists, do not benefit from the multi-threading that is 
enabled by default in pyarrow's Parquet reader. However, if the same data is 
instead represented by a top-level data structure, such as a set of list 
fields, then multi-threading works as expected. It would be nice, if possible, 
to enable multi-threading within nested structures that contain multiple 
fields. Here are a few code snippets and screenshots for context and 
reproducibility:
   
   File Generation:
   
   ```
   # Code block to generate needed parquet files
   from nested_pandas.datasets import generate_data
   
   # Generate a parquet dataset with struct-list format
   nf = generate_data(100,2000, seed=1)[["nested"]]
   nf.to_parquet("nested_parquet.parquet")
   
   # Generate a parquet dataset with list-array format
   nf["nested"].to_lists().to_parquet("list_parquet.parquet")
   ```
   
   Versioning & Storage Context
   ```
   import pyarrow as pa
   import pyarrow.parquet  # ensures the pa.parquet submodule is loaded
   pa.__version__
   > '22.0.0'
   
   # struct of lists storage as read by pyarrow
   pa.parquet.read_table("nested_parquet.parquet").field("nested")
   > pyarrow.Field<nested: struct<t: list<element: double>, flux: list<element: double>, band: list<element: string>>>
   
   # list storage as read by pyarrow
   pa.parquet.read_table("list_parquet.parquet").field("t")
   > pyarrow.Field<t: list<element: double>>
   ```
   
   Single-Thread Timings:
   
   <img width="616" height="165" alt="Image" src="https://github.com/user-attachments/assets/42cf6292-26d5-4c9e-8bb2-255af3f98b08" />
   
   Multi-Thread Timings:
   
   <img width="618" height="166" alt="Image" src="https://github.com/user-attachments/assets/4c3229df-2870-4550-bfcb-f661682bc50c" />
   
   We see that multi-threading improves the read speed for list-arrays, but not 
for struct-list formatted data.
   
   ### Component(s)
   
   Parquet

