[GitHub] [arrow] rtpsw commented on issue #35498: [C++][Parquet] Parquet write_to_dataset performance regression

via GitHub Tue, 09 May 2023 23:56:26 -0700


rtpsw commented on issue #35498:
URL: https://github.com/apache/arrow/issues/35498#issuecomment-1541449693


   Looking at [the 
code](https://github.com/apache/arrow/issues/35498#issue-1701030317), I suspect 
the performance degrades because the source table holds misaligned numpy 
arrays, and every batch of every such array gets realigned by 
`EnsureAlignment`. Since the default batch size is itself aligned, each slice 
starts at the same offset from an alignment boundary as the source array, so 
batch slicing preserves the misalignment. This would explain why the 
degradation gets worse with larger arrays, which are sliced into more batches. 
One way to verify this theory is to increase the batch size in line with the 
array sizes; the performance degradation should then be reduced.
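
The slicing argument above can be checked with plain address arithmetic. The following is only a sketch under stated assumptions: a 64-byte target alignment, 8-byte int64 values, a made-up base address, and a batch size of 131072 rows chosen for illustration (it is not taken from the issue). The helper `batch_misalignments` is hypothetical and not part of Arrow.

```python
ALIGNMENT = 64  # assumed target alignment in bytes
ITEM_SIZE = 8   # bytes per int64 value


def batch_misalignments(base_address, length, batch_size):
    """Byte offset from the alignment boundary of each batch's first value."""
    return [
        (base_address + start * ITEM_SIZE) % ALIGNMENT
        for start in range(0, length, batch_size)
    ]


# Hypothetical buffer address, 16 bytes past a 64-byte boundary.
base = 0x7F0000000010

# Illustrative batch size: every batch inherits the same misalignment.
small = batch_misalignments(base, length=1_000_000, batch_size=131_072)
# Batch size equal to the array length: only one batch needs realigning.
large = batch_misalignments(base, length=1_000_000, batch_size=1_000_000)

print(len(small), set(small))
print(len(large), set(large))
```

Under these assumptions every batch starts at the same misaligned offset, so the number of realignments grows with the number of batches, while a batch size matching the array length leaves a single realignment.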
   
   As for the cause of the problem, it looks like 
`pa.Table.from_arrays([pa.array(np.arange(length))], names=['A'])` results in 
arrays that are misaligned by Arrow's standards, which the Arrow spec forbids, 
because it zero-copy-wraps [misaligned numpy 
arrays](https://numpy.org/devdocs/dev/alignment.html#). However, since this 
code is natural and has likely been accepted since the beginning, the 
realignment should probably be done within Arrow (maybe with a warning), or be 
possible via Arrow configuration. Realigning the full arrays has a performance 
cost too, of course, but it should be much lower than the cost of realigning 
many batches of each array. Looking further ahead, I'd suggest considering 
adding facilities for getting numpy arrays that are aligned by Arrow's 
standards, and documenting them accordingly. Better yet, if possible, would be 
to get numpy to support memory-alignment configuration, so that Arrow user 
code would not need to change.
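
As a rough illustration of what a facility for aligned numpy arrays could look like, here is a minimal sketch that over-allocates a byte buffer and takes a view starting at the next aligned address. The name `aligned_arange` and the 64-byte default are assumptions made for this example; neither numpy nor Arrow provides this helper.

```python
import numpy as np


def aligned_arange(length, dtype=np.int64, alignment=64):
    """Hypothetical helper: like np.arange(length), but with aligned data."""
    dtype = np.dtype(dtype)
    nbytes = length * dtype.itemsize
    # Over-allocate so that an aligned start address is guaranteed to exist.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    # Bytes to skip to reach the next multiple of `alignment`.
    offset = (-raw.ctypes.data) % alignment
    # The view keeps `raw` alive through its base reference.
    out = raw[offset:offset + nbytes].view(dtype)
    out[:] = np.arange(length, dtype=dtype)
    return out


arr = aligned_arange(1000)
print(arr.ctypes.data % 64)
```

An array built this way could then be passed to `pa.array(...)` in place of `np.arange(length)`; whether that avoids the realignment in practice would still need to be measured.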


