Hattonuri opened a new issue, #39056: URL: https://github.com/apache/arrow/issues/39056
### Describe the usage question you have. Please include as many useful details as possible.

I am trying to reduce the memory usage of reading Parquet files locally. My class has an `acero::Declaration` consisting of a ScanNode (it uses `use_threads=false`, but as far as I can tell from the gdb backtrace that doesn't change anything) plus a SinkNode.

I found that this code
https://github.com/apache/arrow/blob/47dadb02c3426c5bdd0df903dbc0f6ec17c5c784/cpp/src/parquet/arrow/reader.cc#L1151-L1164
is the only place that does not reuse its large allocations, and it is the biggest memory consumer I saw; I have no idea how to get rid of it.

So my idea is to reduce the row group size used by the writers (for example by 2x) and the `arrow_batch_size` of the reader to improve memory usage (a rough sketch of setting both knobs is at the end of this message). My questions are:

**Will this affect reading speed in any way, other than the files becoming slightly bigger?**

**Will other systems like Spark suffer from this?** I think not, because they will keep using the default Arrow batch size.

I wrote this gdb script for finding repeated allocations, in case it is useful:

```
del
set pagination off
# ReallocateAligned if (( $rdi > 1000000 ))
b AllocateAligned if (($rsi > 10000000))
# b PoolBuffer::Reserve if (( $rdx > 1000000 ))
commands
silent
# printf "realloc from: %d to %d\n", $rdx, $r14
printf "alloc for %d bytes\n", $rsi
# bt 30
# printf "----------------------------------------------------------\n"
continue
end
# skip allocations in the beginning
c 10
r
```

I would also appreciate any ideas on how that code could start caching its allocations, because on the flamegraph (mentioned in https://github.com/apache/arrow/issues/38881) it makes the program use about 30% more CPU.

### Component(s)

C++, Parquet
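For reference, here is a rough sketch (not my actual code, which goes through an Acero ScanNode) of how the two knobs could be set with the standalone `parquet::arrow` C++ API: the writer's `chunk_size` (maximum rows per row group) and the reader's `ArrowReaderProperties::set_batch_size`. The file name and the concrete sizes are placeholders, not the values from my setup.

```cpp
// Sketch only: smaller row groups on the write side, smaller batches on the read side.
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteWithSmallRowGroups(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("example.parquet"));
  // chunk_size caps the number of rows per row group; a smaller value means
  // smaller per-row-group decode buffers on the read side.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/32 * 1024);
}

arrow::Status ReadWithSmallBatches() {
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("example.parquet"));

  parquet::ArrowReaderProperties arrow_props(/*use_threads=*/false);
  arrow_props.set_batch_size(8 * 1024);  // rows per emitted RecordBatch

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // process batch...
  }
  return arrow::Status::OK();
}
```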
