Hattonuri opened a new issue, #39056: URL: https://github.com/apache/arrow/issues/39056
### Describe the usage question you have. Please include as many useful details as possible.

I am trying to reduce the memory usage of reading Parquet files locally. My class has an `acero::Declaration` consisting of a ScanNode (it uses `use_threads=false`, but as far as I can tell from the gdb backtrace that doesn't change anything) plus a SinkNode.

I found that this code
https://github.com/apache/arrow/blob/47dadb02c3426c5bdd0df903dbc0f6ec17c5c784/cpp/src/parquet/arrow/reader.cc#L1151-L1164
is the only place that does not reuse its large allocations, and it is the biggest memory consumer I saw; I have no idea how to get rid of it.

So my idea is to reduce the row group size used by the writers (for example by 2x) and the `arrow_batch_size` of the reader to improve memory usage (a rough sketch of setting both knobs is at the end of this message). My questions are:

**Will this affect reading speed in any way, other than the files becoming slightly bigger?**

**Will other systems like Spark suffer from this?** I think not, because they will keep using the default Arrow batch size.

I wrote this gdb script for finding repeated allocations, in case it is useful:

```
del
set pagination off
# ReallocateAligned if (( $rdi > 1000000 ))
b AllocateAligned if (($rsi > 10000000))
# b PoolBuffer::Reserve if (( $rdx > 1000000 ))
commands
silent
# printf "realloc from: %d to %d\n", $rdx, $r14
printf "alloc for %d bytes\n", $rsi
# bt 30
# printf "----------------------------------------------------------\n"
continue
end
# skip allocations in the beginning
c 10
r
```

I would also appreciate any ideas on how that code could start caching its allocations, because on the flamegraph (mentioned in https://github.com/apache/arrow/issues/38881) it makes the program use about 30% more CPU.

### Component(s)

C++, Parquet
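For reference, here is a rough sketch (not my actual code, which goes through an Acero ScanNode) of how the two knobs could be set with the standalone `parquet::arrow` C++ API: the writer's `chunk_size` (maximum rows per row group) and the reader's `ArrowReaderProperties::set_batch_size`. The file name and the concrete sizes are placeholders, not the values from my setup.

```cpp
// Sketch only: smaller row groups on the write side, smaller batches on the read side.
#include <memory>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

arrow::Status WriteWithSmallRowGroups(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("example.parquet"));
  // chunk_size caps the number of rows per row group; a smaller value means
  // smaller per-row-group decode buffers on the read side.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/32 * 1024);
}

arrow::Status ReadWithSmallBatches() {
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("example.parquet"));

  parquet::ArrowReaderProperties arrow_props(/*use_threads=*/false);
  arrow_props.set_batch_size(8 * 1024);  // rows per emitted RecordBatch

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(input));
  builder.properties(arrow_props);

  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // process batch...
  }
  return arrow::Status::OK();
}
```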
