jp0317 opened a new pull request, #39818:
URL: https://github.com/apache/arrow/pull/39818

   ### Rationale for this change
   
   Currently, each invocation of `SkipRecords()` for non-repeated fields [creates a new buffer](https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1482) object. I think it is probably worth keeping the buffer object alive and just resizing it for each skip, since the buffer is only a bitmap (i.e., it should remain quite small even if we don't free its memory after a skip). Performance results are as follows:
   
   Keep buffer object alive:
   
   | Benchmark | Time | CPU | Iterations | UserCounters |
   |---|---|---|---|---|
   | RecordReaderSkipRecords/Repetition:0/BatchSize:1 | 29958201 ns | 29943377 ns | 23 | bytes_per_second=163.068M/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:10 | 3190298 ns | 3190524 ns | 227 | bytes_per_second=1.49454G/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:100 | 479056 ns | 480437 ns | 1500 | bytes_per_second=9.92507G/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:1000 | 256497 ns | 257725 ns | 2763 | bytes_per_second=18.5018G/s |
   | RecordReaderSkipRecords/Repetition:1/BatchSize:1000 | 2910364 ns | 2910479 ns | 239 | bytes_per_second=893.189M/s |
   | RecordReaderSkipRecords/Repetition:2/BatchSize:1000 | 20539472 ns | 20535632 ns | 34 | bytes_per_second=135.007M/s |
   
   Recreate upon each skip (current behavior):
   
   | Benchmark | Time | CPU | Iterations | UserCounters |
   |---|---|---|---|---|
   | RecordReaderSkipRecords/Repetition:0/BatchSize:1 | 33261760 ns | 33199124 ns | 21 | bytes_per_second=147.077M/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:10 | 3256993 ns | 3254609 ns | 216 | bytes_per_second=1.46511G/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:100 | 492856 ns | 493377 ns | 1447 | bytes_per_second=9.66477G/s |
   | RecordReaderSkipRecords/Repetition:0/BatchSize:1000 | 262449 ns | 263227 ns | 2694 | bytes_per_second=18.1151G/s |
   | RecordReaderSkipRecords/Repetition:1/BatchSize:1000 | 2996951 ns | 2997148 ns | 235 | bytes_per_second=867.36M/s |
   | RecordReaderSkipRecords/Repetition:2/BatchSize:1000 | 20864734 ns | 20850593 ns | 34 | bytes_per_second=132.968M/s |
   
   ### What changes are included in this PR?
   
   Change the stack-allocated buffer object into a private member of the record reader, so that it is resized and reused across skips instead of being recreated.
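
The idea can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: `SketchRecordReader`, `scratch_bitmap_`, and `reallocations()` are hypothetical names, and `std::vector<uint8_t>` stands in for the Arrow buffer type. It only shows that a member buffer that is resized per call stops reallocating once it has reached its largest size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the record reader; NOT Arrow's actual class.
class SketchRecordReader {
 public:
  // Pretends to skip `num_records` non-repeated values. The scratch bitmap
  // needs one bit per record. Returns the bitmap size in bytes.
  std::size_t SkipRecords(std::size_t num_records) {
    const std::size_t bitmap_bytes = (num_records + 7) / 8;
    const std::size_t capacity_before = scratch_bitmap_.capacity();

    // Resize the long-lived member instead of constructing a new buffer.
    // Shrinking a std::vector keeps its capacity, so later calls of the
    // same or smaller size do not allocate again.
    scratch_bitmap_.resize(bitmap_bytes);
    assert(scratch_bitmap_.size() == bitmap_bytes);

    if (scratch_bitmap_.capacity() != capacity_before) {
      ++reallocations_;  // Underlying storage had to grow.
    }
    return bitmap_bytes;
  }

  // Number of times the scratch storage was (re)allocated so far.
  std::size_t reallocations() const { return reallocations_; }

 private:
  std::vector<std::uint8_t> scratch_bitmap_;  // Kept alive across skips.
  std::size_t reallocations_ = 0;
};
```

In this sketch, calling `SkipRecords(1000)`, then `SkipRecords(10)`, then `SkipRecords(1000)` again allocates only on the first call; the trade-off is that the bitmap's memory (here 125 bytes) stays reserved for the lifetime of the reader.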
   
   ### Are these changes tested?
   
   Yes, with the microbenchmarks shown above.
   
   ### Are there any user-facing changes?
   
   No.


