ogrman opened a new issue, #5150:
URL: https://github.com/apache/arrow-rs/issues/5150

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Prior to version 41 of the parquet crate we had access to the `read_batch` 
function, which was deprecated (and changed) in favor of `read_records`. What 
we are trying to do does not seem to be possible with the new API. We have a
program that reads parquet files and concatenates them vertically, so that a 
number of parquet files with identical schemas become one file.
   
   We did this by running the following loop for each input file and column:
   
   ```rust
   loop {
       // Read up to BATCH_SIZE values/levels from the source column.
       let (values_read, levels_read) = column_reader.read_batch(
           BATCH_SIZE,
           Some(&mut def_levels[..]),
           Some(&mut rep_levels[..]),
           &mut value_buffer[..],
       )?;

       if values_read == 0 && levels_read == 0 {
           break;
       }

       // Only the first `values_read` entries of `value_buffer` are valid:
       // null entries contribute levels but no values.
       let values_written = column_writer.write_batch(
           &value_buffer[0..values_read],
           Some(&def_levels[0..levels_read]),
           Some(&rep_levels[0..levels_read]),
       )?;

       assert_eq!(values_written, values_read);
   }
   ```
   
   This simple loop turned many "small" files into one large file with the 
same schema. After this change, when we replace the call to `read_batch` with a 
call to `read_records`, we no longer get a complete batch, which means that 
sometimes we start writing a new batch while `rep_levels` is still 1.
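   The failure mode can be illustrated with a small, self-contained sketch 
(plain Rust, no parquet types; `record_starts` is a hypothetical helper, not 
part of the parquet crate): a repetition level of 0 marks the start of a new 
record, so a fixed-size read can stop while the final record is only partially 
consumed.

   ```rust
   /// Returns the indices where new records begin: a repetition level of 0
   /// starts a record, while non-zero levels continue the current one
   /// (e.g. further elements of the same list).
   fn record_starts(rep_levels: &[i16]) -> Vec<usize> {
       rep_levels
           .iter()
           .enumerate()
           .filter(|&(_, &l)| l == 0)
           .map(|(i, _)| i)
           .collect()
   }

   fn main() {
       // Two records: the second spans indices 3..5.
       let rep_levels: [i16; 5] = [0, 1, 1, 0, 1];
       assert_eq!(record_starts(&rep_levels), vec![0, 3]);

       // A read capped at 4 levels returns only index 3 of the second
       // record; the next read then begins with rep_level == 1, i.e.
       // mid-record, which is the situation the write loop above cannot
       // handle.
       assert_eq!(rep_levels[4], 1);
   }
   ```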
   
   **Describe the solution you'd like**
   
   A way to simply read a complete batch without having to realign our 
buffers between writes. I am also open to suggestions for how what we are 
doing could be better solved in a different way, but the code we have works 
great with previous versions of parquet, and we are currently blocked from 
upgrading.
   
   **Describe alternatives you've considered**
   
   Manually finding the last element with `rep_levels` = 0 and stopping our 
reads there: doing some math, writing a batch that excludes the end of the 
buffers, copying the end of the buffers to the start of our buffers, and 
reading fewer records according to how much space is already used in the 
buffers.
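   A rough sketch of that alternative, on bare level buffers 
(`complete_prefix_len` is a hypothetical helper, not part of the parquet 
crate, and the example assumes a column where values and levels line up 
one-to-one):

   ```rust
   /// Length of the prefix that is safe to write: everything before the
   /// start of the final record. The final record may continue into the
   /// next read, so it must be held back.
   fn complete_prefix_len(rep_levels: &[i16]) -> usize {
       // The last position with rep_level == 0 starts the final record.
       rep_levels.iter().rposition(|&l| l == 0).unwrap_or(0)
   }

   fn main() {
       let mut rep_levels = vec![0i16, 1, 1, 0, 1];
       let cut = complete_prefix_len(&rep_levels);
       assert_eq!(cut, 3);

       // Write only the complete prefix...
       let writable = &rep_levels[..cut];
       assert_eq!(writable.to_vec(), vec![0, 1, 1]);

       // ...then carry the incomplete tail to the front of the buffer and
       // read fewer levels next time to fill the remaining space.
       rep_levels.drain(..cut);
       assert_eq!(rep_levels, vec![0, 1]);
   }
   ```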
   
   **Additional context**
   
   -
   

