lazear opened a new issue, #4649:
URL: https://github.com/apache/arrow-rs/issues/4649

   **Describe the bug**
   I am trying to write a parquet file using the `SerializedColumnWriter` API (I'm not using RecordBatches), and I'm running into issues when writing smaller batches to a column.
   
   Schema: 
   ```
               ....
               required float spectrum_q;
               required float peptide_q;
               required float protein_q;
               optional group reporter_ions (LIST) {
                   repeated group list {
                       optional float element;
                   }
               }
   ```
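
   For reference, here is my understanding of the level encoding for this schema (it's what the batches below are built from; apologies if I've misread the Dremel rules):
   ```rs
    // My reading of the level encoding for the schema above:
    const MAX_DEF_LEVEL: i16 = 3; // optional group (+1) + repeated list (+1) + optional element (+1)
    const MAX_REP_LEVEL: i16 = 1; // one repeated level
    // def = 0 => reporter_ions is null for this row
    // def = 1 => reporter_ions is present, but its list is empty
    // def = 2 => a list entry exists, but the element is null
    // def = 3 => the element value is present
    // rep = 0 => the value starts a new row; rep = 1 => it continues the row's list
   ```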
   
   Each row of my data may or may not have a vector of floats associated with it (reporter_ion.peaks). Ideally, I could iterate through the rows and write to the column in small batches (~16 values at a time, or one null value for an empty row).
   ```rs
       let mut scan_map = HashMap::new();
   
       for r in reporter_ions {
           scan_map.entry((r.file_id, &r.spec_id)).or_insert(r);
       }
   
       // Caller guarantees `reporter_ions` is not empty
       let channels = reporter_ions[0].peaks.len();
   
       // https://docs.rs/parquet/44.0.0/parquet/column/index.html
        let def_levels = vec![3i16; channels];
       let mut rep_levels = vec![1i16; channels];
       rep_levels[0] = 0;
   
       let col = column.typed::<FloatType>();
       for feature in features {
            if let Some(rs) = scan_map.get(&(feature.file_id, &feature.spec_id)) {
                col.write_batch(&rs.peaks, Some(&def_levels), Some(&rep_levels))?;
           } else {
               col.write_batch(&[0.0], Some(&[1]), Some(&[0]))?;
           }
       }
       column.close()?;
   ```
   But this causes a panic:
   ```
   thread 'main' panicked at 'assertion failed: `(left == right)`
     left: `79`,
     right: `0`', /home/michael/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-44.0.0/src/util/bit_util.rs:272:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   Aborted
   ```
   
   However, if I concatenate all of the data, definition levels, and repetition levels together and then perform one write, it succeeds:
   ```rs
        let col = column.typed::<FloatType>();
        let mut data = Vec::new();
        let mut def_levels = Vec::new();
        let mut rep_levels = Vec::new();
   
       for feature in features {
            if let Some(rs) = scan_map.get(&(feature.file_id, &feature.spec_id)) {
               data.extend(rs.peaks.iter().copied());
               def_levels.extend(std::iter::repeat(3).take(channels));
               rep_levels.extend(
                   std::iter::once(0)
                       .chain(std::iter::repeat(1))
                       .take(channels),
               );
           } else {
               data.push(0.0);
               def_levels.push(1);
               rep_levels.push(0);
           }
       }
       col.write_batch(&data, Some(&def_levels), Some(&rep_levels))?;
   
       column.close()?;
   ```
   
   I'm not sure if this is a bug, expected behavior, or a mistake on my end. It looks like there's an internal `write_mini_batch` function; if this is expected behavior, would it be possible to expose something similar? Having to concatenate all of the row-level vectors into one large vector could be pretty expensive for my use case (>20M rows in memory, 16 floats each).
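
   In the meantime, a possible middle ground would be to buffer a bounded number of rows and issue one `write_batch` per chunk. An untested sketch of what I mean, reusing the same variables as above (`ROWS_PER_CHUNK` is an arbitrary number I made up):
   ```rs
        // Untested sketch: buffer levels/values for a fixed number of rows,
        // then flush one write_batch per chunk, so peak memory stays bounded.
        const ROWS_PER_CHUNK: usize = 64 * 1024;

        let col = column.typed::<FloatType>();
        let mut data = Vec::with_capacity(ROWS_PER_CHUNK * channels);
        let mut def_levels = Vec::with_capacity(ROWS_PER_CHUNK * channels);
        let mut rep_levels = Vec::with_capacity(ROWS_PER_CHUNK * channels);

        for (i, feature) in features.iter().enumerate() {
            if let Some(rs) = scan_map.get(&(feature.file_id, &feature.spec_id)) {
                data.extend(rs.peaks.iter().copied());
                def_levels.extend(std::iter::repeat(3).take(channels));
                rep_levels.extend(std::iter::once(0).chain(std::iter::repeat(1)).take(channels));
            } else {
                // If I've read the Dremel rules right, no value is consumed
                // when def < max def, so no placeholder value is pushed here.
                def_levels.push(1);
                rep_levels.push(0);
            }
            // Flush a full chunk and reuse the buffers.
            if (i + 1) % ROWS_PER_CHUNK == 0 {
                col.write_batch(&data, Some(&def_levels), Some(&rep_levels))?;
                data.clear();
                def_levels.clear();
                rep_levels.clear();
            }
        }
        // Flush whatever is left over.
        if !def_levels.is_empty() {
            col.write_batch(&data, Some(&def_levels), Some(&rep_levels))?;
        }
        column.close()?;
   ```
   Of course, this still issues multiple `write_batch` calls per column, so it would hit the same panic described above if that isn't supported.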
   
   
   
   **To Reproduce**
   I will see if I can put together a minimal program that reproduces this.
   
   **Expected behavior**
   I would expect writing smaller batches to work the same as a single large write.
   
   **Additional context**
   https://github.com/lazear/sage/blob/parquet/crates/sage-cloudpath/src/parquet.rs

