lazear opened a new issue, #4649:
URL: https://github.com/apache/arrow-rs/issues/4649
**Describe the bug**
I am trying to write a parquet file using the SerializedColumnWriter API
(I'm not using RecordBatches), and having issues when trying to write smaller
batches to a column.
Schema:
```
....
required float spectrum_q;
required float peptide_q;
required float protein_q;
optional group reporter_ions (LIST) {
  repeated group list {
    optional float element;
  }
}
```
Each row of my data may or may not have a vector of floats associated with
it (reporter_ion.peaks). Ideally, I could iterate through the rows and write to
the column in small batches (~16 values at a time, or 1 null value for an empty
row).
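For context, here is how I believe the Dremel levels work out for this schema (my reading of the parquet docs: the optional group, repeated group, and optional element each contribute one level, so max definition level is 3 and max repetition level is 1). `levels_for_row` is just an illustrative stdlib-only helper, not part of the parquet crate:

```rust
/// Illustrative helper: compute the definition/repetition levels for one row
/// of the `reporter_ions` column above. Not part of the parquet API.
fn levels_for_row(peaks: Option<&[f32]>) -> (Vec<i16>, Vec<i16>) {
    match peaks {
        // Non-empty list: every element is fully defined (def = 3); the first
        // element starts a new record (rep = 0), the rest repeat (rep = 1).
        Some(p) if !p.is_empty() => {
            let def = vec![3i16; p.len()];
            let mut rep = vec![1i16; p.len()];
            rep[0] = 0;
            (def, rep)
        }
        // Empty or missing list: a single level entry and no value consumed.
        // def = 1 marks "reporter_ions present but empty"; def = 0 would mean
        // the optional group itself is null.
        _ => (vec![1i16], vec![0i16]),
    }
}
```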
```rs
let mut scan_map = HashMap::new();
for r in reporter_ions {
    scan_map.entry((r.file_id, &r.spec_id)).or_insert(r);
}
// Caller guarantees `reporter_ions` is not empty
let channels = reporter_ions[0].peaks.len();
// https://docs.rs/parquet/44.0.0/parquet/column/index.html
let def_levels = vec![3i16; channels];
let mut rep_levels = vec![1i16; channels];
rep_levels[0] = 0;
let col = column.typed::<FloatType>();
for feature in features {
    if let Some(rs) = scan_map.get(&(feature.file_id, &feature.spec_id)) {
        col.write_batch(&rs.peaks, Some(&def_levels), Some(&rep_levels))?;
    } else {
        col.write_batch(&[0.0], Some(&[1]), Some(&[0]))?;
    }
}
column.close()?;
```
However, this causes a panic:
```
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `79`,
right: `0`',
/home/michael/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-44.0.0/src/util/bit_util.rs:272:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted
```
However, if I concatenate all of the data/definition levels/repetition
levels together and then perform one write, it succeeds:
```rs
let mut data = Vec::new();
let mut def_levels = Vec::new();
let mut rep_levels = Vec::new();
for feature in features {
    if let Some(rs) = scan_map.get(&(feature.file_id, &feature.spec_id)) {
        data.extend(rs.peaks.iter().copied());
        def_levels.extend(std::iter::repeat(3).take(channels));
        rep_levels.extend(
            std::iter::once(0)
                .chain(std::iter::repeat(1))
                .take(channels),
        );
    } else {
        data.push(0.0);
        def_levels.push(1);
        rep_levels.push(0);
    }
}
col.write_batch(&data, Some(&def_levels), Some(&rep_levels))?;
column.close()?;
```
I'm not sure if this is a bug, expected behavior, or a mistake on my end. It
looks like there's an internal `write_mini_batch` function; if this is expected
behavior, would it be possible to expose something similar? Concatenating all
of the row-level vectors into one large vector could be pretty expensive for
my use case (> 20M rows in memory, 16 floats each).
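A possible middle ground would be chunked buffering: accumulate a bounded number of rows, flush them in one `write_batch` call, and clear the buffers, so peak memory scales with the chunk size rather than the full row count. A stdlib-only sketch, where the `flush` closure stands in for a single `col.write_batch(...)` call (whether larger, less frequent calls avoid the panic above is an open question):

```rust
/// Sketch: buffer up to `chunk` rows of an optional-list-of-floats column,
/// flushing accumulated values and levels in batches. `flush` stands in for
/// one write_batch call; it is a hypothetical placeholder, not a parquet API.
fn write_chunked<F>(rows: &[Option<Vec<f32>>], chunk: usize, mut flush: F)
where
    F: FnMut(&[f32], &[i16], &[i16]),
{
    let mut data = Vec::new();
    let mut defs = Vec::new();
    let mut reps = Vec::new();
    let mut rows_buffered = 0;
    for row in rows {
        match row {
            Some(peaks) if !peaks.is_empty() => {
                data.extend_from_slice(peaks);
                defs.extend(std::iter::repeat(3).take(peaks.len()));
                reps.push(0); // first element starts a new record
                reps.extend(std::iter::repeat(1).take(peaks.len() - 1));
            }
            // Empty or missing row: placeholder value, not read when def < max
            _ => {
                data.push(0.0);
                defs.push(1);
                reps.push(0);
            }
        }
        rows_buffered += 1;
        if rows_buffered == chunk {
            flush(&data, &defs, &reps);
            data.clear();
            defs.clear();
            reps.clear();
            rows_buffered = 0;
        }
    }
    if rows_buffered > 0 {
        flush(&data, &defs, &reps);
    }
}
```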
**To Reproduce**
I will see if I can break this into a minimal program that reproduces it.
**Expected behavior**
I would expect writing smaller batches to work.
**Additional context**
https://github.com/lazear/sage/blob/parquet/crates/sage-cloudpath/src/parquet.rs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]