adamreeve opened a new issue, #6952:
URL: https://github.com/apache/arrow-rs/issues/6952
**Describe the bug**
Writing f32 or f64 data to Parquet is a lot slower when there are many NaN
values. This can be worked around by disabling dictionary encoding, but that's
not always ideal.
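For reference, the workaround looks roughly like this (a sketch using the parquet crate's `WriterProperties` builder; `set_dictionary_enabled(false)` turns off dictionary encoding for all columns, which avoids the interner entirely at the cost of losing dictionary compression elsewhere):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Dictionary encoding off for every column; the float data is then written
    // with a non-dictionary encoding and the NaN interning slow path is never hit.
    let props = WriterProperties::builder()
        .set_dictionary_enabled(false)
        .build();
    // `props` would be passed to ArrowWriter::try_new, as in the repro below.
    let _ = props;
}
```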
**To Reproduce**
Example code that writes a single Float32 column filled with a repeated NaN or zero value and times each write:
<details>
```rust
use arrow::array::Float32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::{
    basic::{Compression, ZstdLevel},
    file::properties::WriterProperties,
};
use std::fs::File;
use std::sync::Arc;
use std::time::Instant;

fn make_batch(count: usize, value: f32) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "value",
        DataType::Float32,
        false,
    )]));
    let vals = Float32Array::from(vec![value; count]);
    RecordBatch::try_new(schema, vec![Arc::new(vals)]).unwrap()
}

fn timed_write(num_rows: usize, value: f32) {
    let start = Instant::now();
    let writer_properties = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        .build();
    let batch = make_batch(num_rows, value);
    let file = File::create("data.parquet").unwrap();
    let mut writer =
        ArrowWriter::try_new(file, batch.schema(), Some(writer_properties)).unwrap();
    writer.write(&batch).expect("Writing batch");
    writer.close().unwrap();
    let us = start.elapsed().as_micros();
    let us_per_row = us as f64 / num_rows as f64;
    println!("value={value}, rows={num_rows}, us={us}, us/row={us_per_row}");
}

fn main() {
    let row_counts = vec![
        1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000,
    ];
    for value in &[f32::NAN, 0f32] {
        for row_count in &row_counts {
            timed_write(*row_count, *value);
        }
    }
}
```
</details>
On my machine, this outputs:
```
value=NaN, rows=1000, us=1373, us/row=1.373
value=NaN, rows=2000, us=2483, us/row=1.2415
value=NaN, rows=4000, us=9043, us/row=2.26075
value=NaN, rows=8000, us=37263, us/row=4.657875
value=NaN, rows=16000, us=153460, us/row=9.59125
value=NaN, rows=32000, us=695833, us/row=21.74478125
value=NaN, rows=64000, us=3164916, us/row=49.4518125
value=NaN, rows=128000, us=14855691, us/row=116.0600859375
value=0, rows=1000, us=8250, us/row=8.25
value=0, rows=2000, us=122, us/row=0.061
value=0, rows=4000, us=142, us/row=0.0355
value=0, rows=8000, us=194, us/row=0.02425
value=0, rows=16000, us=311, us/row=0.0194375
value=0, rows=32000, us=512, us/row=0.016
value=0, rows=64000, us=973, us/row=0.015203125
value=0, rows=128000, us=1724, us/row=0.01346875
```
You can see that as the number of rows increases, the time taken grows roughly quadratically when writing NaNs (doubling the row count roughly quadruples the total time), but close to linearly when writing 0.
**Expected behavior**
Writing NaNs should perform similarly to writing non-NaN floating point
values.
**Additional context**
This is due to `NaN == NaN` evaluating to false here:
https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/parquet/src/util/interner.rs#L69
Because every NaN compares unequal to the entries already stored, each NaN value adds a new entry to the dictionary key storage and hash table. All of those entries share the same hash, so each subsequent insert has to probe past a growing run of colliding, non-equal values, which is why the cost grows quadratically with the number of NaNs.
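To make the mechanism concrete, here is a toy standalone model of the interning step (`ToyInterner` is invented for illustration and is not the actual `Interner` in `parquet`; the real one uses a hash table, but since all NaNs land in the same bucket and compare unequal, the effect is the same linear probe per insert):

```rust
// Toy model of a dictionary interner: look up by equality, append if missing.
// With `==` semantics, every NaN misses the lookup, so the store grows by one
// entry per NaN and each lookup scans all previously inserted NaNs.
struct ToyInterner {
    storage: Vec<f32>,
}

impl ToyInterner {
    fn intern(&mut self, value: f32) -> usize {
        // NaN != NaN, so this never finds a match for NaN inputs.
        if let Some(idx) = self.storage.iter().position(|&v| v == value) {
            return idx;
        }
        self.storage.push(value);
        self.storage.len() - 1
    }
}

fn main() {
    let mut interner = ToyInterner { storage: Vec::new() };
    for _ in 0..10 {
        interner.intern(f32::NAN);
    }
    // Prints 10: every NaN became a distinct dictionary entry.
    println!("NaN entries = {}", interner.storage.len());

    let mut interner = ToyInterner { storage: Vec::new() };
    for _ in 0..10 {
        interner.intern(0.0);
    }
    // Prints 1: repeated zeros are deduplicated as expected.
    println!("zero entries = {}", interner.storage.len());
}
```

Ten repeated NaNs produce ten dictionary entries while ten repeated zeros produce one, and the per-insert scan over the accumulated NaN entries is what makes the total cost quadratic.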
Arrow C++ doesn't have this problem, as it specialises the comparison of floating point types to treat all NaNs as equal to each other for hashing purposes:
https://github.com/apache/arrow/blob/5ad0b3e36f9302c4cf8dd5ab997f30bfab95e2d4/cpp/src/arrow/util/hashing.h#L132-L144
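A minimal sketch of what an analogous approach could look like in Rust (illustration only, not a proposed patch; `CanonicalF32` is invented here): hash and compare the value's bit pattern, canonicalising NaNs so they all map to the same dictionary key.

```rust
use std::collections::HashMap;

/// Illustrative key type: hash and compare the f32's bit pattern, with every
/// NaN canonicalised to a single representative so all NaNs intern to one
/// dictionary entry.
#[derive(PartialEq, Eq, Hash)]
struct CanonicalF32(u32);

impl CanonicalF32 {
    fn new(value: f32) -> Self {
        if value.is_nan() {
            // Any fixed NaN bit pattern works as the canonical representative.
            CanonicalF32(f32::NAN.to_bits())
        } else {
            CanonicalF32(value.to_bits())
        }
    }
}

fn main() {
    let mut dictionary: HashMap<CanonicalF32, usize> = HashMap::new();
    for v in [f32::NAN, -f32::NAN, 1.5_f32, f32::NAN, 1.5] {
        let next_idx = dictionary.len();
        let idx = *dictionary.entry(CanonicalF32::new(v)).or_insert(next_idx);
        println!("{v} -> dictionary index {idx}");
    }
    // Prints 2: one entry for all NaNs, one for 1.5. Note that bitwise equality
    // distinguishes +0.0 from -0.0, but that only costs one extra entry.
    println!("dictionary size = {}", dictionary.len());
}
```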