adamreeve opened a new issue, #6952: URL: https://github.com/apache/arrow-rs/issues/6952
**Describe the bug**

Writing f32 or f64 data to Parquet is a lot slower when there are many NaN values. This can be worked around by disabling dictionary encoding (see the workaround snippet at the end of this issue), but that's not always ideal.

**To Reproduce**

Example code that just writes a repeated NaN or zero value:

<details>

```rust
use arrow::array::Float32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::{
    basic::{Compression, ZstdLevel},
    file::properties::WriterProperties,
};
use std::fs::File;
use std::sync::Arc;
use std::time::Instant;

fn make_batch(count: usize, value: f32) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "value",
        DataType::Float32,
        false,
    )]));
    let vals = Float32Array::from(vec![value; count]);
    RecordBatch::try_new(schema, vec![Arc::new(vals)]).unwrap()
}

fn timed_write(num_rows: usize, value: f32) {
    let start = Instant::now();
    let writer_properties = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        .build();
    let batch = make_batch(num_rows, value);
    let file = File::create("data.parquet").unwrap();
    let mut writer =
        ArrowWriter::try_new(file, batch.schema(), Some(writer_properties)).unwrap();
    writer.write(&batch).expect("Writing batch");
    writer.close().unwrap();
    let us = start.elapsed().as_micros();
    let us_per_row = us as f64 / num_rows as f64;
    println!("value={value}, rows={num_rows}, us={us}, us/row={us_per_row}");
}

fn main() {
    let row_counts = vec![1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000];
    for value in &vec![f32::NAN, 0f32] {
        for row_count in &row_counts {
            timed_write(*row_count, *value);
        }
    }
}
```

</details>

On my machine, this outputs:

```
value=NaN, rows=1000, us=1373, us/row=1.373
value=NaN, rows=2000, us=2483, us/row=1.2415
value=NaN, rows=4000, us=9043, us/row=2.26075
value=NaN, rows=8000, us=37263, us/row=4.657875
value=NaN, rows=16000, us=153460, us/row=9.59125
value=NaN, rows=32000, us=695833, us/row=21.74478125
value=NaN, rows=64000, us=3164916, us/row=49.4518125
value=NaN, rows=128000, us=14855691, us/row=116.0600859375
value=0, rows=1000, us=8250, us/row=8.25
value=0, rows=2000, us=122, us/row=0.061
value=0, rows=4000, us=142, us/row=0.0355
value=0, rows=8000, us=194, us/row=0.02425
value=0, rows=16000, us=311, us/row=0.0194375
value=0, rows=32000, us=512, us/row=0.016
value=0, rows=64000, us=973, us/row=0.015203125
value=0, rows=128000, us=1724, us/row=0.01346875
```

You can see that as the number of rows increases, the time taken grows much faster than linearly (roughly quadratically) when writing NaNs, but close to linearly when writing 0.

**Expected behavior**

Writing NaNs should perform similarly to writing non-NaN floating point values.

**Additional context**

This is due to NaN == NaN being false here:

https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/parquet/src/util/interner.rs#L69

So for each NaN value, a new entry is added to the dictionary key storage and hash table, and there will be many non-equal values in the hash table with the same hash, meaning every new NaN has to be compared against all of the NaN entries inserted before it.

Arrow C++ doesn't have this problem, as it specialises the comparison of floating point types to treat NaNs as equal to all other NaNs for hashing:

https://github.com/apache/arrow/blob/5ad0b3e36f9302c4cf8dd5ab997f30bfab95e2d4/cpp/src/arrow/util/hashing.h#L132-L144
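Not a concrete proposal, but a minimal sketch of the direction the Arrow C++ approach suggests: intern floats by a canonicalised bit pattern so that every NaN collapses onto a single dictionary entry. The `dictionary_key` helper and the `HashMap`-based interner below are purely illustrative and are not the actual arrow-rs interner code:

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of arrow-rs): map an f32 to a key that is
/// bitwise-stable and collapses every NaN bit pattern onto a single value,
/// similar in spirit to what Arrow C++ does in its hashing utilities.
fn dictionary_key(v: f32) -> u32 {
    if v.is_nan() {
        // Use one canonical NaN bit pattern so NaN == NaN for dictionary purposes.
        f32::NAN.to_bits()
    } else {
        // Note: keying on exact bits means +0.0 and -0.0 would become separate
        // entries, unlike the current `==`-based comparison.
        v.to_bits()
    }
}

fn main() {
    // Interning by the canonicalised bit pattern keeps the dictionary at a
    // single entry even when every input value is NaN.
    let mut dictionary: HashMap<u32, usize> = HashMap::new();
    let mut keys = Vec::new();
    for v in std::iter::repeat(f32::NAN).take(1_000) {
        let next_index = dictionary.len();
        let index = *dictionary.entry(dictionary_key(v)).or_insert(next_index);
        keys.push(index);
    }
    assert_eq!(dictionary.len(), 1);
    println!("rows encoded: {}, dictionary entries: {}", keys.len(), dictionary.len());
}
```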
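For completeness, the workaround mentioned above looks roughly like this, just disabling dictionary encoding on the existing `WriterProperties` builder (at the cost of losing dictionary compression for all columns; I believe it can also be scoped to a single column via `set_column_dictionary_enabled`):

```rust
use parquet::{
    basic::{Compression, ZstdLevel},
    file::properties::WriterProperties,
};

fn main() {
    // Workaround sketch: disable dictionary encoding so float values never go
    // through the dictionary interner, avoiding the NaN slowdown.
    let writer_properties = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
        .set_dictionary_enabled(false)
        .build();

    // `writer_properties` can then be passed to ArrowWriter::try_new as in the
    // reproduction above.
    let _ = writer_properties;
}
```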