adamreeve opened a new issue, #6952:
URL: https://github.com/apache/arrow-rs/issues/6952

   **Describe the bug**
   
   Writing f32 or f64 data to Parquet is a lot slower when there are many NaN 
values. This can be worked around by disabling dictionary encoding, but that's 
not always ideal.
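   As a sketch of that workaround, dictionary encoding can be disabled either globally or per column via `WriterProperties` (the column name `"value"` here is just the one used in the repro below):

   ```rust
   use parquet::file::properties::WriterProperties;
   use parquet::schema::types::ColumnPath;

   fn main() {
       // Disable dictionary encoding for the NaN-heavy column only;
       // set_dictionary_enabled(false) would disable it for all columns.
       let _props = WriterProperties::builder()
           .set_column_dictionary_enabled(ColumnPath::from("value"), false)
           .build();
   }
   ```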
   
   **To Reproduce**
   
   Example code that just writes a repeated NaN or zero value:
   
   <details>
   
   ```rust
   use arrow::array::Float32Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::{
       basic::{Compression, ZstdLevel},
       file::properties::WriterProperties,
   };
   
   use std::fs::File;
   use std::sync::Arc;
   use std::time::Instant;
   
   fn make_batch(count: usize, value: f32) -> RecordBatch {
       let schema = Arc::new(Schema::new(vec![Field::new(
           "value",
           DataType::Float32,
           false,
       )]));
       let vals = Float32Array::from(vec![value; count]);
       RecordBatch::try_new(schema, vec![Arc::new(vals)]).unwrap()
   }
   
   fn timed_write(num_rows: usize, value: f32) {
       let start = Instant::now();
   
       let writer_properties = WriterProperties::builder()
           .set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
           .build();
   
       let batch = make_batch(num_rows, value);
   
       let file = File::create("data.parquet").unwrap();
    let mut writer =
        ArrowWriter::try_new(file, batch.schema(), Some(writer_properties)).unwrap();
   
       writer.write(&batch).expect("Writing batch");
   
       writer.close().unwrap();
   
       let us = start.elapsed().as_micros();
       let us_per_row = us as f64 / num_rows as f64;
   
       println!("value={value}, rows={num_rows}, us={us}, us/row={us_per_row}");
   }
   
   fn main() {
    let row_counts = vec![
        1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000,
    ];
   
       for value in &vec![f32::NAN, 0f32] {
           for row_count in &row_counts {
               timed_write(*row_count, *value);
           }
       }
   }
   ```
   
   </details>
   
   On my machine, this outputs:
   ```
    value=NaN, rows=1000, us=1373, us/row=1.373
    value=NaN, rows=2000, us=2483, us/row=1.2415
    value=NaN, rows=4000, us=9043, us/row=2.26075
    value=NaN, rows=8000, us=37263, us/row=4.657875
    value=NaN, rows=16000, us=153460, us/row=9.59125
    value=NaN, rows=32000, us=695833, us/row=21.74478125
    value=NaN, rows=64000, us=3164916, us/row=49.4518125
    value=NaN, rows=128000, us=14855691, us/row=116.0600859375
    value=0, rows=1000, us=8250, us/row=8.25
    value=0, rows=2000, us=122, us/row=0.061
    value=0, rows=4000, us=142, us/row=0.0355
    value=0, rows=8000, us=194, us/row=0.02425
    value=0, rows=16000, us=311, us/row=0.0194375
    value=0, rows=32000, us=512, us/row=0.016
    value=0, rows=64000, us=973, us/row=0.015203125
    value=0, rows=128000, us=1724, us/row=0.01346875
    ```
   
   You can see that as the number of rows increases, the time taken grows superlinearly (roughly quadratically, since the time per row approximately doubles each time the row count doubles) when writing NaNs, but close to linearly when writing 0.
   
   **Expected behavior**
   
   Writing NaNs should perform similarly to writing non-NaN floating point 
values.
   
   **Additional context**
   
   This is due to NaN == NaN being false here: 
https://github.com/apache/arrow-rs/blob/4f1f6e57c568fae8233ab9da7d7c7acdaea4112a/parquet/src/util/interner.rs#L69
   
   So each NaN value fails to match any existing dictionary entry, and a new entry is added to the dictionary key storage and hash table. Because all of these entries share the same hash, each lookup must probe past every NaN entry inserted so far, making inserts progressively slower, which is consistent with the roughly quadratic growth measured above.
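   The root cause can be seen with plain `std` floats: IEEE 754 equality never matches NaN against itself, whereas comparing raw bit patterns does (for NaNs with the same payload):

   ```rust
   fn main() {
       let a = f32::NAN;
       let b = f32::NAN;
       // IEEE 754: NaN compares unequal to everything, including itself,
       // so an interner using `==` never finds an existing NaN entry.
       assert!(a != b);
       // Comparing the raw bit patterns instead treats identical NaNs as
       // equal, so a dictionary keyed on bits would intern them once.
       assert_eq!(a.to_bits(), b.to_bits());
   }
   ```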
   
   Arrow C++ doesn't have this problem, as they specialise the comparison of 
floating point types to treat NaNs as equal to all other NaNs for hashing: 
https://github.com/apache/arrow/blob/5ad0b3e36f9302c4cf8dd5ab997f30bfab95e2d4/cpp/src/arrow/util/hashing.h#L132-L144
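   
   A minimal sketch of that approach in Rust (the `F32Interner` type and `canonical_bits` helper are hypothetical, not the interner in this crate): normalise every NaN to one canonical bit pattern before hashing and comparing, so all NaNs collapse into a single dictionary entry.
   
   ```rust
   use std::collections::HashMap;
   
   /// Map any NaN (regardless of payload or sign) to one canonical bit
   /// pattern; non-NaN values keep their own bits.
   fn canonical_bits(v: f32) -> u32 {
       if v.is_nan() { f32::NAN.to_bits() } else { v.to_bits() }
   }
   
   struct F32Interner {
       lookup: HashMap<u32, usize>, // canonical bits -> dictionary index
       values: Vec<f32>,            // dictionary key storage
   }
   
   impl F32Interner {
       fn new() -> Self {
           Self { lookup: HashMap::new(), values: Vec::new() }
       }
   
       /// Return the dictionary index for `v`, inserting it if unseen.
       fn intern(&mut self, v: f32) -> usize {
           let bits = canonical_bits(v);
           let values = &mut self.values;
           *self.lookup.entry(bits).or_insert_with(|| {
               let idx = values.len();
               values.push(v);
               idx
           })
       }
   }
   
   fn main() {
       let mut interner = F32Interner::new();
       let a = interner.intern(f32::NAN);
       let b = interner.intern(f32::NAN);
       let c = interner.intern(0.0);
       // All NaNs share one dictionary entry; 0.0 gets its own.
       assert_eq!(a, b);
       assert_ne!(a, c);
       println!("nan index: {a}, zero index: {c}"); // prints "nan index: 0, zero index: 1"
   }
   ```
   
   One consequence of this design is that all NaN payloads encode to a single dictionary value, which matches what Arrow C++ does for hashing.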

