Re: [PR] [Feature][QDP]ParquetReader generic over f32/f64 with Arrow cast [mahout]

via GitHub Tue, 09 Jun 2026 05:38:14 -0700


rich7420 commented on code in PR #1393:
URL: https://github.com/apache/mahout/pull/1393#discussion_r3380622499



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -180,8 +305,11 @@ impl DataReader for ParquetReader {
                             all_data.reserve(current_size * self.total_rows);
                         }
 
-                        handle_float64_nulls(&mut all_data, float_array, 
self.null_handling)?;
-
+                        T::extend_from_arrow_array(

Review Comment:
   This calls `extend_from_arrow_array` once per row, so on the cross-dtype 
path `compute::cast` runs (and allocates) per row. The FixedSizeList arm 
already casts the whole `values()` buffer once — could do the same here: 
validate sizes in the loop, then extend once from `list_array.values()`. Big 
win for wide/large files.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -17,24 +17,180 @@
 //! Parquet format reader implementation.
 
 use std::fs::File;
+use std::marker::PhantomData;
 use std::path::Path;
 
-use arrow::array::{Array, FixedSizeListArray, Float64Array, ListArray};
+use arrow::array::{Array, FixedSizeListArray, Float32Array, Float64Array, 
ListArray};
+use arrow::compute;
 use arrow::datatypes::DataType;
 use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 
 use crate::error::{MahoutError, Result};
-use crate::reader::{DataReader, NullHandling, StreamingDataReader, 
handle_float64_nulls};
+use crate::reader::{
+    DataReader, FloatElem, NullHandling, StreamingDataReader, 
handle_float32_nulls,
+    handle_float64_nulls,
+};
 
-/// Reader for Parquet files containing List<Float64> or 
FixedSizeList<Float64> columns.
-pub struct ParquetReader {
+// ---------------------------------------------------------------------------
+// Internal sealed helper trait
+// ---------------------------------------------------------------------------
+
+/// Zero-copy or Arrow-cast extraction from an Arrow array into an output 
`Vec<Self>`.
+///
+/// Implemented for `f32` and `f64` only:
+/// - same dtype → `extend_from_slice` directly from the Arrow buffer (zero 
alloc)
+/// - cross dtype → `arrow::compute::cast` first, then `extend_from_slice`
+///   - f32 → f64: exact, NaN preserved
+///   - f64 → f32: values outside f32 range become ±Inf, NaN preserved
+pub(crate) trait ArrowPrimitive: FloatElem {
+    fn extend_from_arrow_array(
+        output: &mut Vec<Self>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()>;
+
+    fn collect_from_arrow_array(
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<Vec<Self>> {
+        let mut out = Vec::new();
+        Self::extend_from_arrow_array(&mut out, array, null_handling)?;
+        Ok(out)
+    }
+}
+
+impl ArrowPrimitive for f64 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f64>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float64 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Float64 downcast 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float32 => {
+                let casted = compute::cast(array, &DataType::Float64)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f32→f64: 
{e}")))?;

Review Comment:
   Cast failures map to `MahoutError::Io`, but a dtype conversion isn't really 
I/O. `InvalidInput` fits better and matches the `other =>` arm right below. 
Same for the f64→f32 one on L112.



##########
qdp/qdp-core/src/reader.rs:
##########
@@ -96,6 +96,34 @@ pub fn handle_float64_nulls(
     Ok(())
 }
 
+/// Append values from a `Float32Array` into `output`, applying the given null 
policy.
+///
+/// When there are no nulls the fast path copies the underlying buffer 
directly.
+pub fn handle_float32_nulls(

Review Comment:
   `handle_float32_nulls` is a verbatim copy of `handle_float64_nulls` (only 
the type + the `32`/`64` literal differ). Could be one generic 
`handle_primitive_nulls<T: ArrowPrimitiveType>` with `T::Native: Default` (use 
`unwrap_or_default()`), keeping these two as thin wrappers for the public API. 
Optional.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -17,24 +17,180 @@
 //! Parquet format reader implementation.
 
 use std::fs::File;
+use std::marker::PhantomData;
 use std::path::Path;
 
-use arrow::array::{Array, FixedSizeListArray, Float64Array, ListArray};
+use arrow::array::{Array, FixedSizeListArray, Float32Array, Float64Array, 
ListArray};
+use arrow::compute;
 use arrow::datatypes::DataType;
 use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 
 use crate::error::{MahoutError, Result};
-use crate::reader::{DataReader, NullHandling, StreamingDataReader, 
handle_float64_nulls};
+use crate::reader::{
+    DataReader, FloatElem, NullHandling, StreamingDataReader, 
handle_float32_nulls,
+    handle_float64_nulls,
+};
 
-/// Reader for Parquet files containing List<Float64> or 
FixedSizeList<Float64> columns.
-pub struct ParquetReader {
+// ---------------------------------------------------------------------------
+// Internal sealed helper trait
+// ---------------------------------------------------------------------------
+
+/// Zero-copy or Arrow-cast extraction from an Arrow array into an output 
`Vec<Self>`.
+///
+/// Implemented for `f32` and `f64` only:
+/// - same dtype → `extend_from_slice` directly from the Arrow buffer (zero 
alloc)
+/// - cross dtype → `arrow::compute::cast` first, then `extend_from_slice`
+///   - f32 → f64: exact, NaN preserved
+///   - f64 → f32: values outside f32 range become ±Inf, NaN preserved
+pub(crate) trait ArrowPrimitive: FloatElem {
+    fn extend_from_arrow_array(
+        output: &mut Vec<Self>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()>;
+
+    fn collect_from_arrow_array(
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<Vec<Self>> {
+        let mut out = Vec::new();
+        Self::extend_from_arrow_array(&mut out, array, null_handling)?;
+        Ok(out)
+    }
+}
+
+impl ArrowPrimitive for f64 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f64>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float64 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Float64 downcast 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float32 => {
+                let casted = compute::cast(array, &DataType::Float64)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f32→f64: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float64 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+impl ArrowPrimitive for f32 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f32>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float32 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float32Array>()
+                    .ok_or_else(|| MahoutError::Io("Float32 downcast 
failed".to_string()))?;
+                handle_float32_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float64 => {
+                // f64 → f32: values outside f32 range become ±Inf; NaN is 
preserved
+                let casted = compute::cast(array, &DataType::Float32)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f64→f32: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float32Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float32 
failed".to_string()))?;
+                handle_float32_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Shared schema validator
+// ---------------------------------------------------------------------------
+
+fn validate_float_list_schema(field: &arrow::datatypes::Field) -> Result<()> {
+    match field.data_type() {
+        DataType::List(child_field) => {
+            if !matches!(
+                child_field.data_type(),
+                DataType::Float32 | DataType::Float64
+            ) {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected List<Float32> or List<Float64> column, got 
List<{:?}>",
+                    child_field.data_type()
+                )));
+            }
+        }
+        DataType::FixedSizeList(child_field, _) => {
+            if !matches!(
+                child_field.data_type(),
+                DataType::Float32 | DataType::Float64
+            ) {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected FixedSizeList<Float32> or FixedSizeList<Float64> 
column, got FixedSizeList<{:?}>",
+                    child_field.data_type()
+                )));
+            }
+        }
+        _ => {
+            return Err(MahoutError::InvalidInput(format!(
+                "Expected List<Float32/Float64> or 
FixedSizeList<Float32/Float64> column, got {:?}",
+                field.data_type()
+            )));
+        }
+    }
+    Ok(())
+}
+
+fn validate_float_list_or_scalar_schema(field: &arrow::datatypes::Field) -> 
Result<()> {
+    match field.data_type() {
+        DataType::Float32 | DataType::Float64 => Ok(()),
+        _ => validate_float_list_schema(field),
+    }
+}
+
+// ---------------------------------------------------------------------------
+// ParquetReader<T>
+// ---------------------------------------------------------------------------
+
+/// Reader for Parquet files containing `List<Float32/Float64>` or
+/// `FixedSizeList<Float32/Float64>` columns.
+///
+/// Generic over `T` (`f32` or `f64`):
+/// - same dtype as the file → zero-copy path via `extend_from_slice`
+/// - different dtype → `arrow::compute::cast` (f64→f32: overflow → ±Inf; NaN 
preserved)
+pub struct ParquetReader<T: FloatElem = f64> {
     reader: Option<parquet::arrow::arrow_reader::ParquetRecordBatchReader>,
     sample_size: Option<usize>,
     total_rows: usize,
     null_handling: NullHandling,
+    _phantom: PhantomData<T>,
 }
 
-impl ParquetReader {
+#[allow(private_bounds)]

Review Comment:
   `#[allow(private_bounds)]` on a public `new` because `ArrowPrimitive` is 
`pub(crate)` — downstream users can't name the bound and only see it via a 
compile error. Since it's effectively a sealed f32/f64 trait, putting the bound 
on the public `FloatElem` (or a sealed public trait) would drop the `allow`. 
Minor design nit.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -439,24 +541,16 @@ impl StreamingDataReader for ParquetStreamingReader {
                                 continue;
                             }
 
-                            let mut batch_values = Vec::new();
+                            let mut batch_values: Vec<T> = Vec::new();
                             let mut current_sample_size = None;
                             for i in 0..list_array.len() {

Review Comment:
   Same per-row cast pattern as the non-streaming path (L292) — one 
`compute::cast` alloc per row on cross-dtype reads. Cast the batch `values()` 
once instead.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -17,24 +17,180 @@
 //! Parquet format reader implementation.
 
 use std::fs::File;
+use std::marker::PhantomData;
 use std::path::Path;
 
-use arrow::array::{Array, FixedSizeListArray, Float64Array, ListArray};
+use arrow::array::{Array, FixedSizeListArray, Float32Array, Float64Array, 
ListArray};
+use arrow::compute;
 use arrow::datatypes::DataType;
 use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 
 use crate::error::{MahoutError, Result};
-use crate::reader::{DataReader, NullHandling, StreamingDataReader, 
handle_float64_nulls};
+use crate::reader::{
+    DataReader, FloatElem, NullHandling, StreamingDataReader, 
handle_float32_nulls,
+    handle_float64_nulls,
+};
 
-/// Reader for Parquet files containing List<Float64> or 
FixedSizeList<Float64> columns.
-pub struct ParquetReader {
+// ---------------------------------------------------------------------------
+// Internal sealed helper trait
+// ---------------------------------------------------------------------------
+
+/// Zero-copy or Arrow-cast extraction from an Arrow array into an output 
`Vec<Self>`.
+///
+/// Implemented for `f32` and `f64` only:
+/// - same dtype → `extend_from_slice` directly from the Arrow buffer (zero 
alloc)
+/// - cross dtype → `arrow::compute::cast` first, then `extend_from_slice`
+///   - f32 → f64: exact, NaN preserved
+///   - f64 → f32: values outside f32 range become ±Inf, NaN preserved
+pub(crate) trait ArrowPrimitive: FloatElem {
+    fn extend_from_arrow_array(
+        output: &mut Vec<Self>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()>;
+
+    fn collect_from_arrow_array(
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<Vec<Self>> {
+        let mut out = Vec::new();
+        Self::extend_from_arrow_array(&mut out, array, null_handling)?;
+        Ok(out)
+    }
+}
+
+impl ArrowPrimitive for f64 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f64>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float64 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Float64 downcast 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float32 => {
+                let casted = compute::cast(array, &DataType::Float64)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f32→f64: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float64 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+impl ArrowPrimitive for f32 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f32>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float32 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float32Array>()
+                    .ok_or_else(|| MahoutError::Io("Float32 downcast 
failed".to_string()))?;
+                handle_float32_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float64 => {
+                // f64 → f32: values outside f32 range become ±Inf; NaN is 
preserved
+                let casted = compute::cast(array, &DataType::Float32)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f64→f32: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float32Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float32 
failed".to_string()))?;
+                handle_float32_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+// ---------------------------------------------------------------------------
+// Shared schema validator
+// ---------------------------------------------------------------------------
+
+fn validate_float_list_schema(field: &arrow::datatypes::Field) -> Result<()> {

Review Comment:
   The List and FixedSizeList arms run the identical `matches!(.., Float32 | 
Float64)` check. A tiny `fn is_supported_float(dt: &DataType) -> bool` reused 
here + in the scalar validator below would keep the accepted set in one place. 
nit.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -353,13 +455,13 @@ impl ParquetStreamingReader {
     }
 }
 
-impl DataReader for ParquetStreamingReader {
-    fn read_batch(&mut self) -> Result<(Vec<f64>, usize, usize)> {
+impl<T: FloatElem + ArrowPrimitive> DataReader<T> for 
ParquetStreamingReader<T> {
+    fn read_batch(&mut self) -> Result<(Vec<T>, usize, usize)> {
         let mut all_data = Vec::new();
         let mut num_samples = 0;
 
         loop {
-            let mut buffer = vec![0.0; 1024 * 1024]; // 1M elements buffer
+            let mut buffer = vec![T::default(); 1024 * 1024];

Review Comment:
   This 1M-element buffer gets allocated + zero-filled on every loop iteration 
(pre-existing, but now `T::default()`). It's fully overwritten up to `written`, 
so it can be hoisted out of the loop and reused. Minor.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -17,24 +17,180 @@
 //! Parquet format reader implementation.
 
 use std::fs::File;
+use std::marker::PhantomData;
 use std::path::Path;
 
-use arrow::array::{Array, FixedSizeListArray, Float64Array, ListArray};
+use arrow::array::{Array, FixedSizeListArray, Float32Array, Float64Array, 
ListArray};
+use arrow::compute;
 use arrow::datatypes::DataType;
 use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 
 use crate::error::{MahoutError, Result};
-use crate::reader::{DataReader, NullHandling, StreamingDataReader, 
handle_float64_nulls};
+use crate::reader::{
+    DataReader, FloatElem, NullHandling, StreamingDataReader, 
handle_float32_nulls,
+    handle_float64_nulls,
+};
 
-/// Reader for Parquet files containing List<Float64> or 
FixedSizeList<Float64> columns.
-pub struct ParquetReader {
+// ---------------------------------------------------------------------------
+// Internal sealed helper trait
+// ---------------------------------------------------------------------------
+
+/// Zero-copy or Arrow-cast extraction from an Arrow array into an output 
`Vec<Self>`.
+///
+/// Implemented for `f32` and `f64` only:
+/// - same dtype → `extend_from_slice` directly from the Arrow buffer (zero 
alloc)
+/// - cross dtype → `arrow::compute::cast` first, then `extend_from_slice`
+///   - f32 → f64: exact, NaN preserved
+///   - f64 → f32: values outside f32 range become ±Inf, NaN preserved
+pub(crate) trait ArrowPrimitive: FloatElem {
+    fn extend_from_arrow_array(
+        output: &mut Vec<Self>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()>;
+
+    fn collect_from_arrow_array(
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<Vec<Self>> {
+        let mut out = Vec::new();
+        Self::extend_from_arrow_array(&mut out, array, null_handling)?;
+        Ok(out)
+    }
+}
+
+impl ArrowPrimitive for f64 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f64>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float64 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Float64 downcast 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float32 => {
+                let casted = compute::cast(array, &DataType::Float64)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f32→f64: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float64 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+impl ArrowPrimitive for f32 {

Review Comment:
   These two `ArrowPrimitive` impls are basically mirror images (same-dtype 
downcast + cross-dtype cast, identical wiring). Could collapse into one generic 
helper + an associated type, e.g.:
   ```rust
   trait ArrowPrimitive: FloatElem { type Arrow: ArrowPrimitiveType<Native = 
Self>; /* default method */ }
   impl ArrowPrimitive for f64 { type Arrow = Float64Type; }
   impl ArrowPrimitive for f32 { type Arrow = Float32Type; }
   ```
   with a single `extend_floats::<P>(...)`. Cuts ~60 lines and keeps the 
zero-copy fast path. Not blocking, just nice-to-have.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -17,24 +17,180 @@
 //! Parquet format reader implementation.
 
 use std::fs::File;
+use std::marker::PhantomData;
 use std::path::Path;
 
-use arrow::array::{Array, FixedSizeListArray, Float64Array, ListArray};
+use arrow::array::{Array, FixedSizeListArray, Float32Array, Float64Array, 
ListArray};
+use arrow::compute;
 use arrow::datatypes::DataType;
 use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
 
 use crate::error::{MahoutError, Result};
-use crate::reader::{DataReader, NullHandling, StreamingDataReader, 
handle_float64_nulls};
+use crate::reader::{
+    DataReader, FloatElem, NullHandling, StreamingDataReader, 
handle_float32_nulls,
+    handle_float64_nulls,
+};
 
-/// Reader for Parquet files containing List<Float64> or 
FixedSizeList<Float64> columns.
-pub struct ParquetReader {
+// ---------------------------------------------------------------------------
+// Internal sealed helper trait
+// ---------------------------------------------------------------------------
+
+/// Zero-copy or Arrow-cast extraction from an Arrow array into an output 
`Vec<Self>`.
+///
+/// Implemented for `f32` and `f64` only:
+/// - same dtype → `extend_from_slice` directly from the Arrow buffer (zero 
alloc)
+/// - cross dtype → `arrow::compute::cast` first, then `extend_from_slice`
+///   - f32 → f64: exact, NaN preserved
+///   - f64 → f32: values outside f32 range become ±Inf, NaN preserved
+pub(crate) trait ArrowPrimitive: FloatElem {
+    fn extend_from_arrow_array(
+        output: &mut Vec<Self>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()>;
+
+    fn collect_from_arrow_array(
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<Vec<Self>> {
+        let mut out = Vec::new();
+        Self::extend_from_arrow_array(&mut out, array, null_handling)?;
+        Ok(out)
+    }
+}
+
+impl ArrowPrimitive for f64 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f64>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float64 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Float64 downcast 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float32 => {
+                let casted = compute::cast(array, &DataType::Float64)
+                    .map_err(|e| MahoutError::Io(format!("Arrow cast f32→f64: 
{e}")))?;
+                let arr = casted
+                    .as_any()
+                    .downcast_ref::<Float64Array>()
+                    .ok_or_else(|| MahoutError::Io("Cast to Float64 
failed".to_string()))?;
+                handle_float64_nulls(output, arr, null_handling)?;
+            }
+            other => {
+                return Err(MahoutError::InvalidInput(format!(
+                    "Expected Float32 or Float64 values, got {other:?}"
+                )));
+            }
+        }
+        Ok(())
+    }
+}
+
+impl ArrowPrimitive for f32 {
+    fn extend_from_arrow_array(
+        output: &mut Vec<f32>,
+        array: &dyn Array,
+        null_handling: NullHandling,
+    ) -> Result<()> {
+        match array.data_type() {
+            DataType::Float32 => {
+                let arr = array
+                    .as_any()
+                    .downcast_ref::<Float32Array>()
+                    .ok_or_else(|| MahoutError::Io("Float32 downcast 
failed".to_string()))?;
+                handle_float32_nulls(output, arr, null_handling)?;
+            }
+            DataType::Float64 => {

Review Comment:
   Heads up: `ParquetReader::<f32>` will happily take an f64 file and silently 
downcast — out-of-range → ±Inf with no warning/error. That's an easy way to get 
garbage into encoding without noticing. Is a strict mode (or at least a 
`log::warn`) worth it? Also wondering if this should tie into the existing 
`Precision`/`Dtype` instead of an ad-hoc cast. Genuine question, fine to punt 
to #1338.



##########
qdp/qdp-core/src/readers/parquet.rs:
##########
@@ -644,13 +713,17 @@ mod tests {
         let array = Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef;
         let file = write_test_parquet(schema, vec![array]);
 
-        let result = ParquetReader::new(file.path(), None, 
NullHandling::FillZero);
+        let result = ParquetReader::<f64>::new(file.path(), None, 
NullHandling::FillZero);
         assert!(result.is_err());
         let err_msg = match result {
             Err(e) => e.to_string(),
             Ok(_) => panic!(),
         };
-        assert!(err_msg.contains("Expected List<Float64> or 
FixedSizeList<Float64> column"));
+        assert!(
+            err_msg.contains("Expected List")

Review Comment:
   This assertion got pretty loose — the `|| contains("Float32/Float64")` 
substring shows up in almost every error message in this file, so the test 
would pass even if the wrong validation branch fired. Maybe assert on the 
actual `Int32`/`List<` dtype in the message instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Feature][QDP]ParquetReader generic over f32/f64 with Arrow cast [mahout]

Reply via email to