Copilot commented on code in PR #1938: URL: https://github.com/apache/auron/pull/1938#discussion_r2925245188
##########
native-engine/datafusion-ext-exprs/src/spark_randn.rs:
##########
@@ -0,0 +1,303 @@
+// Licensed to the Apache Software Foundation (ASF) under one or more
+// contributor license agreements. See the NOTICE file distributed with
+// this work for additional information regarding copyright ownership.
+// The ASF licenses this file to You under the Apache License, Version 2.0
+// (the "License"); you may not use this file except in compliance with
+// the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+use std::{
+    any::Any,
+    fmt::{Debug, Display, Formatter},
+    hash::{Hash, Hasher},
+    sync::Arc,
+};
+
+use arrow::{
+    array::{Float64Array, RecordBatch},
+    datatypes::{DataType, Schema},
+};
+use datafusion::{
+    common::Result,
+    logical_expr::ColumnarValue,
+    physical_expr::{PhysicalExpr, PhysicalExprRef},
+};
+use parking_lot::Mutex;
+use rand::{SeedableRng, rngs::StdRng};
+use rand_distr::{Distribution, StandardNormal};
+
+use crate::down_cast_any_ref;
+
+/// Returns random values with independent and identically distributed (i.e.d.)
+/// samples drawn from the standard normal distribution.
+///
+/// Matches Spark's behavior:
+/// - RNG is seeded with `seed + partition_id`
+/// - RNG state advances for each row (stateful across batches)
+pub struct SparkRandnExpr {
+    seed: i64,
+    partition_id: usize,
+    rng: Mutex<StdRng>,
+}
+
+impl SparkRandnExpr {
+    pub fn new(seed: i64, partition_id: usize) -> Self {
+        let effective_seed = (seed as u64).wrapping_add(partition_id as u64);
+        Self {
+            seed,
+            partition_id,
+            rng: Mutex::new(StdRng::seed_from_u64(effective_seed)),
+        }

Review Comment:
   The native implementation uses `StdRng` + `rand_distr::StandardNormal` to generate values, but Spark's `randn(seed)` is deterministic based on Spark's own RNG (seeded with `seed + partitionId`) and its Gaussian generation algorithm. With the current RNG choice, results for a fixed seed are very unlikely to match vanilla Spark exactly, which will break `checkSparkAnswerAndOperator` comparisons. Consider implementing Spark-compatible RNG/Gaussian generation (matching Spark's `XORShiftRandom` / `nextGaussian` behavior) and adding a golden test asserting that the first N outputs for a known (seed, partition_id) pair match Spark.

##########
native-engine/datafusion-ext-exprs/src/spark_randn.rs:
##########
@@ -0,0 +1,303 @@
(same diff context as above, ending at:)
+/// Returns random values with independent and identically distributed (i.e.d.)

Review Comment:
   Typo in the doc comment: “i.e.d.” should be “i.i.d.” (independent and identically distributed).
```suggestion
/// Returns random values with independent and identically distributed (i.i.d.)
```

##########
native-engine/datafusion-ext-functions/Cargo.toml:
##########
@@ -38,3 +38,5 @@ serde_json = { workspace = true }
 sonic-rs = { workspace = true }
 chrono = "0.4.43"
 chrono-tz = "0.10.4"
+rand = { workspace = true }
+rand_distr = { workspace = true }

Review Comment:
   `rand`/`rand_distr` were added as dependencies here, but this crate's source currently doesn't reference them. If `randn` moved to `datafusion-ext-exprs`, consider removing these dependencies from `datafusion-ext-functions` to avoid unnecessary dependency/compile-time bloat.
```suggestion
```

##########
spark-extension-shims-spark/src/test/scala/org/apache/auron/AuronFunctionSuite.scala:
##########
@@ -650,4 +650,23 @@
     }
   }
 }
+
+  test("randn function with seed") {
+    withTable("t1") {
+      sql("CREATE TABLE t1(id INT) USING parquet")
+      sql("INSERT INTO t1 VALUES(1), (2), (3)")
+
+      // Test randn with different seeds - should produce reproducible results
+      val query =
+        """
+          |SELECT
+          |  id,
+          |  randn(42) AS randn_seed_42,
+          |  randn(100) AS randn_seed_100
+          |FROM t1
+          |""".stripMargin
+
+      checkSparkAnswerAndOperator(query)

Review Comment:
   The new randn test only covers literal seeds. Given the converter currently pattern-matches only `Literal` seeds, it would be useful to add coverage for a foldable-but-non-literal seed expression (e.g., `randn(cast(42 as bigint))` or `randn(42 + 0)`) to ensure native execution matches vanilla Spark for common SQL usage patterns.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
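For context on the first review comment, the generator shape it asks for can be sketched as follows. This is a hedged, illustrative port, not Auron's or Spark's actual code: it combines the three-step xorshift core of Spark's `XORShiftRandom` with `java.util.Random`'s Marsaglia-polar `nextGaussian`. Spark additionally scrambles the initial seed with MurmurHash3 before use, which is omitted here, so this sketch alone is not bit-exact with Spark.

```rust
/// Sketch of a Spark-style RNG: xorshift64 state transition plus the
/// Marsaglia polar method used by java.util.Random::nextGaussian.
/// NOTE: real Spark first hashes the seed with MurmurHash3 (omitted here).
struct XorShiftRandom {
    seed: i64,
    cached_gaussian: Option<f64>,
}

impl XorShiftRandom {
    fn new(seed: i64) -> Self {
        // Spark: seed = hashSeed(init) via MurmurHash3; omitted in this sketch.
        Self { seed, cached_gaussian: None }
    }

    /// Mirrors XORShiftRandom.next(bits): three xorshift steps on the
    /// 64-bit state, then keep the low `bits` bits.
    fn next(&mut self, bits: u32) -> i32 {
        let mut s = self.seed ^ (self.seed << 21);
        s ^= ((s as u64) >> 35) as i64; // Java's `>>>` (unsigned shift)
        s ^= s << 4;
        self.seed = s;
        (s & ((1i64 << bits) - 1)) as i32
    }

    /// Mirrors java.util.Random::nextDouble: 53 random bits scaled into [0, 1).
    fn next_double(&mut self) -> f64 {
        let hi = (self.next(26) as i64) << 27;
        let lo = self.next(27) as i64;
        (hi + lo) as f64 * (1.0 / (1i64 << 53) as f64)
    }

    /// Mirrors java.util.Random::nextGaussian: Marsaglia polar method,
    /// caching the second value of each generated pair.
    fn next_gaussian(&mut self) -> f64 {
        if let Some(g) = self.cached_gaussian.take() {
            return g;
        }
        loop {
            let v1 = 2.0 * self.next_double() - 1.0;
            let v2 = 2.0 * self.next_double() - 1.0;
            let s = v1 * v1 + v2 * v2;
            if s > 0.0 && s < 1.0 {
                let mult = (-2.0 * s.ln() / s).sqrt();
                self.cached_gaussian = Some(v2 * mult);
                return v1 * mult;
            }
        }
    }
}
```

With the MurmurHash3 seed scramble added and the state seeded from `seed + partition_id`, the golden test suggested above could then assert the first N `next_gaussian()` outputs against values captured from vanilla Spark for the same (seed, partition_id) pair.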
