adriangb commented on code in PR #20958:
URL: https://github.com/apache/datafusion/pull/20958#discussion_r2940913813


##########
datafusion/physical-plan/src/joins/hash_join/partitioned_hash_eval.rs:
##########
@@ -47,8 +47,16 @@ pub struct SeededRandomState {
 impl SeededRandomState {
     /// Create a new SeededRandomState with the given seeds.
     pub const fn with_seeds(k0: u64, k1: u64, k2: u64, k3: u64) -> Self {
+        // Combine 4 seeds into one for foldhash's single-seed API

Review Comment:
   Is the goal here to avoid changing the public signature? IIRC this is only 
public because it gets used across modules / crates -> maybe we can just change 
the signature? If that's the case it would be good to add to the docstring 
"this is only meant for internal use, we may break the method, etc.".



##########
datafusion/physical-plan/src/aggregates/mod.rs:
##########
@@ -87,8 +87,16 @@ pub fn topk_types_supported(key_type: &DataType, value_type: 
&DataType) -> bool
 }
 
 /// Hard-coded seed for aggregations to ensure hash values differ from 
`RepartitionExec`, avoiding collisions.
-const AGGREGATION_HASH_SEED: ahash::RandomState =
-    ahash::RandomState::with_seeds('A' as u64, 'G' as u64, 'G' as u64, 'R' as 
u64);
+const AGGREGATION_HASH_SEED: datafusion_common::hash_utils::RandomState =
+    datafusion_common::hash_utils::RandomState::with_seed(
+        ('A' as u64)
+            .wrapping_mul(6364136223846793005)

Review Comment:
   Could you add a comment here explaining to us mere mortals where this number 
comes from?



##########
datafusion/physical-plan/src/joins/symmetric_hash_join.rs:
##########
@@ -239,7 +239,7 @@ impl SymmetricHashJoinExec {
             build_join_schema(&left_schema, &right_schema, join_type);
 
         // Initialize the random state for the join operation:
-        let random_state = RandomState::with_seeds(0, 0, 0, 0);
+        let random_state = RandomState::with_seed(0);

Review Comment:
   I thought we abstracted out the 2 random states we use (`0,0,0,0` and 
`H,A,S,H`) somewhere - should we re-use those?



##########
datafusion/core/tests/sql/unparser.rs:
##########
@@ -368,8 +357,9 @@ async fn collect_results(ctx: &SessionContext, original: 
&str) -> TestCaseResult
         }
     };
 
-    // Sort if needed for comparison
-    if !is_sorted {
+    // Always sort for deterministic comparison — even "sorted" results can 
have
+    // tied rows in different order between original and unparsed SQL.

Review Comment:
   Ah makes sense 👍🏻 



##########
datafusion/functions-aggregate/src/hyperloglog.rs:
##########
@@ -61,12 +61,7 @@ where
 /// shared across cluster, this SEED will have to be consistent across all
 /// parties otherwise we might have corruption. So ideally for later this seed
 /// shall be part of the serialized form (or stay unchanged across versions).
-const SEED: RandomState = RandomState::with_seeds(
-    0x885f6cab121d01a3_u64,
-    0x71e4379f2976ad8f_u64,
-    0xbf30173dd28a8816_u64,
-    0x0eaea5d736d733a4_u64,
-);
+const SEED: foldhash::quality::FixedState = 
foldhash::quality::FixedState::with_seed(0);

Review Comment:
   Is this going to be problematic? It says in the comment "ideally [...] this 
seed shall [...] stay unchanged across versions..."



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to