mbutrovich commented on code in PR #4753:
URL: https://github.com/apache/datafusion-comet/pull/4753#discussion_r3500038271
##########
native/spark-expr/src/datetime_funcs/extract_date_part.rs:
##########
@@ -98,3 +116,128 @@ macro_rules! extract_date_part {
extract_date_part!(SparkHour, "hour", Hour);
extract_date_part!(SparkMinute, "minute", Minute);
extract_date_part!(SparkSecond, "second", Second);
+
+#[cfg(test)]
+mod tests {
+ use super::*;
+ use arrow::array::{ArrayRef, Int32Array, TimestampMicrosecondArray};
+ use arrow::datatypes::{Field, Int8Type, TimeUnit};
+ use datafusion::config::ConfigOptions;
+ use std::sync::Arc;
+
+ // 2024-01-15 18:30:45 UTC, in microseconds since the epoch.
+ const MICROS: i64 = 1_705_343_445_000_000;
+
+ /// Invokes a single-argument date-part UDF on `array` and returns the
first extracted value.
+ fn invoke<U: ScalarUDFImpl>(udf: &U, array: ArrayRef) -> i32 {
+ let return_field = Arc::new(Field::new("v", DataType::Int32, true));
+ let args = ScalarFunctionArgs {
+ args: vec![ColumnarValue::Array(array)],
+ number_rows: 1,
+ return_field,
+ config_options: Arc::new(ConfigOptions::default()),
+ arg_fields: vec![],
+ };
+ match udf.invoke_with_args(args).unwrap() {
+ ColumnarValue::Array(arr) => {
+ arr.as_any().downcast_ref::<Int32Array>().unwrap().value(0)
+ }
+ _ => panic!("Expected array"),
+ }
+ }
+
+ fn ntz_array() -> ArrayRef {
+ Arc::new(TimestampMicrosecondArray::from(vec![Some(MICROS)]))
+ }
+
+ #[test]
+ fn timestamp_with_timezone_converts_to_session_timezone() {
+ // A timezone-aware timestamp stores the absolute UTC instant. In a
Los Angeles session
+ // (UTC-8 in January) the local hour is 18 - 8 = 10.
+ let udf = SparkHour::new("America/Los_Angeles".to_string());
+ let array =
TimestampMicrosecondArray::from(vec![Some(MICROS)]).with_timezone("UTC");
+ assert_eq!(invoke(&udf, Arc::new(array)), 10);
+ }
+
+ #[test]
+ fn timestamp_ntz_extracts_local_time_without_conversion() {
+ // TimestampNTZ stores local wall-clock time, so the hour is extracted
directly (18) with
+ // no session timezone offset applied (issue #3180).
+ let udf = SparkHour::new("America/Los_Angeles".to_string());
+ assert_eq!(invoke(&udf, ntz_array()), 18);
+ }
+
+ #[test]
+ fn timestamp_ntz_result_is_independent_of_session_timezone() {
+ // The extracted hour must be the same regardless of the session
timezone.
+ for tz in ["UTC", "America/Los_Angeles", "Asia/Tokyo"] {
+ let udf = SparkHour::new(tz.to_string());
+ assert_eq!(invoke(&udf, ntz_array()), 18);
+ }
+ }
+
+ #[test]
+ fn minute_and_second_on_timestamp_ntz() {
+ assert_eq!(
+ invoke(&SparkMinute::new("Asia/Tokyo".to_string()), ntz_array()),
+ 30
+ );
+ assert_eq!(
+ invoke(&SparkSecond::new("Asia/Tokyo".to_string()), ntz_array()),
+ 45
+ );
+ }
+
+ #[test]
+ fn is_timestamp_ntz_detects_plain_and_dictionary_wrapped() {
+ assert!(is_timestamp_ntz(&DataType::Timestamp(
+ TimeUnit::Microsecond,
+ None
+ )));
+ assert!(is_timestamp_ntz(&DataType::Dictionary(
+ Box::new(DataType::Int8),
+ Box::new(DataType::Timestamp(TimeUnit::Microsecond, None)),
+ )));
+ // Timezone-aware timestamps and dictionaries of them are not NTZ.
+ assert!(!is_timestamp_ntz(&DataType::Timestamp(
+ TimeUnit::Microsecond,
+ Some("UTC".into()),
+ )));
+ assert!(!is_timestamp_ntz(&DataType::Dictionary(
+ Box::new(DataType::Int8),
+ Box::new(DataType::Timestamp(
+ TimeUnit::Microsecond,
+ Some("UTC".into())
+ )),
+ )));
+ }
+
+ #[test]
+ fn timestamp_ntz_dictionary_input_extracts_local_time() {
+ use arrow::array::{DictionaryArray, Int8Array};
+ let values =
Arc::new(TimestampMicrosecondArray::from(vec![Some(MICROS)])) as ArrayRef;
+ let keys = Int8Array::from(vec![0i8]);
Review Comment:
Small test-fidelity thought. Comet's dictionary columns use `Int32` keys,
and `return_type` declares the dictionary result as `Dictionary(Int32, Int32)`.
arrow's `date_part` preserves the key type (it does `array.with_values(...)`),
so an `Int8`-keyed input actually comes back as `Dictionary(Int8, Int32)`,
which does not match the declared return type. The test gets away with it
because it casts the result to `Int32` before checking, which is what the
"regardless of whether the kernel kept the dictionary" comment is papering over.
Would it be worth switching this to an `Int32`-keyed dictionary so the test
exercises the real production shape and the declared return type? As a bonus
you could then assert the result is still a dictionary, which would tighten the
coverage rather than casting it away. Not a correctness problem in production
since Comet only emits `Int32` keys, just making the test match what actually
flows through.
##########
native/spark-expr/src/datetime_funcs/extract_date_part.rs:
##########
@@ -24,6 +24,17 @@ use datafusion::logical_expr::{
};
use std::fmt::Debug;
+/// Returns true when the type is a timestamp without a timezone (Spark's
TimestampNTZType),
+/// including when wrapped in a dictionary. Such values store local wall-clock
time and must not
+/// have any session timezone offset applied when extracting date parts.
+fn is_timestamp_ntz(data_type: &DataType) -> bool {
Review Comment:
Nice little helper, and recursing through the dictionary wrapper is the
right move. Purely a forward pointer, no action needed here: the
`CometTemporalExpressionSuite` comments note `date_trunc` has the same "do not
apply a session offset to NTZ" issue for non-UTC zones (#2649). If that one
gets picked up later, this same predicate looks like it would carry over, so it
might be worth keeping in mind as a shared piece rather than re-deriving it
there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]