klion26 commented on code in PR #8516:
URL: https://github.com/apache/arrow-rs/pull/8516#discussion_r2404021331
##########
parquet-variant-compute/src/type_conversion.rs:
##########
@@ -38,12 +38,33 @@ pub(crate) trait PrimitiveFromVariant: ArrowPrimitiveType {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>;
}
+/// Extension trait for Arrow timestamp types that can extract their native
value from a Variant
+/// We can't use [`PrimitiveFromVariant`] directly because we might need to
use methods that
+/// are only available on [`ArrowTimestampType`] (such as with_timezone_opt)
+pub(crate) trait TimestampFromVariant: ArrowTimestampType {
+ fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>;
+}
+
/// Macro to generate PrimitiveFromVariant implementations for Arrow primitive
types
macro_rules! impl_primitive_from_variant {
- ($arrow_type:ty, $variant_method:ident) => {
+ ($arrow_type:ty, $variant_method:ident $(, $cast_fn:expr)?) => {
impl PrimitiveFromVariant for $arrow_type {
fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>
{
- variant.$variant_method()
+ let value = variant.$variant_method();
+ $( let value = value.map($cast_fn); )?
+ value
+ }
+ }
+ };
+ ($arrow_type:ty, $( $variant_type:pat => $variant_method:ident,
$cast_fn:expr ),+ $(,)?) => {
+ impl TimestampFromVariant for $arrow_type {
+ fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>
{
+ match variant {
+ $(
+ $variant_type =>
variant.$variant_method().map($cast_fn),
+ )+
+ _ => None
+ }
Review Comment:
OK, I'll try it this way.
> But the current code also allows invalid conversions, such as interpreting
an NTZ timestamp as UTC, because the current as_timestamp_xxx methods are too
narrow of a waist and lose information.
Does this refer to the `as_timestamp_xxx` methods themselves, or to the
end-to-end variant-to-Arrow conversion here? If the former, then yes, it may be
wrong (or maybe we can treat the return value as the *physically* stored
value). If the latter, we'll attach the timezone info when initializing the
builder.
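
To make the two readings concrete, here is a minimal sketch. `SketchVariant`
and `to_column_value` are hypothetical stand-ins, not the parquet-variant API,
and the payloads are raw microsecond counts rather than chrono types. Whether a
mismatch should be rejected, as it is here, is an assumption of the sketch.

```rust
// Hypothetical, simplified stand-in for the timestamp variants under discussion.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchVariant {
    TimestampMicros(i64),    // an instant, counted from the UTC epoch
    TimestampNtzMicros(i64), // a wall-clock reading with no zone attached
    Int64(i64),
}

impl SketchVariant {
    // Single accessor in the style of `as_timestamp_micros`: both timestamp
    // flavours collapse into the same `Option<i64>`, so the tag is gone.
    pub fn as_timestamp_micros(&self) -> Option<i64> {
        match *self {
            SketchVariant::TimestampMicros(v) | SketchVariant::TimestampNtzMicros(v) => Some(v),
            _ => None,
        }
    }
}

// Hypothetical end-to-end conversion: the target column knows whether it
// carries a timezone, so the flavour is checked on the variant tag itself,
// before any accessor has discarded it.
pub fn to_column_value(v: &SketchVariant, column_has_tz: bool) -> Option<i64> {
    match (v, column_has_tz) {
        (SketchVariant::TimestampMicros(x), true) => Some(*x),
        (SketchVariant::TimestampNtzMicros(x), false) => Some(*x),
        _ => None,
    }
}
```

With the single accessor, a TZ and an NTZ variant holding the same count are
indistinguishable to the caller. The end-to-end function can still tell them
apart because it matches on the variant directly.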
##########
parquet-variant-compute/src/type_conversion.rs:
##########
@@ -60,6 +65,44 @@ impl_primitive_from_variant!(datatypes::UInt64Type, as_u64);
impl_primitive_from_variant!(datatypes::Float16Type, as_f16);
impl_primitive_from_variant!(datatypes::Float32Type, as_f32);
impl_primitive_from_variant!(datatypes::Float64Type, as_f64);
+impl_primitive_from_variant!(
+ datatypes::Date32Type,
+ as_naive_date,
+ Date32Type::from_naive_date
+);
+
+pub(crate) trait TimestampFromVariant: ArrowTimestampType {
+ fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>;
+}
+
+macro_rules! impl_timestamp_from_variant {
+ ($timestamp_type:ty, {
+ $(($variant_pattern:pat, $conversion:expr)),+ $(,)?
+ }) => {
+ impl TimestampFromVariant for $timestamp_type {
+ fn from_variant(variant: &Variant<'_, '_>) -> Option<Self::Native>
{
+ match variant {
+ $(
+ $variant_pattern => $conversion,
+ )+
+ _ => None,
+ }
+ }
+ }
+ };
+}
+
+impl_timestamp_from_variant!(TimestampMicrosecondType, {
+ (Variant::TimestampMicros(t), Some(t.timestamp_micros())),
+ (Variant::TimestampNtzMicros(t), Some(t.and_utc().timestamp_micros())),
+});
+
+impl_timestamp_from_variant!(TimestampNanosecondType, {
+ (Variant::TimestampMicros(t), Some(t.timestamp_micros()).map(|t| t *
1000)),
+ (Variant::TimestampNtzMicros(t),
Some(t.and_utc().timestamp_micros()).map(|t| t * 1000)),
+ (Variant::TimestampNanos(t), t.timestamp_nanos_opt()),
+ (Variant::TimestampNtzNanos(t), t.and_utc().timestamp_nanos_opt()),
+});
Review Comment:
> We can "safely" convert a TZ type to an NTZ type
No, I don't think we can do this; it would lead to the *wrong* result. The
timestamp (the long value) for TZ is measured from
`1970-01-01 00:00:00 at +00:00`, whereas the NTZ value is measured from
`1970-01-01 00:00:00 in the local timezone`.
> But arrow doesn't distinguish physically between TZ and NTZ
IIUC, we don't need to distinguish these two when physically storing the
value; both are the offset between *now* and some reference point
(`1970-01-01 00:00:00 at +00:00` for TZ, and
`1970-01-01 00:00:00 in the local timezone` for NTZ).
> So maybe the correct approach will be to add
Variant::as_timestamp[_ntz]_[micros|nanos] methods,
Separating the TZ and NTZ versions, `Variant::as_timestamp[_ntz]_[micros|nanos]`,
so that they return `DateTime<Utc>` and `NaiveDateTime` respectively, seems
like a better idea here.
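
To illustrate why the two long values are not interchangeable, here is a small
sketch. It assumes a fixed UTC offset (no DST handling), and the helper name
`wall_clock_to_utc_micros` is hypothetical.

```rust
const MICROS_PER_SECOND: i64 = 1_000_000;

// Hypothetical helper: turn a zoneless wall-clock reading (microseconds counted
// from 1970-01-01 00:00:00 on that same wall clock) into a UTC-based instant,
// given the UTC offset, in seconds, of the zone the reading was taken in.
pub fn wall_clock_to_utc_micros(wall_micros: i64, utc_offset_seconds: i64) -> i64 {
    wall_micros - utc_offset_seconds * MICROS_PER_SECOND
}
```

For a wall-clock reading of 2025-10-03 12:34:56 (1_759_494_896_000_000 µs)
taken at UTC+08:00, the helper returns 1_759_466_096_000_000, which is 28,800
seconds earlier. The two values coincide only when the offset is zero.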
##########
parquet-variant/src/variant.rs:
##########
@@ -561,6 +561,72 @@ impl<'m, 'v> Variant<'m, 'v> {
}
}
+ /// Converts this variant to a `i64` representing microseconds since the
Unix epoch if possible.
+ /// This is useful when converting the variant to Arrow types.
+ ///
+ /// Returns Some(i64) for [`Variant::TimestampMicros`] and
[`Variant::TimestampNtzMicros`],
+ /// None for the other variant types.
+ ///
+ /// ```
+ /// use parquet_variant::Variant;
+ /// use chrono::NaiveDate;
+ ///
+ /// // you can extract an i64 from Variant::TimestampMicros
+ /// let datetime = NaiveDate::from_ymd_opt(2025, 10,
03).unwrap().and_hms_milli_opt(12, 34, 56, 789).unwrap().and_utc();
+ /// let v1 = Variant::from(datetime);
+ /// assert_eq!(v1.as_timestamp_micros(), Some(1759494896789000));
+ ///
+ /// // or Variant::TimestampNtzMicros
+ /// let datetime_ntz = NaiveDate::from_ymd_opt(2025, 10,
03).unwrap().and_hms_milli_opt(12, 34, 56, 789).unwrap();
+ /// let v2 = Variant::from(datetime_ntz);
+ /// assert_eq!(v2.as_timestamp_micros(), Some(1759494896789000));
+ ///
+ /// // but not from other variants
+ /// let datetime_nanos = NaiveDate::from_ymd_opt(2025, 10,
03).unwrap().and_hms_nano_opt(12, 34, 56, 789123456).unwrap().and_utc();
+ /// let v3 = Variant::from(datetime_nanos);
+ /// assert_eq!(v3.as_timestamp_micros(), None);
+ /// ```
+ pub fn as_timestamp_micros(&self) -> Option<i64> {
+ match *self {
+ Variant::TimestampMicros(d) => Some(d.timestamp_micros()),
+ Variant::TimestampNtzMicros(d) =>
Some(d.and_utc().timestamp_micros()),
Review Comment:
I'm not sure I fully understand this. If `lossy` here means that we lose the
timezone info, then yes, it is lossy. The `timestamp` here means the physically
stored value (of type long) for the `NaiveDateTime` and `DateTime<Utc>`. If we
return `Option<NaiveDateTime>`/`Option<DateTime<Utc>>`, then separating the NTZ
and TZ versions is a better idea. But when the return value is `Option<i64>`
(the underlying long timestamp value), does it make any difference whether
they are separated?
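
One way to frame the question: even with an `Option<i64>` return type, split
accessors differ from a merged one in which variants they accept. A sketch
with hypothetical names (`TsVariant` is not the parquet-variant type):

```rust
// Hypothetical, simplified variant with raw microsecond payloads.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TsVariant {
    TimestampMicros(i64),
    TimestampNtzMicros(i64),
}

impl TsVariant {
    // Accepts only the timezone-aware flavour.
    pub fn as_timestamp_micros(&self) -> Option<i64> {
        match *self {
            TsVariant::TimestampMicros(v) => Some(v),
            _ => None,
        }
    }

    // Accepts only the zoneless flavour.
    pub fn as_timestamp_ntz_micros(&self) -> Option<i64> {
        match *self {
            TsVariant::TimestampNtzMicros(v) => Some(v),
            _ => None,
        }
    }
}
```

The returned number has the same type either way, but here the caller's choice
of method states which flavour it expects, and the other flavour yields `None`.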