scovich commented on code in PR #7644: URL: https://github.com/apache/arrow-rs/pull/7644#discussion_r2140522645
########## parquet-variant/src/decoder.rs: ########## @@ -33,7 +34,18 @@ pub enum VariantPrimitiveType { BooleanTrue = 1, BooleanFalse = 2, Int8 = 3, + Int16 = 4, + Int32 = 5, + Int64 = 6, + Decimal4 = 8, Review Comment: Double? ########## parquet-variant/src/decoder.rs: ########## @@ -33,7 +34,18 @@ pub enum VariantPrimitiveType { BooleanTrue = 1, BooleanFalse = 2, Int8 = 3, + Int16 = 4, + Int32 = 5, + Int64 = 6, + Decimal4 = 8, + Decimal8 = 9, + Decimal16 = 10, // TODO: Add types for the rest of primitives, once API is agreed upon + Date = 11, + TimestampMicros = 12, + TimestampNTZMicros = 13, Review Comment: Acronyms and camel case are always awkward, but I _think_ we normally do init-caps instead of all-caps in rust? ```suggestion TimestampNtzMicros = 13, ``` ########## parquet-variant/src/decoder.rs: ########## @@ -90,6 +113,89 @@ pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { Ok(value) } +/// Decodes an Int16 from the value section of a variant. +pub(crate) fn decode_int16(value: &[u8]) -> Result<i16, ArrowError> { + let value = i16::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) Review Comment: Single-use variable considered annoying if it doesn't help line breaks etc? ```suggestion Ok(i16::from_le_bytes(array_from_slice(value, 1)?)) ``` (several more below, and one already-existing above) ########## parquet-variant/src/decoder.rs: ########## @@ -90,6 +113,89 @@ pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { Ok(value) } +/// Decodes an Int16 from the value section of a variant. +pub(crate) fn decode_int16(value: &[u8]) -> Result<i16, ArrowError> { + let value = i16::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int32 from the value section of a variant. +pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> { + let value = i32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int64 from the value section of a variant. +pub(crate) fn decode_int64(value: &[u8]) -> Result<i64, ArrowError> { + let value = i64::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Decimal4 from the value section of a variant. +pub(crate) fn decode_decimal4(value: &[u8]) -> Result<(i32, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i32::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal8 from the value section of a variant. +pub(crate) fn decode_decimal8(value: &[u8]) -> Result<(i64, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i64::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal16 from the value section of a variant. +pub(crate) fn decode_decimal16(value: &[u8]) -> Result<(i128, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i128::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Float from the value section of a variant. +pub(crate) fn decode_float(value: &[u8]) -> Result<f32, ArrowError> { + let value = f32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Date from the value section of a variant. +pub(crate) fn decode_date(value: &[u8]) -> Result<NaiveDate, ArrowError> { + let days_since_epoch = i32::from_le_bytes(array_from_slice(value, 1)?); + let value = (DateTime::UNIX_EPOCH + Duration::days(days_since_epoch as i64)).date_naive(); + Ok(value) Review Comment: I'm not sure what the official rules are, but I prefer to avoid `as` casting of integral types as error-prone. ```suggestion let value = DateTime::UNIX_EPOCH + Duration::days(i64::from(days_since_epoch))); Ok(value.date_naive()) ``` (tl;dr: Somebody reading the code has to stop and think about whether the `as` is safe/correct, vs. `from` that is obviously safe) ########## parquet-variant/src/decoder.rs: ########## @@ -33,7 +34,18 @@ pub enum VariantPrimitiveType { BooleanTrue = 1, BooleanFalse = 2, Int8 = 3, + Int16 = 4, + Int32 = 5, + Int64 = 6, + Decimal4 = 8, + Decimal8 = 9, + Decimal16 = 10, // TODO: Add types for the rest of primitives, once API is agreed upon + Date = 11, + TimestampMicros = 12, + TimestampNTZMicros = 13, + Float = 14, + Binary = 15, String = 16, } Review Comment: TimeNtz, TimestampNanos, TimestampNtzNanos? Uuid? (no strong opinion about whether this PR needs to support them, vs. future PR... but PR title and content should probably match?) ########## parquet-variant/src/decoder.rs: ########## @@ -90,6 +113,89 @@ pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { Ok(value) } +/// Decodes an Int16 from the value section of a variant. +pub(crate) fn decode_int16(value: &[u8]) -> Result<i16, ArrowError> { + let value = i16::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int32 from the value section of a variant. +pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> { + let value = i32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int64 from the value section of a variant. +pub(crate) fn decode_int64(value: &[u8]) -> Result<i64, ArrowError> { + let value = i64::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Decimal4 from the value section of a variant. +pub(crate) fn decode_decimal4(value: &[u8]) -> Result<(i32, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i32::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal8 from the value section of a variant. +pub(crate) fn decode_decimal8(value: &[u8]) -> Result<(i64, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i64::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal16 from the value section of a variant. +pub(crate) fn decode_decimal16(value: &[u8]) -> Result<(i128, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i128::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Float from the value section of a variant. +pub(crate) fn decode_float(value: &[u8]) -> Result<f32, ArrowError> { + let value = f32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Date from the value section of a variant. +pub(crate) fn decode_date(value: &[u8]) -> Result<NaiveDate, ArrowError> { + let days_since_epoch = i32::from_le_bytes(array_from_slice(value, 1)?); + let value = (DateTime::UNIX_EPOCH + Duration::days(days_since_epoch as i64)).date_naive(); + Ok(value) +} + +/// Decodes a TimestampMicros from the value section of a variant. +pub(crate) fn decode_timestamp_micros(value: &[u8]) -> Result<DateTime<Utc>, ArrowError> { + let micros_since_epoch = i64::from_le_bytes(array_from_slice(value, 1)?); + if let Some(value) = DateTime::from_timestamp_micros(micros_since_epoch) { + Ok(value) + } else { + Err(ArrowError::CastError(format!( + "Could not cast `{micros_since_epoch}` microseconds into a DateTime<Utc>" + ))) + } Review Comment: `ok_or_else`? ```suggestion DateTime::from_timestamp_micros(micros_since_epoch).ok_or_else(|| { ArrowError::CastError(format!( "Could not convert `{micros_since_epoch}` to a microsecond timestamp" )) }) ``` (re: error message change, see below) ########## parquet-variant/src/decoder.rs: ########## @@ -112,14 +218,222 @@ mod tests { #[test] fn test_i8() -> Result<(), ArrowError> { let value = [ - 3 << 2, // Primitive type for i8 - 42, + (VariantPrimitiveType::Int8 as u8) << 2, // Basic type + 0x2a, // Data ]; let result = decode_int8(&value)?; assert_eq!(result, 42); Review Comment: Should we also change this to 0x2a? ########## parquet-variant/src/decoder.rs: ########## @@ -112,14 +218,222 @@ mod tests { #[test] fn test_i8() -> Result<(), ArrowError> { let value = [ - 3 << 2, // Primitive type for i8 - 42, + (VariantPrimitiveType::Int8 as u8) << 2, // Basic type + 0x2a, // Data ]; let result = decode_int8(&value)?; assert_eq!(result, 42); Ok(()) } + #[test] + fn test_i16() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int16 as u8) << 2, // Basic type + 0xd2, + 0x04, // Data + ]; + let result = decode_int16(&value)?; + assert_eq!(result, 1234); Review Comment: 0x04d2? Writing the number as hex makes clear that the test is about getting the bytes where they belong. Otherwise, most humans will have to go to their calculator to check the hex for 1234. ########## parquet-variant/src/decoder.rs: ########## @@ -112,14 +218,222 @@ mod tests { #[test] fn test_i8() -> Result<(), ArrowError> { let value = [ - 3 << 2, // Primitive type for i8 - 42, + (VariantPrimitiveType::Int8 as u8) << 2, // Basic type + 0x2a, // Data ]; let result = decode_int8(&value)?; assert_eq!(result, 42); Ok(()) } + #[test] + fn test_i16() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int16 as u8) << 2, // Basic type + 0xd2, + 0x04, // Data + ]; + let result = decode_int16(&value)?; + assert_eq!(result, 1234); + Ok(()) + } + + #[test] + fn test_i32() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int32 as u8) << 2, // Basic type + 0x40, + 0xe2, + 0x01, + 0x00, // Data + ]; + let result = decode_int32(&value)?; + assert_eq!(result, 123456); + Ok(()) Review Comment: Something to consider: Unit tests that use `?` are hard to debug (don't get a line number for the actual failure, which can be "fun" if there's more than one `?` in the test. ```suggestion assert_eq!(decode_int32(&value), Ok(123456)); ``` (tho in the general case, a test littered with `.unwrap()` calls can be annoying in a different way, so there's no really good answer unless/until rust comes up with a way for `?` to panic in tests) ########## parquet-variant/src/variant.rs: ########## @@ -402,11 +403,21 @@ pub enum Variant<'m, 'v> { // TODO: Add types for the rest of the primitive types, once API is agreed upon Null, Int8(i8), - + Int16(i16), + Int32(i32), + Int64(i64), + Date(NaiveDate), + TimestampMicros(DateTime<Utc>), + TimestampNTZMicros(NaiveDateTime), + Decimal4 { integer: i32, scale: u8 }, + Decimal8 { integer: i64, scale: u8 }, + Decimal16 { integer: i128, scale: u8 }, Review Comment: It's hard to pass around the content of tuple/struct enum variants with multiple members -- you have to unpack them and pass scalar values around, which is bulky and error-prone (**). Should we consider creating a struct or family of structs for representing scalar decimal values? <details> <summary>(**)</summary> It would be lovely if `Variant::Decimal4` were an actual type in rust. It actually _is_ a type under the hood, but the language doesn't expose it as such, because the enum discriminant has to be stored somewhere and rust usually embeds it in the variant itself to save space. See https://rust-lang.github.io/rfcs/2195-really-tagged-unions.html#guide-level-explanation for gory details. </details> ########## parquet-variant/src/variant.rs: ########## @@ -483,6 +632,57 @@ impl<'m, 'v> From<i8> for Variant<'m, 'v> { } } +impl<'m, 'v> From<i16> for Variant<'m, 'v> { + fn from(value: i16) -> Self { + Variant::Int16(value) + } +} + +impl<'m, 'v> From<i32> for Variant<'m, 'v> { Review Comment: I'm not sure how useful these `From` would be in practice, especially once a variant builder/encoder API lands? What use case(s) would they serve? Are they designed to make unit tests easier to write? Do we anticipate the variant builder/encoder would use them internally? Also: * `Variant::String` and `Variant::ShortString` are problematic because there's no way to e.g. embed one in a `Variant::Object` later on. * We anyway couldn't create `Variant::Object` or `Variant::Array` (they would need a mutable `VariantMetadataBuilder` to store field names in, etc.) ########## parquet-variant/src/decoder.rs: ########## @@ -112,14 +218,222 @@ mod tests { #[test] fn test_i8() -> Result<(), ArrowError> { let value = [ - 3 << 2, // Primitive type for i8 - 42, + (VariantPrimitiveType::Int8 as u8) << 2, // Basic type + 0x2a, // Data ]; let result = decode_int8(&value)?; assert_eq!(result, 42); Ok(()) } + #[test] + fn test_i16() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int16 as u8) << 2, // Basic type + 0xd2, + 0x04, // Data + ]; + let result = decode_int16(&value)?; + assert_eq!(result, 1234); + Ok(()) + } + + #[test] + fn test_i32() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int32 as u8) << 2, // Basic type + 0x40, + 0xe2, + 0x01, + 0x00, // Data + ]; + let result = decode_int32(&value)?; + assert_eq!(result, 123456); + Ok(()) + } + + #[test] + fn test_i64() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int64 as u8) << 2, // Basic type + 0x15, + 0x81, + 0xe9, + 0x7d, + 0xf4, + 0x10, + 0x22, + 0x11, // Data + ]; + let result = decode_int64(&value)?; + assert_eq!(result, 1234567890123456789); + Ok(()) + } + + #[test] + fn test_decimal4() -> Result<(), ArrowError> { Review Comment: That said... it would probably be very hard to test every possible way things could go wrong, so some kind of fuzz testing would probably be necessary to get beyond basic bad-input testing? ########## parquet-variant/tests/variant_interop.rs: ########## @@ -46,22 +47,22 @@ fn get_primitive_cases() -> Vec<(&'static str, Variant<'static, 'static>)> { // Cases are commented out // Enabling is tracked in https://github.com/apache/arrow-rs/issues/7630 vec![ - // ("primitive_binary", Variant::Binary), + ("primitive_binary", Variant::Binary(&[0x03, 0x13, 0x37, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe])), ("primitive_boolean_false", Variant::BooleanFalse), ("primitive_boolean_true", Variant::BooleanTrue), - // ("primitive_date", Variant::Null), - //("primitive_decimal4", Variant::Null), - //("primitive_decimal8", Variant::Null), - //("primitive_decimal16", Variant::Null), - //("primitive_float", Variant::Null), + ("primitive_date", Variant::Date(NaiveDate::from_ymd_opt(2025, 4 , 16).unwrap())), + ("primitive_decimal4", Variant::Decimal4{integer: 1234, scale: 2}), + ("primitive_decimal8", Variant::Decimal8{integer: 1234567890, scale: 2}), + ("primitive_decimal16", Variant::Decimal16{integer: 1234567891234567890, scale: 2}), + ("primitive_float", Variant::Float(1234567890.1234)), ("primitive_int8", Variant::Int8(42)), - //("primitive_int16", Variant::Null), - //("primitive_int32", Variant::Null), - //("primitive_int64", Variant::Null), + ("primitive_int16", Variant::Int16(1234)), + ("primitive_int32", Variant::Int32(123456)), + ("primitive_int64", Variant::Int64(1234567890123456789)), ("primitive_null", Variant::Null), ("primitive_string", Variant::String("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), - //("primitive_timestamp", Variant::Null), - //("primitive_timestampntz", Variant::Null), + ("primitive_timestamp", Variant::TimestampMicros(NaiveDate::from_ymd_opt(2025, 4, 16).unwrap().and_hms_milli_opt(16, 34, 56, 780).unwrap().and_utc())), Review Comment: Those seem to be the same two values? Did I misread something? ########## parquet-variant/src/decoder.rs: ########## @@ -112,14 +218,222 @@ mod tests { #[test] fn test_i8() -> Result<(), ArrowError> { let value = [ - 3 << 2, // Primitive type for i8 - 42, + (VariantPrimitiveType::Int8 as u8) << 2, // Basic type + 0x2a, // Data ]; let result = decode_int8(&value)?; assert_eq!(result, 42); Ok(()) } + #[test] + fn test_i16() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int16 as u8) << 2, // Basic type + 0xd2, + 0x04, // Data + ]; + let result = decode_int16(&value)?; + assert_eq!(result, 1234); + Ok(()) + } + + #[test] + fn test_i32() -> Result<(), ArrowError> { + let value = [ + (VariantPrimitiveType::Int32 as u8) << 2, // Basic type + 0x40, + 0xe2, + 0x01, + 0x00, // Data + ]; + let result = decode_int32(&value)?; + assert_eq!(result, 123456); + Ok(()) Review Comment: Well, _that_ turned into a rabbit hole: https://internals.rust-lang.org/t/misusing-v2-try-and-to-make-unit-tests-nicer/23079/3 Feel free to ignore this whole distraction. ########## parquet-variant/src/decoder.rs: ########## @@ -90,6 +113,89 @@ pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { Ok(value) } +/// Decodes an Int16 from the value section of a variant. +pub(crate) fn decode_int16(value: &[u8]) -> Result<i16, ArrowError> { + let value = i16::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int32 from the value section of a variant. +pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> { + let value = i32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int64 from the value section of a variant. +pub(crate) fn decode_int64(value: &[u8]) -> Result<i64, ArrowError> { + let value = i64::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Decimal4 from the value section of a variant. +pub(crate) fn decode_decimal4(value: &[u8]) -> Result<(i32, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i32::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal8 from the value section of a variant. +pub(crate) fn decode_decimal8(value: &[u8]) -> Result<(i64, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i64::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal16 from the value section of a variant. +pub(crate) fn decode_decimal16(value: &[u8]) -> Result<(i128, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i128::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Float from the value section of a variant. +pub(crate) fn decode_float(value: &[u8]) -> Result<f32, ArrowError> { + let value = f32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Date from the value section of a variant. +pub(crate) fn decode_date(value: &[u8]) -> Result<NaiveDate, ArrowError> { + let days_since_epoch = i32::from_le_bytes(array_from_slice(value, 1)?); + let value = (DateTime::UNIX_EPOCH + Duration::days(days_since_epoch as i64)).date_naive(); + Ok(value) +} + +/// Decodes a TimestampMicros from the value section of a variant. +pub(crate) fn decode_timestamp_micros(value: &[u8]) -> Result<DateTime<Utc>, ArrowError> { + let micros_since_epoch = i64::from_le_bytes(array_from_slice(value, 1)?); + if let Some(value) = DateTime::from_timestamp_micros(micros_since_epoch) { + Ok(value) + } else { + Err(ArrowError::CastError(format!( + "Could not cast `{micros_since_epoch}` microseconds into a DateTime<Utc>" + ))) + } +} + +/// Decodes a TimestampNTZMicros from the value section of a variant. +pub(crate) fn decode_timestampntz_micros(value: &[u8]) -> Result<NaiveDateTime, ArrowError> { + let micros_since_epoch = i64::from_le_bytes(array_from_slice(value, 1)?); + if let Some(value) = DateTime::from_timestamp_micros(micros_since_epoch) { + Ok(value.naive_utc()) + } else { + Err(ArrowError::CastError(format!( + "Could not cast `{micros_since_epoch}` microseconds into a NaiveDateTime" + ))) + } Review Comment: ```suggestion let value = decode_timestampntx_micros(value)?; Ok(value.naive_utc()) ``` ########## parquet-variant/src/decoder.rs: ########## @@ -90,6 +113,89 @@ pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { Ok(value) } +/// Decodes an Int16 from the value section of a variant. +pub(crate) fn decode_int16(value: &[u8]) -> Result<i16, ArrowError> { + let value = i16::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int32 from the value section of a variant. +pub(crate) fn decode_int32(value: &[u8]) -> Result<i32, ArrowError> { + let value = i32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes an Int64 from the value section of a variant. +pub(crate) fn decode_int64(value: &[u8]) -> Result<i64, ArrowError> { + let value = i64::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Decimal4 from the value section of a variant. +pub(crate) fn decode_decimal4(value: &[u8]) -> Result<(i32, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i32::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal8 from the value section of a variant. +pub(crate) fn decode_decimal8(value: &[u8]) -> Result<(i64, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i64::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Decimal16 from the value section of a variant. +pub(crate) fn decode_decimal16(value: &[u8]) -> Result<(i128, u8), ArrowError> { + let scale = u8::from_le_bytes(array_from_slice(value, 1)?); + let integer = i128::from_le_bytes(array_from_slice(value, 2)?); + Ok((integer, scale)) +} + +/// Decodes a Float from the value section of a variant. +pub(crate) fn decode_float(value: &[u8]) -> Result<f32, ArrowError> { + let value = f32::from_le_bytes(array_from_slice(value, 1)?); + Ok(value) +} + +/// Decodes a Date from the value section of a variant. +pub(crate) fn decode_date(value: &[u8]) -> Result<NaiveDate, ArrowError> { + let days_since_epoch = i32::from_le_bytes(array_from_slice(value, 1)?); + let value = (DateTime::UNIX_EPOCH + Duration::days(days_since_epoch as i64)).date_naive(); + Ok(value) +} + +/// Decodes a TimestampMicros from the value section of a variant. +pub(crate) fn decode_timestamp_micros(value: &[u8]) -> Result<DateTime<Utc>, ArrowError> { + let micros_since_epoch = i64::from_le_bytes(array_from_slice(value, 1)?); + if let Some(value) = DateTime::from_timestamp_micros(micros_since_epoch) { + Ok(value) + } else { + Err(ArrowError::CastError(format!( + "Could not cast `{micros_since_epoch}` microseconds into a DateTime<Utc>" + ))) + } +} + +/// Decodes a TimestampNTZMicros from the value section of a variant. +pub(crate) fn decode_timestampntz_micros(value: &[u8]) -> Result<NaiveDateTime, ArrowError> { + let micros_since_epoch = i64::from_le_bytes(array_from_slice(value, 1)?); + if let Some(value) = DateTime::from_timestamp_micros(micros_since_epoch) { + Ok(value.naive_utc()) + } else { + Err(ArrowError::CastError(format!( + "Could not cast `{micros_since_epoch}` microseconds into a NaiveDateTime" + ))) + } +} + +/// Decodes a Binary from the value section of a variant. +pub(crate) fn decode_binary(value: &[u8]) -> Result<&[u8], ArrowError> { + let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize; + let value = slice_from_slice(value, 5..5 + len)?; + Ok(value) Review Comment: Surprised clippy didn't complain about this? ```suggestion slice_from_slice(value, 5..5 + len) ``` ########## parquet-variant/src/variant.rs: ########## @@ -451,6 +486,37 @@ impl<'m, 'v> Variant<'m, 'v> { } } + pub fn as_naive_date(self) -> Option<NaiveDate> { + if let Variant::Date(d) = self { + Some(d) + } else { + None + } Review Comment: A bit more idiomatic and compact to match instead of if let? ```suggestion match self { Variant::Date(d) => Some(d), _ => None, } ``` (more below) ########## parquet-variant/src/variant.rs: ########## @@ -461,9 +527,92 @@ impl<'m, 'v> Variant<'m, 'v> { pub fn as_int8(&self) -> Option<i8> { match *self { Variant::Int8(i) => Some(i), - // TODO: Add branches for type-widening/shortening when implemting rest of primitives for int - // Variant::Int16(i) => i.try_into().ok(), - // ... + Variant::Int16(i) => i.try_into().ok(), + Variant::Int32(i) => i.try_into().ok(), Review Comment: I'm not sure what kind of casting we should automatically perform? Different callers could easily have different opinions and use cases. For example: * widening casts (e.g. 10i8 -> 10i32). Infallible. * lossless narrowing casts (e.g. 10i64 -> 10i32) * "mostly lossless" narrowing casts (e.g. decimal 10.000 -> 10i32). The numeric value doesn't change but precision information is lost (e.g. use cases that worry about significant digits treat `10`, `10.` and `10.000` very differently). * lossy narrowing casts (e.g. 3.14f32 -> 3i32) * converting casts (e.g. "10" -> 10i32) * lossy converting casts (e.g. "3.14" -> 3i32) In spark, for example, I believe the standard cast operator includes all but the last case. Other systems might be strict and only allow widening casts, or widening plus lossless narrowing casts. How do we decide where to draw the line, given that there is no universally "correct" answer? ########## parquet-variant/src/variant.rs: ########## @@ -461,9 +527,92 @@ impl<'m, 'v> Variant<'m, 'v> { pub fn as_int8(&self) -> Option<i8> { match *self { Variant::Int8(i) => Some(i), - // TODO: Add branches for type-widening/shortening when implemting rest of primitives for int - // Variant::Int16(i) => i.try_into().ok(), - // ... + Variant::Int16(i) => i.try_into().ok(), + Variant::Int32(i) => i.try_into().ok(), + Variant::Int64(i) => i.try_into().ok(), + _ => None, + } + } + + pub fn as_int16(&self) -> Option<i16> { + match *self { + Variant::Int8(i) => Some(i.into()), + Variant::Int16(i) => Some(i), + Variant::Int32(i) => i.try_into().ok(), + Variant::Int64(i) => i.try_into().ok(), + _ => None, + } + } + pub fn as_int32(&self) -> Option<i32> { + match *self { + Variant::Int8(i) => Some(i.into()), + Variant::Int16(i) => Some(i.into()), + Variant::Int32(i) => Some(i), + Variant::Int64(i) => i.try_into().ok(), + _ => None, + } + } + pub fn as_int64(&self) -> Option<i64> { + match *self { + Variant::Int8(i) => Some(i.into()), + Variant::Int16(i) => Some(i.into()), + Variant::Int32(i) => Some(i.into()), + Variant::Int64(i) => Some(i), + _ => None, + } + } + + pub fn as_decimal_int32(&self) -> Option<(i32, u8)> { Review Comment: If we're trying to match the spec wording, should this be `as_decimal4`? ########## parquet-variant/src/variant.rs: ########## @@ -483,6 +632,57 @@ impl<'m, 'v> From<i8> for Variant<'m, 'v> { } } +impl<'m, 'v> From<i16> for Variant<'m, 'v> { + fn from(value: i16) -> Self { + Variant::Int16(value) + } +} + +impl<'m, 'v> From<i32> for Variant<'m, 'v> { + fn from(value: i32) -> Self { + Variant::Int32(value) + } +} + +impl<'m, 'v> From<i64> for Variant<'m, 'v> { Review Comment: Newer versions of clippy suggest: ```suggestion impl From<i64> for Variant<'_, '_> { ``` (several of these cases) ########## parquet-variant/src/variant.rs: ########## @@ -402,11 +403,21 @@ pub enum Variant<'m, 'v> { // TODO: Add types for the rest of the primitive types, once API is agreed upon Null, Int8(i8), - + Int16(i16), + Int32(i32), + Int64(i64), + Date(NaiveDate), + TimestampMicros(DateTime<Utc>), + TimestampNTZMicros(NaiveDateTime), + Decimal4 { integer: i32, scale: u8 }, Review Comment: The spec calls the integer form of a decimal an "unscaled value." Should we try to capture that in our naming here? ########## parquet-variant/src/variant.rs: ########## @@ -423,9 +434,33 @@ impl<'m, 'v> Variant<'m, 'v> { VariantBasicType::Primitive => match get_primitive_type(header)? { Review Comment: I had wondered the same thing... but there's also an argument to be made that the header is indeed part of the variant value. A missing first byte would be a problem if we ever needed to retrieve the full value. Maybe something to track as the implementation and usage of this API matures? I don't think we have enough information to make a confident decision quite yet? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org