jecsand838 commented on code in PR #8433:
URL: https://github.com/apache/arrow-rs/pull/8433#discussion_r2376565558


##########
arrow-avro/src/codec.rs:
##########
@@ -685,6 +686,8 @@ pub enum Codec {
     Interval,
     /// Represents Avro union type, maps to Arrow's Union data type
     Union(Arc<[AvroDataType]>, UnionFields, UnionMode),
+    /// Represents an Avro long with an `arrowDurationUnit` metadata property. 
Maps to Arrow's Duration(TimeUnit) data type.
+    Duration(TimeUnit),

Review Comment:
   I actually think there's a different angle to view this from. 
   
   ### From the Avro spec:
   
   > The duration logical type represents an amount of time defined by a number 
of **months, days and milliseconds**. This is not equivalent to a number of 
milliseconds, because, depending on the moment in time from which the duration 
is measured, the number of days in the month and number of milliseconds in a 
day may differ. Other standard periods such as years, quarters, hours and 
minutes can be expressed through these basic periods.
   > 
   > A duration logical type annotates Avro fixed type of size 12, which stores 
three little-endian unsigned integers that represent durations at different 
granularities of time. The first stores a number in months, the second stores a 
number in days, and the third stores a number in milliseconds.
   
   ### From Arrow DataTypes
   
   ```rust
       /// Measure of elapsed time in either seconds, milliseconds, 
microseconds or nanoseconds.
       Duration(TimeUnit),
       /// A "calendar" interval which models types that don't necessarily
       /// have a precise duration without the context of a base timestamp (e.g.
       /// days can differ in length during day light savings time transitions).
       Interval(IntervalUnit),
   ```
   
   ---
   
   I wonder if the approach to take here is mapping an Avro `Duration` to 
either `DataType::Duration(TimeUnit)` or `DataType::Interval(IntervalUnit)` 
based on it's properties and maybe the presence of a metadata field.
   
   Perhaps it would look something like this?
   
   ```suggestion
       Interval(TimeUnit, IntervalUnit),
       /// Represents Avro union type, maps to Arrow's Union data type
       Union(Arc<[AvroDataType]>, UnionFields, UnionMode),
   ```



##########
arrow-avro/src/reader/mod.rs:
##########
@@ -4747,6 +4749,52 @@ mod test {
         }
     }
 
+    #[test]
+    fn test_long_with_duration_annotation() {
+        let _avro_schema_json = r#"
+        {
+            "type": "record",
+            "name": "TestEvents",
+            "fields": [
+                {
+                    "name": "event_duration",
+                    "type": "long",
+                    "arrowDurationUnit": "millisecond"
+                }
+            ]
+        }
+        "#;
+
+        let values = DurationMillisecondArray::from(vec![1000i64, 2000, 3000]);
+        let arrow_field = Field::new(
+            "event_duration",
+            DataType::Duration(TimeUnit::Millisecond),
+            false,
+        );
+        let arrow_schema = Arc::new(Schema::new(vec![arrow_field]));
+        let expected = RecordBatch::try_new(
+            arrow_schema.clone(),
+            vec![Arc::new(values.clone()) as ArrayRef],
+        )
+        .unwrap();
+
+        // Write to in-memory OCF using the Avro writer
+        let buffer = Vec::<u8>::new();
+        let mut writer = AvroWriter::new(buffer, 
(*arrow_schema).clone()).unwrap();
+        writer.write(&expected).unwrap();
+        writer.finish().unwrap();
+        let bytes = writer.into_inner();
+
+        let mut reader = ReaderBuilder::new()
+            .build(Cursor::new(bytes))
+            .expect("build reader for in-memory OCF");
+        let out_schema = reader.schema();
+        let batches = reader.collect::<Result<Vec<_>, _>>().unwrap();
+        let actual = arrow::compute::concat_batches(&out_schema, 
&batches).unwrap();
+
+        assert_eq!(actual, expected);
+    }

Review Comment:
   I'm thinking we'll need to create a new test Avro file containing all the 
different Duration scenarios for the decoder's E2E tests imo. 
   
   You can follow the approach I've used for the others.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to