SemyonSinchenko commented on code in PR #1073:
URL: https://github.com/apache/datafusion-comet/pull/1073#discussion_r1851687830
##########
native/spark-expr/src/list.rs:
##########

@@ -413,14 +426,297 @@ impl PartialEq<dyn Any> for GetArrayStructFields {
     }
 }
 
+#[derive(Debug, Hash)]
+pub struct ArrayInsert {
+    src_array_expr: Arc<dyn PhysicalExpr>,
+    pos_expr: Arc<dyn PhysicalExpr>,
+    item_expr: Arc<dyn PhysicalExpr>,
+    legacy_negative_index: bool,
+}
+
+impl ArrayInsert {
+    pub fn new(
+        src_array_expr: Arc<dyn PhysicalExpr>,
+        pos_expr: Arc<dyn PhysicalExpr>,
+        item_expr: Arc<dyn PhysicalExpr>,
+        legacy_negative_index: bool,
+    ) -> Self {
+        Self {
+            src_array_expr,
+            pos_expr,
+            item_expr,
+            legacy_negative_index,
+        }
+    }
+}
+
+impl PhysicalExpr for ArrayInsert {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn data_type(&self, input_schema: &Schema) -> DataFusionResult<DataType> {
+        match self.src_array_expr.data_type(input_schema)? {
+            DataType::List(field) => Ok(DataType::List(field)),
+            DataType::LargeList(field) => Ok(DataType::LargeList(field)),
+            data_type => Err(DataFusionError::Internal(format!(
+                "Unexpected data type in ArrayInsert: {:?}",
+                data_type
+            ))),
+        }
+    }
+
+    fn nullable(&self, input_schema: &Schema) -> DataFusionResult<bool> {
+        self.src_array_expr.nullable(input_schema)
+    }
+
+    fn evaluate(&self, batch: &RecordBatch) -> DataFusionResult<ColumnarValue> {
+        let pos_value = self
+            .pos_expr
+            .evaluate(batch)?
+            .into_array(batch.num_rows())?;
+
+        // Spark supports only IntegerType (Int32):
+        // https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L4737
+        if !matches!(pos_value.data_type(), DataType::Int32) {
+            return Err(DataFusionError::Internal(format!(
+                "Unexpected index data type in ArrayInsert: {:?}, expected type is Int32",
+                pos_value.data_type()
+            )));
+        }
+
+        // Check that the src array is actually an array and get its value type
+        let src_value = self
+            .src_array_expr
+            .evaluate(batch)?
+            .into_array(batch.num_rows())?;
+        let src_element_type = match src_value.data_type() {
+            DataType::List(field) => field.data_type(),
+            DataType::LargeList(field) => field.data_type(),
+            data_type => {
+                return Err(DataFusionError::Internal(format!(
+                    "Unexpected src array type in ArrayInsert: {:?}",
+                    data_type
+                )))
+            }

Review Comment:
@andygrove Thanks for the suggestion! I moved the array type check (and the exception logic) into a method:
```rs
pub fn array_type(&self, data_type: &DataType) -> DataFusionResult<DataType> {
    match data_type {
        DataType::List(field) => Ok(DataType::List(Arc::clone(field))),
        DataType::LargeList(field) => Ok(DataType::LargeList(Arc::clone(field))),
        data_type => Err(DataFusionError::Internal(format!(
            "Unexpected src array type in ArrayInsert: {:?}",
            data_type
        ))),
    }
}
```
At the very least this avoids constructing the same error in several places. Is this what you suggested? Or should I turn it into a free helper function and refactor `GetArrayStructFields` to use it as well? (A rough sketch of what I mean is below.)

P.S. Sorry if this is a naive question, but can you explain why we always check both `List` and `LargeList`? Apache Spark only supports `i32` indexes for arrays (the maximum length is `Integer.MAX_VALUE - 15`), which to my understanding corresponds to `List`. All the code in `list.rs` might become a bit simpler if we made it non-generic over the offset type (it would also simplify implementing other missing methods like `array_zip`); see the toy example below.
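To make the free-helper option concrete, here is a rough sketch of what I have in mind. The name `check_array_type` and the `expr_name` parameter are placeholders I made up for illustration, not existing code in `list.rs`:
```rs
use arrow::datatypes::DataType;
use datafusion::error::{DataFusionError, Result as DataFusionResult};

// Hypothetical shared helper that both `ArrayInsert` and `GetArrayStructFields`
// could call instead of each constructing the same `Internal` error on its own.
pub fn check_array_type(expr_name: &str, data_type: &DataType) -> DataFusionResult<DataType> {
    match data_type {
        DataType::List(_) | DataType::LargeList(_) => Ok(data_type.clone()),
        other => Err(DataFusionError::Internal(format!(
            "Unexpected src array type in {expr_name}: {other:?}"
        ))),
    }
}
```
With that, `data_type` for both expressions would reduce to something like `check_array_type("ArrayInsert", &self.src_array_expr.data_type(input_schema)?)`.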
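And for the P.S., a toy example of the offset-width point, written against arrow-rs directly just to illustrate (nothing here is Comet code): `ListArray` and `LargeListArray` hold the same logical values and differ only in whether the offsets are `i32` or `i64`.
```rs
use arrow::array::{LargeListArray, ListArray};
use arrow::datatypes::Int32Type;

fn main() {
    let rows = vec![Some(vec![Some(1), Some(2)]), None, Some(vec![Some(3)])];
    // Same logical values, two physical layouts:
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(rows.clone());
    let large = LargeListArray::from_iter_primitive::<Int32Type, _, _>(rows);
    // The only difference is the offset width, i.e. the maximum total number
    // of child values; i32::MAX is already above Spark's array size limit.
    let _: &[i32] = list.value_offsets();
    let _: &[i64] = large.value_offsets();
}
```
Since Spark itself can never produce an array longer than `Integer.MAX_VALUE - 15`, I don't see where a `LargeList` input would come from, hence the question.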