nevi-me commented on a change in pull request #8170:
URL: https://github.com/apache/arrow/pull/8170#discussion_r489763949



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -425,22 +514,32 @@ mod tests {
     #[test]
     fn test_take_string() {
         let index = UInt32Array::from(vec![Some(3), None, Some(1), Some(3), Some(4)]);
-        let mut builder: StringBuilder = StringBuilder::new(6);
-        builder.append_value("one").unwrap();
-        builder.append_null().unwrap();
-        builder.append_value("three").unwrap();
-        builder.append_value("four").unwrap();
-        builder.append_value("five").unwrap();
-        let array = Arc::new(builder.finish()) as ArrayRef;
-        let a = take(&array, &index, None).unwrap();
-        assert_eq!(a.len(), index.len());
-        builder.append_value("four").unwrap();
-        builder.append_null().unwrap();
-        builder.append_null().unwrap();
-        builder.append_value("four").unwrap();
-        builder.append_value("five").unwrap();
-        let b = builder.finish();
-        assert_eq!(a.data(), b.data());
+
+        let array = StringArray::from(vec![
+            Some("one"),
+            None,
+            Some("three"),
+            Some("four"),
+            Some("five"),
+        ]);
+        let array = Arc::new(array) as ArrayRef;
+
+        let actual = take(&array, &index, None).unwrap();
+        assert_eq!(actual.len(), index.len());
+
+        let actual = actual.as_any().downcast_ref::<StringArray>().unwrap();
+
+        let expected =
+            StringArray::from(vec![Some("four"), None, None, Some("four"), Some("five")]);
+
+        for i in 0..index.len() {

Review comment:
       I got `[25, 0, 0, 0, 0]` instead of `[25, 0, 0, 0]`. Changing the string 
take to the implementation in this commit 
(https://github.com/jorgecarleitao/arrow/pull/7/commits/02eeb1a01906d1649fb00e723770dabd91874fd3) 
seems to fix the issue.
   
   This surfaces a soundness hole in how we create `StringArray` and 
`LargeStringArray` from an `ArrayRef`. When creating these two arrays, we only 
check that we have the right number of buffers; we do not check the datatype, 
or even whether the offsets are the correct size (32 vs 64 bit). If you look at 
`make_array`, you can see that if we pass it `DataType::Utf8` when we should be 
passing `DataType::LargeUtf8`, it will happily create a `StringArray` where we 
want a `LargeStringArray`.
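   
   To make the failure mode concrete, here is a minimal sketch that uses only the standard library, not the arrow crate. `DataType`, `ArrayData`, and every function below are simplified, hypothetical stand-ins for illustration; they are not the real Arrow types or the actual `make_array` code. The sketch assumes offsets are stored as little-endian bytes, 8 per offset for `LargeUtf8` and 4 per offset for `Utf8`.

```rust
/// Hypothetical stand-in for Arrow's `DataType`, reduced to the two string types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,      // 32-bit offsets
    LargeUtf8, // 64-bit offsets
}

/// Hypothetical stand-in for array data: a declared type plus the raw offsets buffer.
struct ArrayData {
    data_type: DataType,
    offsets: Vec<u8>,
}

/// Encode offsets as little-endian i32 bytes (assumed `Utf8` layout).
fn small_offsets(offsets: &[i32]) -> Vec<u8> {
    offsets.iter().flat_map(|o| o.to_le_bytes().to_vec()).collect()
}

/// Encode offsets as little-endian i64 bytes (assumed `LargeUtf8` layout).
fn large_offsets(offsets: &[i64]) -> Vec<u8> {
    offsets.iter().flat_map(|o| o.to_le_bytes().to_vec()).collect()
}

/// Unchecked read: interprets the buffer as i32 offsets no matter what
/// `data_type` says. This mirrors the missing validation described above.
fn read_offsets_as_i32(data: &ArrayData) -> Vec<i32> {
    data.offsets
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Checked read: rejects data whose declared type or buffer size does not
/// match 32-bit offsets before interpreting any bytes.
fn checked_offsets_as_i32(data: &ArrayData) -> Result<Vec<i32>, String> {
    if data.data_type != DataType::Utf8 {
        return Err(format!("expected Utf8, got {:?}", data.data_type));
    }
    if data.offsets.len() % 4 != 0 {
        return Err("offsets buffer is not a whole number of i32 values".to_string());
    }
    Ok(read_offsets_as_i32(data))
}
```

   With `LargeUtf8` data whose offsets are `[0, 3, 3, 8]`, the unchecked read yields eight interleaved values (`[0, 0, 3, 0, 3, 0, 8, 0]`) instead of four, while the checked read returns an `Err`. A check of this kind at construction time is one possible way to close the hole; the actual fix may look different.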
   
   I'll open a JIRA for this issue.





