parthchandra commented on code in PR #1034:
URL: https://github.com/apache/datafusion-comet/pull/1034#discussion_r1824849253


##########
native/core/src/execution/shuffle/row.rs:
##########
@@ -235,6 +250,143 @@ impl SparkUnsafeRow {
         }
     }
 
+    #[allow(clippy::needless_range_loop)]
+    pub fn get_rows_from_arrays(
+        schema: Vec<ArrowDataType>,
+        arrays: Vec<ArrayRef>,
+        num_rows: usize,
+        num_cols: usize,
+        addr: usize,
+    ) {
+        let mut row_start_addr: usize = addr;
+        for i in 0..num_rows {
+            let mut row = SparkUnsafeRow::new(&schema);
+        let row_size = SparkUnsafeRow::get_row_bitset_width(schema.len()) + 8 * num_cols;

Review Comment:
   I moved it out. However, with variable-length types every row will differ in size, so this will have to come back in.
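   To illustrate why (a hypothetical sketch, not the PR's code): an `UnsafeRow` is laid out as a null bitset, then 8 bytes per field, then any variable-length bytes. The bitset and fixed part depend only on the schema, but string/binary values are appended per row, word-aligned, so the total size becomes per-row. The helper names below (`bitset_width`, `fixed_row_size`, `row_size`) are illustrative, not from the PR:

   ```rust
   /// Null bitset: one 64-bit word per 64 fields, rounded up.
   fn bitset_width(num_fields: usize) -> usize {
       ((num_fields + 63) / 64) * 8
   }

   /// Fixed-size portion shared by every row of a schema:
   /// the bitset plus one 8-byte word per field.
   fn fixed_row_size(num_fields: usize) -> usize {
       bitset_width(num_fields) + 8 * num_fields
   }

   /// Per-row size once variable-length values are counted; each value's
   /// byte length is rounded up to an 8-byte boundary (word alignment).
   fn row_size(num_fields: usize, var_len_bytes: &[usize]) -> usize {
       let var: usize = var_len_bytes.iter().map(|n| (n + 7) / 8 * 8).sum();
       fixed_row_size(num_fields) + var
   }

   fn main() {
       assert_eq!(bitset_width(3), 8); // 3 fields fit in one bitset word
       assert_eq!(fixed_row_size(3), 32); // 8 (bitset) + 3 * 8 (fields)
       // a row with one 10-byte string pays 16 extra bytes after alignment
       assert_eq!(row_size(3, &[10]), 48);
   }
   ```

   With only fixed-width columns `row_size` collapses to `fixed_row_size`, which is why the computation could be hoisted out of the loop here in the first place.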



##########
native/core/src/execution/shuffle/row.rs:
##########
@@ -235,6 +250,143 @@ impl SparkUnsafeRow {
         }
     }
 
+    #[allow(clippy::needless_range_loop)]
+    pub fn get_rows_from_arrays(
+        schema: Vec<ArrowDataType>,
+        arrays: Vec<ArrayRef>,
+        num_rows: usize,
+        num_cols: usize,
+        addr: usize,
+    ) {
+        let mut row_start_addr: usize = addr;
+        for i in 0..num_rows {
+            let mut row = SparkUnsafeRow::new(&schema);
+        let row_size = SparkUnsafeRow::get_row_bitset_width(schema.len()) + 8 * num_cols;
+            unsafe {
+                row.point_to_slice(std::slice::from_raw_parts(
+                    row_start_addr as *const u8,
+                    row_size,
+                ));
+            }
+            row_start_addr += row_size;
+            for j in 0..num_cols {
+                let arr = arrays.get(j).unwrap();
+                let dt = &schema[j];
+                // assert_eq!(dt, arr.data_type());
+                match dt {

Review Comment:
   This inner loop is the performance bottleneck. I moved the null check out of 
the match, which made a negligible difference. I also spent a lot of time trying 
to eliminate the datatype check for every row and column, but couldn't get it 
past the compiler. Essentially, for every column I need an `ArrayAccessor` trait 
object that I can save in a vector, so that I can call its `value(i)` (or better 
still, `value_unchecked(i)`) method. Even if the compiler allowed it, every call 
to `value` would then use dynamic dispatch and would most likely be slower than 
the match on datatype. 
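   For reference, here is a toy sketch of that trade-off (hypothetical code, not the PR's; the `Column` enum stands in for Arrow arrays and the closure vector stands in for a `Vec` of `ArrayAccessor` trait objects). The datatype match runs once per column when the closures are built, instead of once per (row, column) pair, but every per-cell read then goes through a virtual call:

   ```rust
   // Toy stand-in for concrete Arrow array types.
   enum Column {
       Int64(Vec<i64>),
       Float64(Vec<f64>),
   }

   /// Build one reader closure per column; the type match is hoisted out
   /// of the hot loop, at the price of dynamic dispatch per value.
   fn make_readers<'a>(cols: &'a [Column]) -> Vec<Box<dyn Fn(usize) -> i64 + 'a>> {
       cols.iter()
           .map(|c| {
               let b: Box<dyn Fn(usize) -> i64 + 'a> = match c {
                   Column::Int64(v) => Box::new(move |i| v[i]),
                   // store the f64 bit pattern, as an UnsafeRow word would
                   Column::Float64(v) => Box::new(move |i| v[i].to_bits() as i64),
               };
               b
           })
           .collect()
   }

   fn main() {
       let cols = vec![
           Column::Int64(vec![1, 2, 3]),
           Column::Float64(vec![0.5, 1.5, 2.5]),
       ];
       let readers = make_readers(&cols);
       // inner loop: no datatype match, but one virtual call per cell
       let row1: Vec<i64> = readers.iter().map(|r| r(1)).collect();
       assert_eq!(row1[0], 2);
       assert_eq!(row1[1], 1.5f64.to_bits() as i64);
   }
   ```

   Whether the virtual call beats the match depends on how well the branch predictor handles the per-cell match, which is why measuring, as done here, is the right call.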



##########
native/core/src/execution/shuffle/row.rs:
##########
@@ -271,6 +423,51 @@ impl SparkUnsafeRow {
             *word_offset = word & !mask;
         }
     }
+
+    #[inline]
+    pub fn set_null_at(&mut self, index: usize) {
+        unsafe {
+            let mask: i64 = 1i64 << (index & 0x3f);
+            let word_offset = (self.row_addr + (((index >> 6) as i64) << 3)) as *mut i64;
+            let word: i64 = *word_offset;
+            *word_offset = word | mask;
+        }
+    }
+
+    #[inline]
+    pub fn set_boolean(&mut self, index: usize, value: bool) {

Review Comment:
   I can create a PR, but I don't see much use in doing so. Creating an 
`UnsafeRow` in native is looking like the slower approach in general, so it is 
best avoided.
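   For anyone reading along, the bit arithmetic in `set_null_at` above can be sketched safely (hypothetical code over a `Vec<i64>` rather than a raw row address): field `index` lives in word `index >> 6`, at bit `index & 0x3f` within that word; the `<< 3` in the pointer version just converts the word index to a byte offset.

   ```rust
   /// Toy null bitset mirroring the UnsafeRow layout: one i64 word
   /// per 64 fields.
   struct NullBitset {
       words: Vec<i64>,
   }

   impl NullBitset {
       fn new(num_fields: usize) -> Self {
           NullBitset { words: vec![0; (num_fields + 63) / 64] }
       }

       /// Set the bit for `index`, marking the field null.
       fn set_null_at(&mut self, index: usize) {
           let mask = 1i64 << (index & 0x3f);
           self.words[index >> 6] |= mask;
       }

       /// Clear the bit for `index`, marking the field non-null.
       fn set_not_null_at(&mut self, index: usize) {
           let mask = 1i64 << (index & 0x3f);
           self.words[index >> 6] &= !mask;
       }

       fn is_null_at(&self, index: usize) -> bool {
           self.words[index >> 6] & (1i64 << (index & 0x3f)) != 0
       }
   }

   fn main() {
       let mut bits = NullBitset::new(70); // needs two 64-bit words
       bits.set_null_at(3);
       bits.set_null_at(65); // bit 1 of the second word
       assert!(bits.is_null_at(3));
       assert!(bits.is_null_at(65));
       bits.set_not_null_at(3);
       assert!(!bits.is_null_at(3));
   }
   ```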



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org