jorgecarleitao commented on a change in pull request #9469:
URL: https://github.com/apache/arrow/pull/9469#discussion_r574865169



##########
File path: rust/arrow/src/array/array.rs
##########
@@ -326,6 +327,167 @@ pub fn new_empty_array(data_type: &DataType) -> ArrayRef {
     let data = ArrayData::new_empty(data_type);
     make_array(Arc::new(data))
 }
+/// Creates a new array of `data_type` of length `length` filled entirely of 
`NULL` values
+pub fn new_null_array(data_type: &DataType, length: usize) -> ArrayRef {
+    match data_type {
+        DataType::Null => Arc::new(NullArray::new(length)),
+        DataType::Boolean => {
+            let null_buf: Buffer = MutableBuffer::new_null(length).into();
+            make_array(Arc::new(ArrayData::new(
+                data_type.clone(),
+                length,
+                Some(length),
+                Some(null_buf.clone()),
+                0,
+                vec![null_buf],
+                vec![],
+            )))
+        }
+        DataType::Int8 | DataType::UInt8 => {
+            new_null_sized_array::<Int8Type>(data_type, length)
+        }
+        DataType::Int16 | DataType::UInt16 => {
+            new_null_sized_array::<Int16Type>(data_type, length)
+        }
+        DataType::Float16 => unreachable!(),
+        DataType::Int32
+        | DataType::UInt32

Review comment:
       FWIW, imo we have been doing this incorrectly all along. We currently 
have one `PrimitiveArray` per logical type (e.g. `Date32`, `Int32`) because 
there is a one-to-one correspondence between `ArrowPrimitiveType` and 
`DataType`. This implies that we will never be able to support `Timestamp(_,X)` 
with `PrimitiveArray`, as that type has many variations (one per time zone).
   
   Having one array type per logical type also makes the whole code less 
maintainable, because we have to downcast for every variation of the 
`DataType`, instead of for every variation of the underlying physical type.
   
   IMO we should have 1 `PrimitiveArray` per _physical_ type. I.e. 
`PrimitiveArray<i32>`, `PrimitiveArray<i64>`, etc. The `DataType` just denotes 
how the physical values are to be interpreted (logical), not their layout in 
memory nor any safety considerations (e.g. `&str` vs `&[u8]`).
   
   More generally, a generic parameter `T` represents a variation of the 
underlying physical / memory layout, and we have been using `PrimitiveArray<T>` 
to represent the same layout but different logical representations.
   
   This is one of the main changes in the proposal I wrote, on which a 
primitive array is defined by its physical representation, 
`PrimitiveArray<i64>` and its logical representation (`DataType`), on which the 
datatype is a many to one to a physical representation (e.g. `Date32`, `Time32` 
and `Int32` all map to `i32`).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to