jorgecarleitao opened a new pull request #9506:
URL: https://github.com/apache/arrow/pull/9506
Be warned, this is one of the largest PRs I am proposing thus far. I am
placing here as a placeholder as it significantly affects many.
This is WIP and the whole justification is found below. The code does not
compile as there are many changes needed for this to work. I also temporarily
removed some code until I can place it again.
# Background
Most formats have two main "planes", the logical and the physical, that
represent two fundamentally different aspects of a format. Typically, a
physical type, such as `i64`, represents many logical types (`Int64`, `Date64`,
`Timestamp()`, etc.), and a logical type is uniquely represented by a physical
type.
From the memory's perspective, the physical plane is the only plane that
matters. From the representations' point of view (e.g. as a `string`), the
logical plane matters. Allocate, transmute, the exact reg call all pertain to
the physical plane. Operations such as `toString`, which physical operation `a
+ b` represents, etc, pertain to the logical plane.
Generics are Rust's mechanism to write less code that relies on different
physical representations.
Unfortunately, our dear arrow crate currently uses the type `trait
ArrowPrimitiveType` to differentiate _logical_ representations.
This trait is implemented for zero-sized empty `struct`s that contain both
the logical and physical representation. For example,
```rust
#[derive(Debug)]
pub struct Date32 {}
impl ArrowPrimitiveType for Date32 {
type Native = i32;
const DATA_TYPE: DataType = DataType::Date32;
}
```
We have such structs for 25 logical types (e.g. `Date32`,
`Timestamp(TimestampMillisecondType, _)`), and these are used to connect a
logical type with a physical type.
However, this has many undesirable consequences, the two most notable ones
being:
1. we can't write generics that are valid for different logical types with
the same physical representation.
2. we do not know what to rely on to get an array's logical data type
### 1.
In almost all kernels that support a Primitive type, we write something like
the following:
```rust
match data_type {
...
DataType::Int32 => downcast_take!(Int32Type, values, indices),
DataType::Date32 => downcast_take!(Date32Type, values, indices),
DataType::Date64 => downcast_take!(Date64Type, values, indices),
DataType::Time32(Second) => downcast_take!(Time32SecondType, values,
indices),
DataType::Time32(Millisecond) =>
downcast_take!(Time32MillisecondType, values, indices),
DataType::Time64(Microsecond) =>
downcast_take!(Time64MillisecondType, values, indices),
DataType::Timestamp(Millisecond, _) =>
downcast_take!(TimestampMillisecondType, values, indices),
...
}
```
Note how there are two physical representations of all the types above:
`i32` and `i64`. I.e. from the physical representations' point of view, there
are only two branches there.
Note also that for `Timestamp`, we don't support timestamps with timezones,
as they do not have a corresponding Rust type.
What happens beneath the `downcast_take` and many other downcasts is that we
downcast to the logical type (e.g. `Date32`, and then perform the operation on
it using the same physical type as any other physical type.
The main problem with these is that there is a proliferation of match cases
consequent of us using Rust's physical type system to represent logical types.
This becomes really painful in any kernel that matches two DataTypes, as we
must outline `NxN` of the valid types. The most dramatic example is `cast`, but
any sufficiently generic kernel suffers from this.
### 2.
Currently, we have two ways of getting an arrays' datatype inside a generic
that consumes `array: PrimitiveArray<T: ArrowPrimitiveType>`: `T::DATA_TYPE`
and `array.data_type`. The core issue is that because `DataType` is a _value_,
not a type (because it is only known at runtime), there can be a mismatch
between them. We even abuse it to DRY with things like
```rust
DataType::Timestamp(_, _) => downcast_something!(Int64Type,
data_type)
```
because we know that the physical representation is correct and thus do not
bother writing all cases.
The problem with all of this is that because `ArrayData::data_type()` is
used to transmute byte buffers, e.g. whenever we pass it to `make_array()` to
create an `Arc<dyn Array>`, we expose ourselves to a large risk of unsound code.
# This PR
This PR proposes that we remove `ArrowPrimitiveArray` altogether and have as
the only types passed to `PrimitiveArray<T>` rust's primitive types, such as
`i32`, that implement `ArrowNativeType`.
This way, we stop having to downcast every single logical type, and instead
only have to downcast physical types. This also makes it obvious that `<T>`
corresponds solely to a physical type, and that logical types are purely
decided by `ArrayData::data_type()` at runtime.
This will also allow us to more easily perform runtime checks to whether the
DataType passed to a physical type is correct, which will allow us to write
`unsafe` free Arrays (more details
[here](https://github.com/jorgecarleitao/arrow2/blob/57a129dd8778a4a4bf3d949840aa5fe278f5af4d/README.md)).
## backward incompatibility
There are two main consequences of this change:
## collect needs re-write
We will no longer be able to write
```rust
let array = iter.collect::<PrimitiveArray<Int32Type>>();
```
This is because `Int32Type` will no longer exist and an iterator of
`Option<T>` no longer has a Logical type associated to it.
My proposal to address this is to use a transient struct for this, that I
already concluded addresses it. The corresponding code for the above would be
```rust
let array = iter.collect::<Primitive<i32>>().to(DataType::Int32);
```
Here, `Primitive<i32>` is a struct with a `validity` and `values` buffer
but without a `Datatype`, and `.to` converts it to a `PrimitiveArray<i32>`
(physical type) with a logical type `DataType::Int32` (that panics if the
`DataType` does not fit the physical type).
This has a bit more characters to write. OTOH, it enables full support for
`Timestamp` with timezones, as they can be created via
```rust
let array = iter.collect::<Primitive<i64>>().to(DataType::Timestamp(...));
```
and thus all our iterators' APIs start to work with it out of the box (they
currently do not).
## match cases can be trimmed down
All match cases whose physical representation is the same and whose logical
representation induces no semantic change to the kernel can be collapse in a
single case. They also become a bit easier to understand:
```rust
DataType::Int32 | DataType::Date32 | DataType::Time32(_) =>
downcast_take!(i32, values, indices),
DataType::Int64 | DataType::Date64 | DataType::Time64(_) |
DataType::Timestamp(_) => downcast_take!(i64, values, indices),
```
for kernels whose logic depends on the logical type, they are also greatly
simplified. For example, what our current master currently accomplishes in 16
cases (we use an `unsafe` trick to reduce it down to 3 cases that does not work
if we would do the proper safety checks) can be accomplished as follows:
```rust
(Timestamp(from_unit, None), Timestamp(to_unit, None)) => {
let array = array
.as_any()
.downcast_ref::<PrimitiveArray<i64>>()
.unwrap();
let from_size = time_unit_multiple(&from_unit);
let to_size = time_unit_multiple(&to_unit);
// we either divide or multiply, depending on size of each unit
let array = if from_size >= to_size {
unary::<_, _, i64>(&array, |x| (x / (from_size / to_size)),
to_type)
} else {
unary::<_, _, i64>(&array, |x| (x * (to_size / from_size)),
to_type)
};
Ok(Arc::new(array))
}
```
Note how because the physical type is the same for all logical types, we
perform 1 downcast to `i64` (physical representation), and handle logical
variations at runtime (the factor to multiply/divide).
## Replace `IntXXType` by `iX`, `UIntXXType` by `uX`
With this change, we no longer need Rust types to represent array types. We
may want to keep `pub type Int32Type = i32;` for the sake of a smaller number
of changes on downstream dependencies, but the only downcasts only happen when
there is a different physical, not logical, representation of the data.
This would significantly reduce the number of branches in all DataFusion's
kernels, as well as reduce the risk of errors and unsound code resulting from
those errors.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]