jayzhan211 commented on issue #11513:
URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2247822197
> > But we actually have List(UInt8) inside arrow::Schema, so we can decode the type with List(UInt8). Therefore, we have UserDefinedType(Utf8) in the logical planning step, and List(UInt8) in the physical planning step
>
> Makes sense and I like this approach. The only thing I'm still not understanding is how a type source (like a TableProvider) would communicate to the logical layer that the `List(UInt8)` is actually a logical `Utf8`.
>
> > The most important thing is that we could keep DFSchema mostly the same as it is now. But we introduce a bunch of new LogicalType variants and a trait for UserDefinedType built around it.
>
> This would be tremendously beneficial to this proposal :)

I think we could register the type mapping in the session state, the same way
we register functions. We only define the relation from an Arrow DataType to a
LogicalType. We can then look up the registered relations and figure out that
the logical type of `List(UInt8)` is `Utf8`.

```rust
pub enum LogicalType {
    Utf8,
    Float,
    FixedSizeList(Box<LogicalType>, usize),
}

pub trait TypeRelation {
    fn as_any(&self) -> &dyn Any {
        self
    }

    // Maps an Arrow DataType to its logical type; not implemented by default.
    fn get_logical_type(&self, data_type: &DataType) -> Result<LogicalType> {
        _not_impl_err!("nyi")
    }
}

pub struct ListOfU8AsStringType {}

impl TypeRelation for ListOfU8AsStringType {
    fn get_logical_type(&self, data_type: &DataType) -> Result<LogicalType> {
        match data_type {
            // A list of bytes is treated as a logical string.
            DataType::List(field) if field.data_type() == &DataType::UInt8 => {
                Ok(LogicalType::Utf8)
            }
            _ => _not_impl_err!("nyi"),
        }
    }
}

pub struct GeoType {}

impl TypeRelation for GeoType {
    fn get_logical_type(&self, data_type: &DataType) -> Result<LogicalType> {
        match data_type {
            DataType::FixedSizeList(field, 2) if field.data_type() == &DataType::Float32 => {
                Ok(LogicalType::FixedSizeList(Box::new(LogicalType::Float), 2))
            }
            _ => _not_impl_err!("nyi"),
        }
    }
}

pub struct DatafusionBuiltinType {}

impl TypeRelation for DatafusionBuiltinType {
    fn get_logical_type(&self, data_type: &DataType) -> Result<LogicalType> {
        match data_type {
            DataType::Utf8View => Ok(LogicalType::Utf8),
            _ => _not_impl_err!("nyi"),
        }
    }
}

// Ideally, functions in the logical layer care about LogicalType only, and
// functions in the physical layer care about the Arrow DataType only.
fn any_function(udt: Arc<dyn TypeRelation>, data_type: DataType) -> Result<()> {
    let logical_type = udt.get_logical_type(&data_type)?;
    // Dispatch on the logical type, not on the Arrow type.
    match logical_type {
        LogicalType::FixedSizeList(inner_type, size) => Ok(()),
        _ => todo!(),
    }
}

impl SessionState {
    fn register_type_relation(&self) -> Result<()> {
        // Similar to registering a function, so we can have user-defined
        // type relations (mappings).
        self.register(DatafusionBuiltinType {});
        self.register(ListOfU8AsStringType {});
        self.register(GeoType {});
        Ok(())
    }
}

impl TableProvider for MyTable {
    fn some_func(&self, state: CatalogSessions) -> Result<()> {
        let udt: Arc<dyn TypeRelation> = self.get_type_relation();
        let dfschema = DFSchema::empty();
        // Placeholder: some expression whose type we want to resolve.
        let expr: Expr;
        let data_type: arrow::DataType = expr.get_type(&dfschema)?;
        any_function(udt, data_type)
    }
}
```
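
To make the lookup step a bit more concrete, here is a minimal, self-contained sketch of how a session-level registry could resolve an Arrow type to a logical type. Everything in it is an assumption for illustration: `DataType` is a reduced stand-in for the Arrow enum, `TypeRegistry`, `register`, `resolve`, `BuiltinType`, and `example_registry` are hypothetical names rather than existing DataFusion APIs, and the relation returns `Option` instead of `Result` so that "not my type" can fall through to the next registered relation.

```rust
use std::sync::Arc;

// Reduced stand-in for arrow's DataType (assumption: only what the sketch needs).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    UInt8,
    Float32,
    Utf8,
    Utf8View,
    List(Box<DataType>),
    FixedSizeList(Box<DataType>, usize),
}

#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    Utf8,
    Float,
    FixedSizeList(Box<LogicalType>, usize),
}

// Returns None when the relation does not recognize the physical type.
pub trait TypeRelation {
    fn get_logical_type(&self, data_type: &DataType) -> Option<LogicalType>;
}

pub struct BuiltinType;
impl TypeRelation for BuiltinType {
    fn get_logical_type(&self, data_type: &DataType) -> Option<LogicalType> {
        match data_type {
            DataType::Utf8 | DataType::Utf8View => Some(LogicalType::Utf8),
            DataType::Float32 => Some(LogicalType::Float),
            _ => None,
        }
    }
}

pub struct ListOfU8AsStringType;
impl TypeRelation for ListOfU8AsStringType {
    fn get_logical_type(&self, data_type: &DataType) -> Option<LogicalType> {
        match data_type {
            DataType::List(inner) if **inner == DataType::UInt8 => Some(LogicalType::Utf8),
            _ => None,
        }
    }
}

pub struct GeoType;
impl TypeRelation for GeoType {
    fn get_logical_type(&self, data_type: &DataType) -> Option<LogicalType> {
        match data_type {
            DataType::FixedSizeList(inner, 2) if **inner == DataType::Float32 => {
                Some(LogicalType::FixedSizeList(Box::new(LogicalType::Float), 2))
            }
            _ => None,
        }
    }
}

// Hypothetical registry that the session state could own.
#[derive(Default)]
pub struct TypeRegistry {
    relations: Vec<Arc<dyn TypeRelation>>,
}

impl TypeRegistry {
    pub fn register(&mut self, relation: Arc<dyn TypeRelation>) {
        self.relations.push(relation);
    }

    // The first registered relation that recognizes the type wins.
    pub fn resolve(&self, data_type: &DataType) -> Option<LogicalType> {
        self.relations
            .iter()
            .find_map(|relation| relation.get_logical_type(data_type))
    }
}

// Builds a registry with the three relations from the comment above.
pub fn example_registry() -> TypeRegistry {
    let mut registry = TypeRegistry::default();
    registry.register(Arc::new(BuiltinType));
    registry.register(Arc::new(ListOfU8AsStringType));
    registry.register(Arc::new(GeoType));
    registry
}
```

With this shape, `example_registry().resolve(&DataType::List(Box::new(DataType::UInt8)))` yields the logical `Utf8`, while an unregistered type such as `List(Float32)` resolves to `None`. Registration order matters in this sketch; a real design would probably need an explicit rule for conflicting relations.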
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]