paleolimbot commented on PR #254:
URL: https://github.com/apache/sedona-db/pull/254#issuecomment-3473335019
> verified that without the "fix" in this PR, they will fail with the same
> errors we see earlier.
I may be misunderstanding the tests you added, but they seem to be passing?
I tried to reproduce in Python and I see that metadata makes its way through to
the end:
```python
import sedona.db
sd = sedona.db.connect()
sd.read_parquet(
    "submodules/geoarrow-data/ns-water/files/ns-water_water-point_geo.parquet"
).to_view("tbl")
aggr = sd.sql(
    """SELECT ST_Envelope_Aggr(geometry) as env FROM tbl GROUP BY "FEAT_CODE" """
)
aggr.schema
#> SedonaSchema with 1 field:
#> st_envelope_aggr(tbl.geometry): geometry<Wkb({...})>
aggr.to_arrow_table().schema
#> st_envelope_aggr(tbl.geometry): extension<geoarrow.wkb<WkbType>>
```
> The real fix is to make DataFusion's physical aggregate use the logical
> plan's field (with metadata) instead of creating a new one.
Let's fix DataFusion if it is a DataFusion issue. In the meantime, you
should be able to use this branch as a workaround? The reproducer I use for
DataFusion issues is:
<details>
```rust
use std::collections::HashMap;

use datafusion::{
    arrow::datatypes::DataType,
    logical_expr::{ScalarUDFImpl, Signature, Volatility},
    prelude::*,
};

#[tokio::main]
async fn main() {
    let ctx = SessionContext::new();
    ctx.register_udf(MakeExtension::default().into());

    let batches = ctx
        .sql("SELECT make_extension('foofy zero') IS NULL as is_null")
        .await
        .unwrap()
        .collect()
        .await
        .unwrap();
    println!("Regular select:");
    println!("{:?}", batches[0].schema().field(0));
}

#[derive(Debug)]
struct MakeExtension {
    signature: Signature,
}

impl Default for MakeExtension {
    fn default() -> Self {
        Self {
            signature: Signature::user_defined(Volatility::Immutable),
        }
    }
}

impl ScalarUDFImpl for MakeExtension {
    fn as_any(&self) -> &dyn std::any::Any {
        self
    }

    fn name(&self) -> &str {
        "make_extension"
    }

    fn signature(&self) -> &Signature {
        &self.signature
    }

    fn coerce_types(&self, arg_types: &[DataType]) -> datafusion::error::Result<Vec<DataType>> {
        Ok(arg_types.to_vec())
    }

    fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
        unreachable!("This shouldn't have been called")
    }

    fn return_field_from_args(
        &self,
        args: datafusion::logical_expr::ReturnFieldArgs,
    ) -> datafusion::error::Result<datafusion::arrow::datatypes::FieldRef> {
        Ok(args.arg_fields[0]
            .as_ref()
            .clone()
            .with_metadata(HashMap::from([(
                "ARROW:extension:metadata".to_string(),
                "foofy.foofy".to_string(),
            )]))
            .into())
    }

    fn invoke_with_args(
        &self,
        args: datafusion::logical_expr::ScalarFunctionArgs,
    ) -> datafusion::error::Result<datafusion::logical_expr::ColumnarValue> {
        Ok(args.args[0].clone())
    }
}
```
</details>
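
For anyone following along, the failure mode described in the quoted comment (a physical node rebuilding its output field instead of reusing the logical plan's field) can be sketched with a toy model. This is not DataFusion or SedonaDB code: `Field`, `physical_field_rebuilt`, and `physical_field_reused` are hypothetical names made up for illustration, and the only assumption is that the rebuilt field is constructed from the name and data type alone.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow field: name, type, and metadata."""

    name: str
    data_type: str
    metadata: dict = field(default_factory=dict)


def physical_field_rebuilt(logical: Field) -> Field:
    # Builds a fresh field from name and type only, so the metadata is lost.
    return Field(logical.name, logical.data_type)


def physical_field_reused(logical: Field) -> Field:
    # Passes the logical field through unchanged, so the metadata survives.
    return logical


# A logical output field carrying extension metadata (illustrative values).
logical = Field("env", "Binary", {"ARROW:extension:name": "geoarrow.wkb"})

print(physical_field_rebuilt(logical).metadata)
print(physical_field_reused(logical).metadata)
```

If the rebuilt variant is what happens in the physical aggregate, the extension type would vanish from the output schema, which is what the Rust reproducer above is meant to detect for the scalar case.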