gstvg commented on PR #9743:
URL:
https://github.com/apache/arrow-datafusion/pull/9743#issuecomment-2018869844
Thank you guys for all the comments and the review.
Prior to applying the requested changes, I wanna say that I think I didn't
elaborated enough what this PR does and most importantly, why it does that way,
and now I'm afraid maybe this is not the best way.
How @alamb noted and @yyy1000 agreed, ideally we would simply extend the
current struct function to support named arguments, but because of the scalar
function signature `fn invoke(&self, args: &[ColumnarValue]) ->
Result<ColumnarValue>`, I didn't figured out any way to associate a name or any
kind of metadata with the args. Self in this case is the "global" function, not
unique to any specific invocation. Changing it's signature it's a big change,
and I personally like it's simplicity. We can add metadata to the
ColumnarValue, but again, feels like a big change.
We could modify the existing struct func to accept pairs of name and value,
but it would be breaking change, e.g: given an existing call `struct("text",
1)` should it create a struct with 2 fields named "c0" and "c1", or a single
field named "text"?
And most importantly, It would not support the much nicer syntax
`struct("valueOrName?", "value" as another_name)` because it would be ambiguos,
is the first arg a value, or the name of the second arg? If it's a name, and
it's corresponding value pair is a named arg, what name takes precedende?
So, we can create a new separate function, that accept pairs of name and
value, but not named arguments, and it would give us a minimun level of support
without a nice syntax, and also naming it `named_struct` it becomes compatible
with Spark but that's not the priority now, of course.
Then, there's the improvement of supporting the `struct(value as name)`
syntax.
Currently, this PR, while parsing `sqlparser::Expr::Struct`, inspect the
args `Expr` and if it is `Expr::Named`, use its name, otherwise the cN
convetion, then call `named_struct`, not `struct`.
But doing this, what exaclty struct(...) becomes ? A function? Expression?
"Meta-function"? How should the docs call it? Users inspecting projections and
plans would only see `named_struct` calls, it is acceptable? It is indeed a bit
confusing.
Also, if Expr-based code calls `struct` with `datafusion::Expr::Alias`, this
PR can't identify it and return field names with the cN convetion, which is
probably unexpected. Should we delete the struct function, so Expr code can't
use it?
Prior opening this PR, I though this:
> Return the existing struct call wrapped in a Expr::Cast to a struct with
custom field names, but this would only benefit SQL users.
[I was wrapping the call with a cast while parsing
`sqlparser::Expr`.](https://github.com/gstvg/arrow-datafusion/commit/1863a696449c9d0076341fcf92276299d77de223)
But if
[instead](https://github.com/gstvg/arrow-datafusion/commit/98ff23912a260e27304f6e48b74b2e8bd008e7e4)
we parse `sqlparser::Expr::Named` to `datafusion::Expr::Alias`, then add
`FunctionRewrite` that wraps `struct` calls with cast, using the name of
`Expr::Alias`, if any, or the cN convetion, we support both SQL and Expr users
via the same code in a more expected way, i think.
The problem with it is that, per current `FunctionRewrite`
[docs](https://github.com/apache/arrow-datafusion/blob/0b955776d172b4eb304097d8bab0bd82c3d20915/datafusion/expr/src/expr_rewriter/mod.rs#L37)
```
Trait for rewriting [`Expr`]s into function calls.
This trait is used with `FunctionRegistry::register_function_rewrite` to
to evaluating `Expr`s using functions that may not be built in to DataFusion
For example, concatenating arrays `a || b` is represented as
`Operator::ArrowAt`, but can be implemented by calling a function
`array_concat` from the `functions-array` crate.
```
we are using it in the opposite way it was designed for. It is okay to use
it that way, updating the docs, or it's misuse?
It also create [very long projection plan
strings](https://github.com/gstvg/arrow-datafusion/commit/98ff23912a260e27304f6e48b74b2e8bd008e7e4#diff-065f63a22dadb8c281756ba1a6354ec6661224df0aae1398912e57259d1c6ceeR66),
which may be undesired.
So, given the options:
1. Add `named_struct` function
1.1. Forward `struct` calls to `named_struct`
2. Rewrite `struct` into a `Expr::Cast(struct_fn_expr,
struct_datatype_with_named_fields)`
We can do:
- Only 1
- Only 2
- 1 & 1.1 (This PR)
- 1 & 2(for spark compatibility only, in a different PR)
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]