gstvg commented on PR #9743:
URL: 
https://github.com/apache/arrow-datafusion/pull/9743#issuecomment-2018869844

   Thank you guys for all the comments and the review. 
   
   Prior to applying the requested changes, I wanna say that I think I didn't 
elaborated enough what this PR does and most importantly, why it does that way, 
and now I'm afraid maybe this is not the best way.
   
   How @alamb noted and @yyy1000 agreed, ideally we would simply extend the 
current struct function to support named arguments, but because of the scalar 
function signature `fn invoke(&self, args: &[ColumnarValue]) -> 
Result<ColumnarValue>`, I didn't figured out any way to associate a name or any 
kind of metadata with the args. Self in this case is the "global" function, not 
unique to any specific invocation. Changing it's signature it's a big change, 
and I personally like it's simplicity. We can add metadata to the 
ColumnarValue, but again, feels like a big change.
   
   We could modify the existing struct func to accept pairs of name and value, 
but it would be breaking change, e.g: given an existing call `struct("text", 
1)` should it create a struct with 2 fields named "c0" and "c1", or a single 
field named "text"?
   
   And most importantly, It would not support the much nicer syntax 
`struct("valueOrName?", "value" as another_name)` because it would be ambiguos, 
is the first arg a value, or the name of the second arg? If it's a name, and 
it's corresponding value pair is a named arg, what name takes precedende?
   
   So, we can create a new separate function, that accept pairs of name and 
value, but not named arguments, and it would give us a minimun level of support 
without a nice syntax, and also naming it `named_struct` it becomes compatible 
with Spark but that's not the priority now, of course.
   
   Then, there's the improvement of supporting the `struct(value as name)` 
syntax.
   
   Currently, this PR, while parsing `sqlparser::Expr::Struct`, inspect the 
args `Expr` and if it is `Expr::Named`, use its name, otherwise the cN 
convetion, then call `named_struct`, not `struct`.
   But doing this, what exaclty  struct(...) becomes ? A function? Expression? 
"Meta-function"? How should the docs call it? Users inspecting projections and 
plans would only see `named_struct` calls, it is acceptable? It is indeed a bit 
confusing.
   Also, if Expr-based code calls `struct` with `datafusion::Expr::Alias`, this 
PR can't identify it and return field names with the cN convetion, which is 
probably unexpected. Should we delete the struct function, so Expr code can't 
use it?
   
   Prior opening this PR, I though this:
   
   > Return the existing struct call wrapped in a Expr::Cast to a struct with 
custom field names, but this would only benefit SQL users.
   
   [I was wrapping the call with a cast while parsing 
`sqlparser::Expr`.](https://github.com/gstvg/arrow-datafusion/commit/1863a696449c9d0076341fcf92276299d77de223)
   
   But if 
[instead](https://github.com/gstvg/arrow-datafusion/commit/98ff23912a260e27304f6e48b74b2e8bd008e7e4)
 we parse `sqlparser::Expr::Named` to `datafusion::Expr::Alias`, then add 
`FunctionRewrite` that wraps `struct` calls with cast, using the name of 
`Expr::Alias`, if any, or the cN convetion, we support both SQL and Expr users 
via the same code in a more expected way, i think.
   The problem with it is that, per current `FunctionRewrite` 
[docs](https://github.com/apache/arrow-datafusion/blob/0b955776d172b4eb304097d8bab0bd82c3d20915/datafusion/expr/src/expr_rewriter/mod.rs#L37)
   
   ```
   Trait for rewriting [`Expr`]s into function calls.
   
   This trait is used with `FunctionRegistry::register_function_rewrite` to
   to evaluating `Expr`s using functions that may not be built in to DataFusion
   
   For example, concatenating arrays `a || b` is represented as
   `Operator::ArrowAt`, but can be implemented by calling a function
   `array_concat` from the `functions-array` crate.
   ```
   we are using it in the opposite way it was designed for. It is okay to use 
it that way, updating the docs, or it's misuse?
   It also create [very long projection plan 
strings](https://github.com/gstvg/arrow-datafusion/commit/98ff23912a260e27304f6e48b74b2e8bd008e7e4#diff-065f63a22dadb8c281756ba1a6354ec6661224df0aae1398912e57259d1c6ceeR66),
 which may be undesired.
   
   So, given the options:
   
   1.  Add `named_struct` function
   1.1. Forward `struct` calls to `named_struct`
   2. Rewrite `struct` into a `Expr::Cast(struct_fn_expr, 
struct_datatype_with_named_fields)`
   
   We can do:
   
   - Only 1 
   - Only 2
   - 1 & 1.1 (This PR)
   - 1 & 2(for spark compatibility only, in a different PR)
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to