gstvg commented on PR #18921:
URL: https://github.com/apache/datafusion/pull/18921#issuecomment-3651816409
Thanks @shehabgamin. I analyzed the Spark implementation one last time, and
realized I could have done similarly since the beginning... I've already tested
it locally and pretend to push soon, the diff reduced to ~2K LOC, mostly in
new, self contained code with small changes in existing code and without
requiring changes in downstream code. I hope you haven't put time reviewing it
yet because I believe the spark approach is much better and also easier/faster
to review
Basically, for planning, we lazily set the Field in the lambda column
itself, instead of injecting it into a DFSchema
```rust
//logical
struct LambdaColumn {
name: String,
field: Option<FieldRef>,
}
```
And for evaluation, again, we lazily set the value of a lambda column
instead of injecting it into a RecordBatch
```rust
//physical
struct LambdaColumn {
name: String,
field: FieldRef,
value: Option<ColumnarValue>,
}
```
I think I initially ruled that out because other expressions doesn't contain
it's own Field, and this have been discussed before in #12604 *, and didn't
went forward, which the difference that the other expressions Field's naturally
exist in the plan schema, which *now* I believe justifies this difference.
And most importantly, because I had no idea of how much work and downstream
churn injecting the lambdas parameters into the Schema/RecordBatch would cause
:facepalm:
I will push soon and update the PR description, thanks again!
*This is being discussed again in #18845
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]