timsaucer commented on PR #17289: URL: https://github.com/apache/datafusion/pull/17289#issuecomment-3218278949
After experimenting a little more, I can see two paths forward for supporting cases like this dataframe:

```
+--------------+--------------+
| a            | b            |
+--------------+--------------+
| 0.1111111111 | [1, 2, 3]    |
| 0.2222222222 |              |
|              | [4, 5, 6, 7] |
| 0.4444444444 | []           |
+--------------+--------------+
```

Suppose I want to make a `round` call where column `a` is the value to round and column `b` is the number of decimal places to round to. Ultimately I want this to give an output like:

```
+--------------+--------------+--------------------+
| a            | b            | round(a, b[])      |
+--------------+--------------+--------------------+
| 0.1111111111 | [1, 2, 3]    | [0.1, 0.11, 0.111] |
| 0.2222222222 |              |                    |
|              | [4, 5, 6, 7] |                    |
| 0.4444444444 | []           | []                 |
+--------------+--------------+--------------------+
```

A difficulty here is that each entry of `a` has to be paired with multiple entries of `b`. It appears the best way to do this is to use run-end encoding (REE). Then we could keep the `ArrayRef` for column `a` and create a small primitive array of indices `[3, 4, 8, 9]`, which should give us an array with the same length as the `values` array of column `b`. I have tested this locally, but I run into the problem that the existing scalar functions do not handle run-end encoded arrays. Every one of these functions would need REE support implemented, as would any UDFs that customers create.

An alternative would be to create a new array for `a` and simply duplicate the data as many times as necessary. This feels like it could lead to excessive memory consumption, since we are duplicating values just to feed them into a function and throw them away afterwards. Yet it has the advantage that it would immediately support *all* scalar functions we have, with no additional work.

I'm a bit torn on this. I also have other reasons to want additional REE support throughout DataFusion.
So pushing the first approach would lead to long term benefit but have a much longer tail of implementation.
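
To make the trade-off concrete, here is a rough pure-Python sketch of the two ways of lining `a` up with the flattened values of `b`. It is not DataFusion or arrow-rs code; the names `run_values`, `run_ends`, `ree_get`, and `dup_a` are hypothetical. It assumes that null and empty lists contribute zero elements and that zero-length runs are dropped, so the run ends here come out as `[3, 7]`. An implementation that reserves a slot for null or empty rows would produce different indices.

```python
from bisect import bisect_right
from itertools import accumulate

# Columns from the example table; None stands in for a null entry.
a = [0.1111111111, 0.2222222222, None, 0.4444444444]
b = [[1, 2, 3], None, [4, 5, 6, 7], []]

# Number of child values each row of `b` contributes (assumption: null -> 0).
lengths = [len(x) if x is not None else 0 for x in b]
flat_b = [v for x in b if x is not None for v in x]

# Approach 1: run-end style. Keep one value of `a` per non-empty run and
# a small array of cumulative run ends.
run_values = [av for av, n in zip(a, lengths) if n > 0]
run_ends = list(accumulate(n for n in lengths if n > 0))  # [3, 7]

def ree_get(i):
    """Return the value of `a` paired with flattened position i."""
    return run_values[bisect_right(run_ends, i)]

# Approach 2: duplication. Materialize one copy of `a` per child value.
dup_a = [av for av, n in zip(a, lengths) for _ in range(n)]

# Both approaches describe the same logical array, aligned with flat_b.
assert [ree_get(i) for i in range(len(flat_b))] == dup_a
```

In this sketch the run-end form stores two values and two run ends, while the duplicated form stores seven values. The catch is that any function consuming the first form has to understand the lookup in `ree_get`, whereas the second form is a plain array that existing functions can already take.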