mildbyte commented on issue #212: URL: https://github.com/apache/arrow-datafusion/issues/212#issuecomment-1269778100
Is there any prior art / existing design documents for this either in the DataFusion repo or an application that uses DataFusion? We're interested in getting this working for https://github.com/splitgraph/seafowl/issues/137. In particular: - a `json_each()`/`json_array_elements()` function has to be set-returning - we could implement user-defined set-returning functions: https://github.com/apache/arrow-datafusion/issues/1604 - we could also go the [DuckDB route](https://duckdb.org/docs/extensions/json) of adding a function that turns JSON into an Arrow `List` and then relying on `UNNEST` - it'll still require implementing `UNNEST`, but then everything can be implemented as a List-returning function + `UNNEST` We could take a stab at implementing this, either as a PR to DataFusion or as a separate crate with planner nodes/optimizer rules. I did some digging into how PostgreSQL implements this and it seems like there are two positions for a set-returning function/UNNEST (which is implemented as a set-returning function in PG): ### Inside the `FROM` clause ``` postgres=# WITH t AS (SELECT * FROM (VALUES (1, '{11,12,13}'::int[]), (2, '{21,22,23}'::int[])) t(k, ks)) SELECT * FROM t, UNNEST(t.ks); k | ks | unnest ---+------------+-------- 1 | {11,12,13} | 11 1 | {11,12,13} | 12 1 | {11,12,13} | 13 2 | {21,22,23} | 21 2 | {21,22,23} | 22 2 | {21,22,23} | 23 (6 rows) QUERY PLAN -------------------------------------------------------------------- Nested Loop (cost=0.00..0.43 rows=20 width=40) -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=36) -> Function Scan on unnest (cost=0.00..0.10 rows=10 width=4) ``` In this case it's a [NestLoop](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeNestloop.c#L129-L146) that grabs values from the outer loop and passes them as parameters to the [FuncScan](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeFunctionscan.c#L77-L107). ### As an expression ``` postgres=# WITH t AS (SELECT * FROM (VALUES (1, '{11,12,13}'::int[]), (2, '{21,22,23}'::int[])) t(k, ks)) SELECT t.*, UNNEST(t.ks) FROM t; QUERY PLAN -------------------------------------------------------------------- ProjectSet (cost=0.00..0.14 rows=20 width=40) -> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=36) (2 rows) ``` This turns it into a [ProjectSet](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeProjectSet.c#L67-L117) which returns a product `{k: 1, ks: [11, 12, 13], unnest: UNNEST([11, 12, 13])} -> {k: 1, ks: [11, 12, 13], unnest: 11}, ..., {k: 1, ks: [11, 12, 13], unnest: 13}, ...` ### The way to go? The `NestLoop` + `FuncScan` way requires parameterized plans, which don't yet seem to be supported in DataFusion (from inspecting the `ExecutionPlan` trait). I also haven't wrapped my head around the transformations we'd need to write to wire the whole thing up. It's possible there's some work that's been done with https://github.com/apache/arrow-datafusion/issues/2248 that makes this easier to build? The `ProjectSet` way is easier to figure out. At plan time, we'd inspect every `Projection` node for calls to functions that return a `List` and are somehow tagged as set-returning. then we'd replace the `Projection` with a `ProjectSet` that "unnests" the List, e.g. ``` SELECT t.k, UNNEST(t.ks) ... ``` becomes ``` Projection([Column("t", "k"), Unnest(Column("t", "ks"))]) ``` becomes ``` ProjectSet(scalarValues=[Column("k")], toUnnest=[Column("t", "ks")]) ``` Something like ``` -- assume t.ks is a string that represents a JSON array -- assume the implementation for json_array_elements has the type -- Utf8 -> List[Utf8] SELECT t.k, json_array_elements(t.ks) ... ``` would become ``` Projection([Column("t", "k"), SetReturningUDF(JsonArrayElements, Column("t", "ks"))]) or we could do Projection([Column("t", "k"), Unnest(ScalarUDF(JsonArrayElements, Column("t", "ks")))]) ``` which would become ``` ProjectSet(scalarValues=[Column("k")], toUnnest=[ScalarUDF(JsonArrayElements, Column("t", "ks"))]]) ``` We could also modify the `Projection` node itself, but this way we'll possibly be able to implement it faster as a separate opt-in DataFusion extension (via custom planner nodes and optimizer rules). This won't let us place set-returning functions in the `FROM` clause, but I think the user will be able to emulate it via CTEs: ```sql SELECT t.*, json_extract_path(a, '{some, path}') FROM t, json_array_elements(t.ks) a -- becomes WITH unnested AS ( SELECT t.*, json_array_elements(t.ks) AS a FROM t ) SELECT unnested.*, json_extract_path(unnested.a, '{some, path}') FROM unnested ``` Any thoughts/advice? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
