mildbyte commented on issue #212:
URL: 
https://github.com/apache/arrow-datafusion/issues/212#issuecomment-1269778100

   Is there any prior art / existing design documents for this either in the 
DataFusion repo or an application that uses DataFusion? We're interested in 
getting this working for https://github.com/splitgraph/seafowl/issues/137. In 
particular:
   
   - a `json_each()`/`json_array_elements()` function has to be set-returning
   - we could implement user-defined set-returning functions: 
https://github.com/apache/arrow-datafusion/issues/1604 
   - we could also go the [DuckDB 
route](https://duckdb.org/docs/extensions/json) of adding a function that turns 
JSON into an Arrow `List` and then relying on `UNNEST`
     - it'll still require implementing `UNNEST`, but then everything can be 
implemented as a List-returning function + `UNNEST`
   
   We could take a stab at implementing this, either as a PR to DataFusion or 
as a separate crate with planner nodes/optimizer rules.
   
   I did some digging into how PostgreSQL implements this and it seems like 
there are two positions for a set-returning function/UNNEST (which is 
implemented as a set-returning function in PG):
   
   ### Inside the `FROM` clause
   
   ```
   postgres=# WITH t AS (SELECT * FROM (VALUES 
       (1, '{11,12,13}'::int[]),
       (2, '{21,22,23}'::int[])) t(k, ks))
   SELECT * FROM t, UNNEST(t.ks);
   
    k |     ks     | unnest 
   ---+------------+--------
    1 | {11,12,13} |     11
    1 | {11,12,13} |     12
    1 | {11,12,13} |     13
    2 | {21,22,23} |     21
    2 | {21,22,23} |     22
    2 | {21,22,23} |     23
   (6 rows)
   
                                QUERY PLAN                             
   --------------------------------------------------------------------
    Nested Loop  (cost=0.00..0.43 rows=20 width=40)
      ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36)
      ->  Function Scan on unnest  (cost=0.00..0.10 rows=10 width=4)
   ```
   
   In this case it's a 
[NestLoop](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeNestloop.c#L129-L146)
 that grabs values from the outer loop and passes them as parameters to the 
[FuncScan](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeFunctionscan.c#L77-L107).
   
   ### As an expression
   
   ```
   postgres=# WITH t AS (SELECT * FROM (VALUES 
       (1, '{11,12,13}'::int[]),
       (2, '{21,22,23}'::int[])) t(k, ks))
   SELECT t.*, UNNEST(t.ks) FROM t;
   
                                QUERY PLAN                             
   --------------------------------------------------------------------
    ProjectSet  (cost=0.00..0.14 rows=20 width=40)
      ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=36)
   (2 rows)
   ```
   
   This turns it into a 
[ProjectSet](https://github.com/postgres/postgres/blob/051b096b8dd3b65f8e4528a8cd24a8702646e8b2/src/backend/executor/nodeProjectSet.c#L67-L117)
 which returns a product `{k: 1, ks: [11, 12, 13], unnest: UNNEST([11, 12, 
13])} -> {k: 1, ks: [11, 12, 13], unnest: 11}, ..., {k: 1, ks: [11, 12, 13], 
unnest: 13}, ...`
   
   ### The way to go?
   
   The `NestLoop` + `FuncScan` way requires parameterized plans, which don't 
yet seem to be supported in DataFusion (from inspecting the `ExecutionPlan` 
trait). I also haven't wrapped my head around the transformations we'd need to 
write to wire the whole thing up. It's possible there's some work that's been 
done with https://github.com/apache/arrow-datafusion/issues/2248 that makes 
this easier to build?
   
   The `ProjectSet` way is easier to figure out. At plan time, we'd inspect 
every `Projection` node for calls to functions that return a `List` and are 
somehow tagged as set-returning.  then we'd replace the `Projection` with a 
`ProjectSet` that "unnests" the List, e.g.
   
   ```
   SELECT t.k, UNNEST(t.ks) ...
   ```
   
   becomes
   
   ```
   Projection([Column("t", "k"), Unnest(Column("t", "ks"))])
   ```
   
   becomes
   
   ```
   ProjectSet(scalarValues=[Column("k")], toUnnest=[Column("t", "ks")])
   ```
   
   Something like
   
   ```
   -- assume t.ks is a string that represents a JSON array 
   -- assume the implementation for json_array_elements has the type
   --   Utf8 -> List[Utf8]
   SELECT t.k, json_array_elements(t.ks) ...
   ```
   
   would become
   
   ```
   Projection([Column("t", "k"), SetReturningUDF(JsonArrayElements, Column("t", 
"ks"))])
   
   or we could do
   
   Projection([Column("t", "k"), Unnest(ScalarUDF(JsonArrayElements, 
Column("t", "ks")))])
   ```
   
   which would become
   
   ```
   ProjectSet(scalarValues=[Column("k")], 
toUnnest=[ScalarUDF(JsonArrayElements, Column("t", "ks"))]])
   ```
   
   We could also modify the `Projection` node itself, but this way we'll 
possibly be able to implement it faster as a separate opt-in DataFusion 
extension (via custom planner nodes and optimizer rules).
   
   This won't let us place set-returning functions in the `FROM` clause, but I 
think the user will be able to emulate it via CTEs:
   
   ```sql
   SELECT
     t.*, json_extract_path(a, '{some, path}')
   FROM t, json_array_elements(t.ks) a
   
   -- becomes
   
   WITH unnested AS (
     SELECT t.*, json_array_elements(t.ks) AS a
     FROM t
   )
   SELECT unnested.*, json_extract_path(unnested.a, '{some, path}')
   FROM unnested
   ```
   
   Any thoughts/advice? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to