vincev opened a new pull request, #5106: URL: https://github.com/apache/arrow-datafusion/pull/5106
# Which issue does this PR close?

Work for #212

# Rationale for this change

This PR adds an `unnest_column` method to `DataFrame` that unnests list-type columns (see [tests](https://github.com/vincev/arrow-datafusion/blob/8a7059cefa2a02a8418a704b8d6ff08aead06fbf/datafusion/core/tests/dataframe.rs#L511)). Given the following data frame:

```
+----------+------------------------------------------------------------+--------------------+
| shape_id | points                                                     | tags               |
+----------+------------------------------------------------------------+--------------------+
| 1        | [{"x": -3, "y": -4}, {"x": -3, "y": 6}, {"x": 2, "y": -2}] | [tag1]             |
| 2        |                                                            | [tag1, tag2]       |
| 3        | [{"x": -9, "y": 2}, {"x": -10, "y": -4}]                   |                    |
| 4        | [{"x": -3, "y": 5}, {"x": 2, "y": -1}]                     | [tag1, tag2, tag3] |
+----------+------------------------------------------------------------+--------------------+
```

calling `df.unnest_column("tags")` produces:

```
+----------+------------------------------------------------------------+------+
| shape_id | points                                                     | tags |
+----------+------------------------------------------------------------+------+
| 1        | [{"x": -3, "y": -4}, {"x": -3, "y": 6}, {"x": 2, "y": -2}] | tag1 |
| 2        |                                                            | tag1 |
| 2        |                                                            | tag2 |
| 3        | [{"x": -9, "y": 2}, {"x": -10, "y": -4}]                   |      |
| 4        | [{"x": -3, "y": 5}, {"x": 2, "y": -1}]                     | tag1 |
| 4        | [{"x": -3, "y": 5}, {"x": 2, "y": -1}]                     | tag2 |
| 4        | [{"x": -3, "y": 5}, {"x": 2, "y": -1}]                     | tag3 |
+----------+------------------------------------------------------------+------+
```

calling `df.unnest_column("points")` produces:

```
+----------+---------------------+--------------------+
| shape_id | points              | tags               |
+----------+---------------------+--------------------+
| 1        | {"x": -3, "y": -4}  | [tag1]             |
| 1        | {"x": -3, "y": 6}   | [tag1]             |
| 1        | {"x": 2, "y": -2}   | [tag1]             |
| 2        |                     | [tag1, tag2]       |
| 3        | {"x": -9, "y": 2}   |                    |
| 3        | {"x": -10, "y": -4} |                    |
| 4        | {"x": -3, "y": 5}   | [tag1, tag2, tag3] |
| 4        | {"x": 2, "y": -1}   | [tag1, tag2, tag3] |
+----------+---------------------+--------------------+
```

and calling `df.unnest_column("points").unnest_column("tags")` produces:

```
+----------+---------------------+------+
| shape_id | points              | tags |
+----------+---------------------+------+
| 1        | {"x": -3, "y": -4}  | tag1 |
| 1        | {"x": -3, "y": 6}   | tag1 |
| 1        | {"x": 2, "y": -2}   | tag1 |
| 2        |                     | tag1 |
| 2        |                     | tag2 |
| 3        | {"x": -9, "y": 2}   |      |
| 3        | {"x": -10, "y": -4} |      |
| 4        | {"x": -3, "y": 5}   | tag1 |
| 4        | {"x": -3, "y": 5}   | tag2 |
| 4        | {"x": -3, "y": 5}   | tag3 |
| 4        | {"x": 2, "y": -1}   | tag1 |
| 4        | {"x": 2, "y": -1}   | tag2 |
| 4        | {"x": 2, "y": -1}   | tag3 |
+----------+---------------------+------+
```

# What changes are included in this PR?

This PR includes the following changes:

- Add an `unnest_column` method to `DataFrame`
- Add an `Unnest` variant to `LogicalPlan` that produces a new schema for the unnested column
- Add `UnnestExec` to the execution plan
- Add some tests to `DataFrame`

# Are these changes tested?

Added some initial tests [here](https://github.com/vincev/arrow-datafusion/blob/8a7059cefa2a02a8418a704b8d6ff08aead06fbf/datafusion/core/tests/dataframe.rs#L511); I am happy to add more tests following feedback.

# Are there any user-facing changes?

Adds an `unnest_column` method to `DataFrame`.
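
For readers who want the row-expansion rule from the example tables in code form, here is a minimal sketch in plain standard-library Rust. It is not the PR's `UnnestExec` implementation and does not use DataFusion or Arrow. The function name `unnest_rows` and the tuple-based row representation are hypothetical, chosen only for illustration. A null list is kept as a single row with a null value, as the tables show for shapes 2 and 3. The PR description does not show what happens to an empty list; in this sketch it simply yields no rows, which is an assumption.

```rust
/// Hypothetical helper: expand one list-valued column into one row per element.
/// `R` stands for all the other columns of the row, which are copied unchanged.
fn unnest_rows<R: Clone, T: Clone>(rows: &[(R, Option<Vec<T>>)]) -> Vec<(R, Option<T>)> {
    let mut out = Vec::new();
    for (rest, list) in rows {
        match list {
            // A null list is preserved as one row with a null value,
            // matching the rows with empty cells in the example tables.
            None => out.push((rest.clone(), None)),
            // One output row per list element. An empty list yields no rows
            // here; that case is an assumption, not shown in the PR.
            Some(items) => {
                for item in items {
                    out.push((rest.clone(), Some(item.clone())));
                }
            }
        }
    }
    out
}

fn main() {
    // The `shape_id` and `tags` columns from the example data frame.
    let rows = vec![
        (1, Some(vec!["tag1"])),
        (2, Some(vec!["tag1", "tag2"])),
        (3, None),
        (4, Some(vec!["tag1", "tag2", "tag3"])),
    ];
    for (shape_id, tag) in unnest_rows(&rows) {
        println!("{shape_id} {tag:?}");
    }
}
```

Applying the same function twice, once per list column, mirrors the chained `unnest_column("points").unnest_column("tags")` call: each surviving row of the first pass is expanded again by the second.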
