Fokko opened a new issue, #38615: URL: https://github.com/apache/arrow/issues/38615
### Describe the enhancement requested

For PyIceberg recently, concatenation of tables has been added: https://github.com/apache/arrow/pull/36846

To add new fields I concat the requested schema with the data that was loaded. However, now I'm hitting the next barrier, unable to project the schemas of nested structs.

Bit of context. For the top-level schema it is not an issue because we can select the columns that we need when reading in the table, but it doesn't allow selection of nested columns.

Selecting a subset:

```
➜ Desktop python3
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>>
>>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
>>> tbl = pa.Table.from_pylist(
...     [
...         {"x": 52.371807, "y": 4.896029},
...         {"x": 52.387386, "y": 4.646219},
...         {"x": 52.078663, "y": 4.288788},
...     ],
...     schema=current_schema,
... )
>>> schema_with_z = pa.schema(
...     [
...         pa.field("x", pa.float32()),
...     ]
... )
>>> tbl.cast(schema_with_z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
```

Or in a nested struct:

```
➜ Desktop python3
Python 3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>>
>>> current_schema = pa.schema(
...     pa.field(
...         "location",
...         pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
...     )
... )
>>>
>>> tbl = pa.Table.from_pylist(
...     [
...         {"location": {"x": 52.371807, "y": 4.896029}},
...         {"location": {"x": 52.387386, "y": 4.646219}},
...         {"location": {"x": 52.078663, "y": 4.288788}},
...     ],
...     schema=current_schema,
... )
>>> schema_without_x = pa.schema(
...     pa.field(
...         "location",
...         pa.struct(
...             [
...                 pa.field("x", pa.float32()),
...             ]
...         ),
...     )
... )
>>> tbl.cast(schema_without_x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
```

Any thoughts on adding this? Or can we achieve this in another way?

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
