[I] Allow projection of schemas/structs [arrow]

via GitHub Mon, 06 Nov 2023 12:32:28 -0800


Fokko opened a new issue, #38615:
URL: https://github.com/apache/arrow/issues/38615


   ### Describe the enhancement requested
   
   Concatenation of tables was added recently, which is useful for PyIceberg: 
https://github.com/apache/arrow/pull/36846. To add new fields, I concatenate the 
requested schema with the data that was loaded. However, I'm now hitting the 
next barrier: I'm unable to project the schemas of nested structs.
   
   A bit of context: for the top-level schema this is not an issue, because we 
can select the columns we need when reading in the table. That mechanism does 
not allow selecting nested columns, though.
   
   Selecting a subset:
   ```
   ➜  Desktop python3
   Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow as pa
   >>> 
   >>> current_schema = pa.schema([pa.field("x", pa.float32()), pa.field("y", pa.float32())])
   >>> tbl = pa.Table.from_pylist(
   ...     [
   ...         {"x": 52.371807, "y": 4.896029},
   ...         {"x": 52.387386, "y": 4.646219},
   ...         {"x": 52.078663, "y": 4.288788},
   ...     ],
   ...     schema=current_schema,
   ... )
   >>> schema_only_x = pa.schema(
   ...     [
   ...         pa.field("x", pa.float32()),
   ...     ]
   ... )
   >>> tbl.cast(schema_only_x)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
   ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
   ```
   
   Or in a nested struct:
   ```
   ➜  Desktop python3
   Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow as pa
   >>> 
   >>> current_schema = pa.schema(
   ...     [
   ...         pa.field(
   ...             "location",
   ...             pa.struct([pa.field("x", pa.float32()), pa.field("y", pa.float32())]),
   ...         )
   ...     ]
   ... )
   >>> 
   >>> tbl = pa.Table.from_pylist(
   ...     [
   ...         {"location": {"x": 52.371807, "y": 4.896029}},
   ...         {"location": {"x": 52.387386, "y": 4.646219}},
   ...         {"location": {"x": 52.078663, "y": 4.288788}},
   ...     ],
   ...     schema=current_schema,
   ... )
   >>> schema_without_y = pa.schema(
   ...     [
   ...         pa.field(
   ...             "location",
   ...             pa.struct(
   ...                 [
   ...                     pa.field("x", pa.float32()),
   ...                 ]
   ...             ),
   ...         )
   ...     ]
   ... )
   >>> tbl.cast(schema_without_y)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "pyarrow/table.pxi", line 3793, in pyarrow.lib.Table.cast
   ValueError: Target schema's field names are not matching the table's field names: ['x', 'y'], ['x']
   ```
   
   Any thoughts on adding this? Or can we achieve this in another way?
   
   ### Component(s)
   
   Python


