Re: [I] `parquet_path_to_id_mapping` generates incorrect path for List types [iceberg-python]

via GitHub Wed, 08 May 2024 11:30:44 -0700


cgbur commented on issue #716:
URL: https://github.com/apache/iceberg-python/issues/716#issuecomment-2101184351


   Here is a complete example that reproduces the error. I am using polars to write the table, which results in the same schema that I produce with pyarrow.
   
   ```python
   import polars as pl
   from pyiceberg.catalog.sql import SqlCatalog
   import pyarrow.parquet as pq
   import os
   import shutil
   
   # Write a single list-of-structs column "a" to a Parquet file
   pl.DataFrame(
       {
           "a": [[{"a": 1}, {"a": 2}], [{"a": 3}]],
       }
   ).write_parquet("example.parquet")
   
   warehouse_path = "/tmp/warehouse"
   
   # wipe the warehouse
   if os.path.exists(warehouse_path):
       shutil.rmtree(warehouse_path)
   
   os.makedirs(warehouse_path)
   
   catalog = SqlCatalog(
       "default",
       **{
           "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
           "warehouse": f"file://{warehouse_path}",
       },
   )
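   # Create the Iceberg table from the file's Arrow schema,
   # then register the existing Parquet file with it
   df = pq.read_table("example.parquet")
   catalog.create_namespace("default")
   table = catalog.create_table(
       "default.webserver",
       schema=df.schema,
   )
   table.add_files(["example.parquet"])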
   ```
   And here is the error. The first two lines of output come from debug statements I added; they show that the mapping has the incorrect path.
   ```python
       print(f"column mappings {len(parquet_column_mapping)}")
       print(parquet_column_mapping)
   ```
   ```
   column mappings 1
   {'a.list.element.a': 3}
   Traceback (most recent call last):
     File "/home/cgbur/pyice-test/failure.py", line 36, in <module>
       table.add_files(["example.parquet"])
     File "/local/home/cgbur/pyice-test/iceberg-python/pyiceberg/table/__init__.py", line 1355, in add_files
       tx.add_files(file_paths=file_paths, snapshot_properties=snapshot_properties)
     File "/local/home/cgbur/pyice-test/iceberg-python/pyiceberg/table/__init__.py", line 462, in add_files
       for data_file in data_files:
     File "/local/home/cgbur/pyice-test/iceberg-python/pyiceberg/table/__init__.py", line 2737, in _parquet_files_to_data_files
       yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
     File "/local/home/cgbur/pyice-test/iceberg-python/pyiceberg/io/pyarrow.py", line 1869, in parquet_files_to_data_files
       statistics = data_file_statistics_from_parquet_metadata(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/local/home/cgbur/pyice-test/iceberg-python/pyiceberg/io/pyarrow.py", line 1734, in data_file_statistics_from_parquet_metadata
       field_id = parquet_column_mapping[column.path_in_schema]
                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
   KeyError: 'a.list.item.a'
   ```
   
   You can see that the parquet `path_in_schema` uses `item` instead of `element`, so the lookup in `parquet_column_mapping` fails.
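
The mismatch can be shown in isolation with plain Python. This is only a minimal sketch: the two string values are copied from the output above, and `normalize_list_path` is a hypothetical helper written for illustration. It is not part of pyiceberg and is not a proposed fix.

```python
# Mapping as printed by the debug statements (key uses "element").
parquet_column_mapping = {"a.list.element.a": 3}

# Path reported by the Parquet metadata, copied from the KeyError (uses "item").
path_in_schema = "a.list.item.a"


def normalize_list_path(path: str) -> str:
    """Hypothetical helper: rewrite each `list.item` segment pair to `list.element`."""
    parts = path.split(".")
    for i in range(1, len(parts)):
        if parts[i - 1] == "list" and parts[i] == "item":
            parts[i] = "element"
    return ".".join(parts)


# The direct lookup misses, which is what raises the KeyError.
print(path_in_schema in parquet_column_mapping)  # False

# After renaming the list element segment, the field id is found.
print(parquet_column_mapping[normalize_list_path(path_in_schema)])  # 3
```

Note that a string rewrite like this would also change a struct field that is genuinely named `item` under a field named `list`, so it only illustrates where the two paths diverge.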

