[I] Different behavior in datafusion 35.0.0 in reading hive-partitioned parquet data [arrow-datafusion-python]

via GitHub Tue, 06 Feb 2024 13:28:51 -0800


jwimberl opened a new issue, #579:
URL: https://github.com/apache/arrow-datafusion-python/issues/579


   **Describe the bug**
   `pip` recently switched to installing datafusion version `'35.0.0'`. 
Compared to a previous installation of version `'34.0.0'`, creating an 
external table from hive-partitioned parquet data following the 
[documented instructions](https://arrow.apache.org/datafusion/user-guide/sql/ddl.html) 
no longer works. All the partition columns show up as columns of the table, 
but the columns from the parquet data themselves do not appear.
   
   **To Reproduce**
   ```
   # prepare fake data
   import os
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   data = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
   table = pa.Table.from_pandas(data)
   # exist_ok lets the script be re-run without failing on the directory
   os.makedirs("fake=0", exist_ok=True)
   pq.write_table(table, "./fake=0/data.parquet")
   
   # load into datafusion
   import datafusion as df
   ctx = df.SessionContext()
   ctx.sql("""
   CREATE EXTERNAL TABLE data
   STORED AS PARQUET
   PARTITIONED BY (fake)
   LOCATION './*/data.parquet'
   """)
   ```
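For reference, Hive-style partitioning encodes each partition column in a directory name of the form `key=value`, which is why `fake` is expected to appear as a column alongside the columns stored in the file. The sketch below only illustrates that naming convention using the standard library; the helper name `hive_partitions` is hypothetical and this is not how DataFusion implements partition discovery.

```python
from pathlib import PurePosixPath


def hive_partitions(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a path (illustrative only)."""
    parts = {}
    # Only directory components carry partition values; skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value
    return parts


print(hive_partitions("./fake=0/data.parquet"))
```

For the layout in the reproduction, the only partition column derived this way is `fake`, so `col1` and `col2` can only come from the schema of the parquet file.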
   
   The loaded data is missing `col1` and `col2`:
   ```
   >>> ctx.sql("SELECT * FROM data")
   DataFrame()
   +------+
   | fake |
   +------+
   | 0    |
   | 0    |
   +------+
   >>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
   DataFrame()
   +------------+-------------+
   | table_name | column_name |
   +------------+-------------+
   | data       | fake        |
   +------------+-------------+
   ```
   
   **Expected behavior**
   The same steps with DataFusion `34.0.0` produce the following output:
   ```
   >>> ctx.sql("SELECT * FROM data");
   DataFrame()
   +------+------+------+
   | col1 | col2 | fake |
   +------+------+------+
   | 1    | 3    | 0    |
   | 2    | 4    | 0    |
   +------+------+------+
   >>> ctx.sql("SELECT table_name, column_name FROM information_schema.columns")
   DataFrame()
   +------------+-------------+
   | table_name | column_name |
   +------------+-------------+
   | data       | col1        |
   | data       | col2        |
   | data       | fake        |
   +------------+-------------+
   ```
   
   **Additional context**
   Operating system: Rocky 8
   Python version: `3.10.11`
   DataFusion version: `35.0.0`, recently installed via pip
   pyarrow version: `15.0.0`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

