[GitHub] [arrow-datafusion-python] mesejo commented on issue #64: Python bindings create duplicated qualified fields after joining

via GitHub Mon, 21 Aug 2023 02:32:49 -0700


mesejo commented on issue #64:
URL: 
https://github.com/apache/arrow-datafusion-python/issues/64#issuecomment-1685981996


   I can no longer reproduce this issue with the current version of DataFusion (`28.0.0`):
   
   ```python
   import datafusion
   import pandas as pd
   
   x = pd.DataFrame(data={'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3]})
   x.to_csv('x.csv', index=False)
   
   small = pd.DataFrame(data={'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [5, 6, 7, 8, 9]})
   small.to_csv('small.csv', index=False)
   
   
   ctx = datafusion.SessionContext()
   ctx.register_csv(name="x", path="x.csv")
   ctx.register_csv(name="small", path="small.csv")
   
   
   df = ctx.sql("SELECT * FROM x INNER JOIN small ON small.id1 = x.id1")
   df.show()
   ```
   **Output**
   ```
    DataFrame()
   +-----+------+------+-----+------+------+
   | id1 | col2 | col3 | id1 | col2 | col3 |
   +-----+------+------+-----+------+------+
   | 2   | 4    | 4    | 2   | 4    | 6    |
   | 1   | 3    | 3    | 1   | 3    | 5    |
   | 5   | 5    | 2    | 5   | 5    | 8    |
   | 4   | 3    | 1    | 4   | 3    | 7    |
   | 6   | 2    | 3    | 6   | 2    | 9    |
   +-----+------+------+-----+------+------+
   ```
   
   However, converting the result to a pandas DataFrame raises an error:
   ```python
   df.to_pandas()
   ```
   **Error**
   ```
   Traceback (most recent call last):
     File "bug.py", line 19, in <module>
       df.to_pandas()
     File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
     File "pyarrow/table.pxi", line 4114, in pyarrow.lib.Table._to_pandas
     File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
       columns = _deserialize_column_index(table, all_columns, column_indexes)
     File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
       columns = _flatten_single_level_multiindex(columns)
     File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1184, in _flatten_single_level_multiindex
       raise ValueError('Found non-unique column index')
   ValueError: Found non-unique column index
   ```
   
   This behavior is surprising, because internally DataFusion knows that the columns come from different schemas. For example, selecting the qualified columns works:
   ```python
   from datafusion import col
   
   df.select(col("small.id1"), col("small.col2"), col("small.col3")).collect()
   ```
   
   Perhaps it is worth qualifying overlapping column names when converting to a pandas DataFrame?
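
Until something like that exists, one possible workaround is to make the column names unique before the pandas conversion. The helper below is only a sketch: `dedupe_column_names` is a hypothetical name, not part of DataFusion or pyarrow, and it appends a positional suffix rather than the table qualifier, since the qualifier is not available from the plain list of output names.

```python
def dedupe_column_names(names):
    """Return a copy of `names` in which every entry is unique.

    The first occurrence of a name is kept as-is; later occurrences get
    a numeric suffix (`_2`, `_3`, ...). Hypothetical helper, not a
    DataFusion or pyarrow API.
    """
    used = set()
    result = []
    for name in names:
        candidate = name
        suffix = 1
        # Bump the suffix until the candidate no longer clashes with a
        # name that has already been emitted.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


joined = ["id1", "col2", "col3", "id1", "col2", "col3"]
print(dedupe_column_names(joined))
```

Assuming the join result is non-empty, the new names could then be applied on the Arrow side, for example by building a table with `pyarrow.Table.from_batches(df.collect())`, calling `rename_columns(...)` on it with the deduplicated names, and only then calling `to_pandas()`. Treat that as an untested suggestion rather than a confirmed fix.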


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

