mesejo commented on issue #64:
URL:
https://github.com/apache/arrow-datafusion-python/issues/64#issuecomment-1685981996
I can no longer reproduce this issue with the current version of DataFusion
(`28.0.0`):
```python
import datafusion
import pandas as pd
# Left table: written to x.csv
x = pd.DataFrame(data={'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2],
                       'col3': [3, 4, 1, 2, 3]})
x.to_csv('x.csv', index=False)
# Right table: same id1/col2 values, different col3; written to small.csv
small = pd.DataFrame(data={'id1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2],
                           'col3': [5, 6, 7, 8, 9]})
small.to_csv('small.csv', index=False)
ctx = datafusion.SessionContext()
ctx.register_csv(name="x", path="x.csv")
ctx.register_csv(name="small", path="small.csv")
df = ctx.sql("SELECT * FROM x INNER JOIN small ON small.id1 = x.id1")
df.show()
```
**Output**
```
DataFrame()
+-----+------+------+-----+------+------+
| id1 | col2 | col3 | id1 | col2 | col3 |
+-----+------+------+-----+------+------+
| 2   | 4    | 4    | 2   | 4    | 6    |
| 1   | 3    | 3    | 1   | 3    | 5    |
| 5   | 5    | 2    | 5   | 5    | 8    |
| 4   | 3    | 1    | 4   | 3    | 7    |
| 6   | 2    | 3    | 6   | 2    | 9    |
+-----+------+------+-----+------+------+
```
However, when I try to transform it into a pandas DataFrame, I do get an
error:
```python
df.to_pandas()
```
**Error**
```
Traceback (most recent call last):
  File "bug.py", line 19, in <module>
    df.to_pandas()
  File "pyarrow/array.pxi", line 837, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 4114, in pyarrow.lib.Table._to_pandas
  File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 819, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 938, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1184, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```
This behavior is weird: internally, DataFusion knows that the columns come
from different schemas. For example, selecting the qualified columns works:
```python
from datafusion import col

df.select(col("small.id1"), col("small.col2"), col("small.col3")).collect()
```
Perhaps it is worth trying to qualify overlapping names when converting to a
pandas DataFrame?
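To make the suggestion concrete, here is a minimal sketch of what qualifying only the overlapping names could look like. It is plain Python and does not touch DataFusion internals; the helper name `qualify_overlapping` and the hand-written `qualifiers` list are hypothetical, and the `x.id1` naming scheme is just one possible choice.

```python
from collections import Counter


def qualify_overlapping(names, qualifiers):
    """Prefix every column name that occurs more than once with its
    table qualifier (e.g. 'id1' -> 'x.id1'); unique names are kept as-is.

    Hypothetical helper for illustration, not part of DataFusion or pyarrow.
    """
    if len(names) != len(qualifiers):
        raise ValueError("names and qualifiers must have the same length")
    counts = Counter(names)
    return [
        f"{qualifier}.{name}" if counts[name] > 1 else name
        for name, qualifier in zip(names, qualifiers)
    ]


# Column layout of the join result shown above: three columns from `x`,
# followed by three columns from `small`. The qualifiers are written out by
# hand here (an assumption); internally DataFusion would already know them.
names = ["id1", "col2", "col3", "id1", "col2", "col3"]
qualifiers = ["x", "x", "x", "small", "small", "small"]

unique_names = qualify_overlapping(names, qualifiers)
print(unique_names)
```

On the user side, the resulting names could presumably be applied to the Arrow table with pyarrow's `Table.rename_columns(unique_names)` before calling `to_pandas()`, though I have not tested that against DataFusion `28.0.0`.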