matthewmturner opened a new issue #1507:
URL: https://github.com/apache/arrow-datafusion/issues/1507
**Describe the bug**
I'm working on getting DataFusion added to db-benchmark (#147). While
putting the benchmarks together, I hit an error in the join benchmark
that I wasn't expecting. Specifically, the error is:
```
Traceback (most recent call last):
File "datafusion/join-datafusion.py", line 72, in <module>
df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'ce9f0daee780e4f2796b9953bd267457c.id1'")
```
The test code that produced it is:
```
question = "small inner on int" # q1
gc.collect()
t_start = timeit.default_timer()
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
shape = ans_shape(ans)
print(shape)
t = timeit.default_timer() - t_start
t_start = timeit.default_timer()
df = ctx.create_dataframe([ans])
chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
chkt = timeit.default_timer() - t_start
m = memory_usage()
write_log(task=task, data=data_name, in_rows=x_data.num_rows,
          question=question, out_rows=shape[0], out_cols=shape[1], solution=solution,
          version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache,
          chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
del ans
gc.collect()
```
If I update the SQL to:
```
SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2
FROM x INNER JOIN small ON x.id1 = small.id1
```
I get:
```
Traceback (most recent call last):
File "datafusion/join-datafusion.py", line 73, in <module>
df = ctx.create_dataframe([ans])
Exception: DataFusion error: Plan("Schema contains duplicate qualified field name 'cb53bcf8886f449c3bd2651571df185d4.id4'")
```
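A plausible reading of both errors, offered as an assumption rather than a confirmed diagnosis: once the result is collected, the columns keep only their unqualified names, so `x.id4` and `small.id4` both become `id4`, and `create_dataframe` then puts them under one generated table qualifier. The sketch below uses plain Python (no DataFusion calls) to show which output names would collide for the projection above. The helper `duplicate_output_names` is hypothetical and exists only for illustration.

```python
from collections import Counter


def duplicate_output_names(select_list):
    """Return the unqualified column names that occur more than once, sorted."""
    # Drop the table qualifier: "small.id4" -> "id4".
    unqualified = [item.split(".")[-1] for item in select_list]
    counts = Counter(unqualified)
    return sorted(name for name, n in counts.items() if n > 1)


# Projection taken from the second query in this report.
projection = ["x.id1", "x.id2", "x.id3", "x.id4", "small.id4",
              "x.id5", "x.id6", "x.v1", "small.v2"]
dups = duplicate_output_names(projection)
print(dups)
```

The one name reported here is the same field named in the second traceback.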
To me this looks like a bug: I think I should be able to write the query
without having to alias the overlapping columns (when I alias the overlapping
columns, it works). For example, below is the equivalent Spark query.
```
select * from x join small using (id1)
```
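The aliasing workaround mentioned above can be sketched as follows. This only builds the SQL string in plain Python and does not call DataFusion. The helper name `aliased_projection` and the `small_` prefix are illustrative assumptions, and the column lists are inferred from the queries in this report rather than read from the actual tables.

```python
def aliased_projection(left, left_cols, right, right_cols, prefix):
    """Build a SELECT list in which clashing right-side columns get an alias."""
    taken = set(left_cols)
    items = [f"{left}.{c}" for c in left_cols]
    for c in right_cols:
        if c in taken:
            # Name already used by the left table: give it a distinct output name.
            items.append(f"{right}.{c} AS {prefix}{c}")
        else:
            items.append(f"{right}.{c}")
    return ", ".join(items)


# Column lists assumed from the projection shown in this issue.
x_cols = ["id1", "id2", "id3", "id4", "id5", "id6", "v1"]
small_cols = ["id1", "id4", "v2"]

select_list = aliased_projection("x", x_cols, "small", small_cols, "small_")
sql = f"SELECT {select_list} FROM x INNER JOIN small ON x.id1 = small.id1"
print(sql)
```

With unique output names, the collected result should no longer contain two fields with the same name when it is passed back to `create_dataframe`.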
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
I should be able to run either of the following:
```
ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
```
```
ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
df = ctx.create_dataframe([ans])
```