[GitHub] [arrow-datafusion] matthewmturner opened a new issue #1507: Python bindings create duplicated qualified fields after joining

GitBox Wed, 29 Dec 2021 13:00:06 -0800


matthewmturner opened a new issue #1507:
URL: https://github.com/apache/arrow-datafusion/issues/1507



   **Describe the bug**
   I'm working on getting DataFusion added to db-benchmark (#147). While 
putting the benchmarks together I came across an unexpected error in the join 
benchmark. Specifically, the error is:
   ```
   Traceback (most recent call last):
     File "datafusion/join-datafusion.py", line 72, in <module>
       df = ctx.create_dataframe([ans])
   Exception: DataFusion error: Plan("Schema contains duplicate qualified field 
name 'ce9f0daee780e4f2796b9953bd267457c.id1'")
   ```
   The test code that produced that is here:
   ```
   question = "small inner on int" # q1
   gc.collect()
   t_start = timeit.default_timer()
   ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
   shape = ans_shape(ans)
   print(shape)
   t = timeit.default_timer() - t_start
   t_start = timeit.default_timer()
   df = ctx.create_dataframe([ans])
   chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0]
   chkt = timeit.default_timer() - t_start
   m = memory_usage()
   write_log(task=task, data=data_name, in_rows=x_data.num_rows, 
question=question, out_rows=shape[0], out_cols=shape[1], solution=solution, 
version=ver, git=git, fun=fun, run=1, time_sec=t, mem_gb=m, cache=cache, 
chk=make_chk([chk]), chk_time_sec=chkt, on_disk=on_disk)
   del ans
   gc.collect()
   ```
   If I update the SQL to:
   ```
   SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, x.v1, small.v2 
FROM x INNER JOIN small ON x.id1 = small.id1
   ```
   I get:
   ```
   Traceback (most recent call last):
     File "datafusion/join-datafusion.py", line 73, in <module>
       df = ctx.create_dataframe([ans])
   Exception: DataFusion error: Plan("Schema contains duplicate qualified field 
name 'cb53bcf8886f449c3bd2651571df185d4.id4'")
   ```
   To me this looks like a bug, as I think I should be able to write the query 
without having to alias the overlapping columns (when I alias the overlapping 
columns it works). For example, below is the equivalent Spark query.
   ```
   select * from x join small using (id1)
   ```
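
For reference, a minimal sketch of the aliasing workaround mentioned above. The helper `aliased_select_list` is hypothetical (it is not part of DataFusion), and the column lists for `x` and `small` are assumptions read off the queries in this issue. It only builds the SQL text, giving a unique alias to every column name that appears in more than one table.

```python
from collections import Counter


def aliased_select_list(tables):
    """Build a SELECT list where column names shared by several tables get unique aliases.

    `tables` maps a table name to its column names (hypothetical helper, not a DataFusion API).
    """
    # Count how many tables each column name appears in.
    counts = Counter(col for cols in tables.values() for col in cols)
    items = []
    for table, cols in tables.items():
        for col in cols:
            if counts[col] > 1:
                # Overlapping name: alias it as <table>_<column>.
                items.append(f"{table}.{col} AS {table}_{col}")
            else:
                # Unique name: keep it unaliased.
                items.append(f"{table}.{col}")
    return ", ".join(items)


# Column lists assumed from the queries in this issue.
tables = {
    "x": ["id1", "id2", "id3", "id4", "id5", "id6", "v1"],
    "small": ["id1", "id4", "v2"],
}
select_list = aliased_select_list(tables)
query = f"SELECT {select_list} FROM x INNER JOIN small ON x.id1 = small.id1"
```

The resulting `query` could then be passed to `ctx.sql(query).collect()` as in the test code. Since `v1` is not shared between the two tables it stays unaliased, so the `col("v1")` aggregate would not need to change.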
   
   **To Reproduce**
   Run the test code above with either of the two queries shown.
   
   **Expected behavior**
   I should be able to run either of the following
   ```
   ans = ctx.sql("SELECT x.id1, x.id2, x.id3, x.id4, small.id4, x.id5, x.id6, "
                 "x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1").collect()
   df = ctx.create_dataframe([ans])
   ```
   ```
   ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect()
   df = ctx.create_dataframe([ans])
   ```
   
   

