wegamekinglc opened a new issue, #1186:
URL: https://github.com/apache/datafusion-python/issues/1186
### Describe the bug
Hi team, I have encountered a performance issue when running the same query on a
big table with DataFusion compared with DuckDB.
I have simplified my case and reproduce the issue with the code below.
### To Reproduce
```python
import timeit
import numpy as np
import pyarrow as pa
import datafusion
from datafusion import SessionContext
import duckdb
print(duckdb.__version__)
print(datafusion.__version__)
# prepare data
batches = 100000
names = list("abcdefghijklmnopqrstuvwxyz")
names = [n + m for n in names for m in names]
names_array = pa.concat_arrays([pa.array(names)] * batches)
values_array = pa.concat_arrays(
    [pa.array(np.random.randint(1, 100, len(names))) for _ in range(batches)]
)
pa_table = pa.Table.from_arrays([names_array, values_array], names=["name", "value"])
# prepare query
sql = "select name, sum(value) as value FROM pa_table group by name;"
n_round = 10
# duckdb
elapsed = timeit.timeit('duckdb.sql(sql).to_arrow_table()', number=n_round, globals=globals())
duckdb_per_round = elapsed / n_round
# datafusion
ctx = SessionContext()
_ = ctx.from_arrow(pa_table, "pa_table")
elapsed = timeit.timeit('ctx.sql(sql).to_arrow_table()', number=n_round, globals=globals())
datafusion_per_round = elapsed / n_round
# result
print(f"{'duckdb':<12}: {duckdb_per_round * 1000:.2f}ms")
print(f"{'datafusion':<12}: {datafusion_per_round * 1000:.2f}ms")
```
The output looks like:
```text
1.3.1
47.0.0
duckdb      : 152.15ms
datafusion  : 1002.04ms
```
```
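The numbers above are a plain mean over `n_round` runs, so a slow first run (for example, one-off planning or lazy initialization) would be folded into the average. A minimal sketch of a more robust measurement, assuming a hypothetical helper `time_per_round_ms` that is not part of the original script: it runs untimed warm-up calls first and then reports the minimum and median per-round time.

```python
import statistics
import timeit


def time_per_round_ms(func, n_round=10, warmup=1):
    """Return (min, median) wall time per call of `func`, in milliseconds.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        func()
    # repeat=n_round with number=1 yields one sample per round
    # instead of a single summed total.
    samples = timeit.repeat(func, repeat=n_round, number=1)
    samples_ms = [s * 1000 for s in samples]
    return min(samples_ms), statistics.median(samples_ms)


if __name__ == "__main__":
    # Stand-in workload; in the script above this would be, e.g.,
    # lambda: ctx.sql(sql).to_arrow_table()
    best, median = time_per_round_ms(lambda: sum(range(100_000)), n_round=5)
    print(f"{'example':<12}: min {best:.2f}ms, median {median:.2f}ms")
```

In the reproduction script, the same helper could be called once with `lambda: duckdb.sql(sql).to_arrow_table()` and once with `lambda: ctx.sql(sql).to_arrow_table()` to check whether the gap persists after warm-up.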
### Expected behavior
_No response_
### Additional context
_No response_