nosterlu commented on issue #13747:
URL: https://github.com/apache/arrow/issues/13747#issuecomment-1271311187
Thank you @legout. DuckDB works really well, but Polars is struggling; maybe
I am doing something wrong.
Anyway, here is how it worked for me:
```python
# pyarrow 8.0.0
# duckdb 0.5.1
# polars 0.14.18
import duckdb
import polars as pl
from pyarrow.dataset import dataset

# fs is a pyarrow/fsspec filesystem object defined elsewhere
ib = dataset("install-base-from-vdw-standard/", filesystem=fs,
             partitioning="hive")
ib.count_rows()
# 1415259797
ib.schema
"""
bev: bool
market: int16
function_group: int32
part: int32
kdp: bool
kdp_accessory: bool
yearweek: int32
qty_part: int32
vehicle_type: int32
model_year: int32
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' +
1081
"""
def do_duckdb():
    # DuckDB picks up the pyarrow dataset `ib` from the local scope by name
    sql = """
        SELECT i.part,
               i.bev,
               i.market,
               kdp_accessory,
               yearweek,
               SUM(i.qty_part) AS qty_part_sum
        FROM ib i
        WHERE vehicle_type = 536
        GROUP BY
            i.part,
            i.bev,
            i.market,
            i.kdp_accessory,
            yearweek
    """
    conn = duckdb.connect(":memory:")
    result = conn.execute(sql)
    table = result.fetch_arrow_table()
    return table
def do_polar():
    table = (
        pl.scan_ds(ib)
        .filter(pl.col("vehicle_type") == 536)
        .groupby(["part", "bev", "market", "kdp_accessory", "yearweek"])
        .agg(pl.col("qty_part").sum())
        .collect()
        .to_arrow()
    )
    return table
%time table = do_duckdb()
# memory consumption increased temporarily by ~2 GB; 18.8 s

%time table = do_polar()
# memory consumption grew slowly to fill almost all memory (32 GB)
# before settling; 4 min 54 s
```