nosterlu commented on issue #13747:
URL: https://github.com/apache/arrow/issues/13747#issuecomment-1271311187
Thank you @legout. DuckDB works really well, but Polars is struggling; maybe
I am doing something wrong.
Anyway, here is how it worked for me:
```python
# pyarrow 8.0.0
# duckdb 0.5.1
# polars 0.14.18
import duckdb
import polars as pl
from pyarrow.dataset import dataset

# fs is a pyarrow/fsspec filesystem object defined elsewhere
ib = dataset("install-base-from-vdw-standard/", filesystem=fs,
             partitioning="hive")
ib.count_rows()
# 1415259797
ib.schema
"""
bev: bool
market: int16
function_group: int32
part: int32
kdp: bool
kdp_accessory: bool
yearweek: int32
qty_part: int32
vehicle_type: int32
model_year: int32
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' +
1081
"""
def do_duckdb():
    # DuckDB picks up the pyarrow dataset `ib` from the local scope by name
    sql = """
        SELECT i.part,
               i.bev,
               i.market,
               kdp_accessory,
               yearweek,
               SUM(i.qty_part) AS qty_part_sum
        FROM ib i
        WHERE vehicle_type = 536
        GROUP BY
            i.part,
            i.bev,
            i.market,
            i.kdp_accessory,
            yearweek
    """
    conn = duckdb.connect(":memory:")
    result = conn.execute(sql)
    table = result.fetch_arrow_table()
    return table
def do_polar():
    table = (
        pl.scan_ds(ib)
        .filter(pl.col("vehicle_type") == 536)
        .groupby(["part", "bev", "market", "kdp_accessory", "yearweek"])
        .agg(pl.col("qty_part").sum())
        .collect()
        .to_arrow()
    )
    return table
%time table = do_duckdb()
# memory consumption increased temporarily by ~2 GB; 18.8 s

%time table = do_polar()
# memory consumption grew slowly to fill almost all memory (32 GB)
# before settling; 4 min 54 s
```