Re: [I] [Python] How to do arrow table group by and split? [arrow]

via GitHub Thu, 19 Dec 2024 10:33:06 -0800


e10v commented on issue #14882:
URL: https://github.com/apache/arrow/issues/14882#issuecomment-2555518906


   Here is the alternative proposed by ChatGPT :)
   
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   
   # Example data
   table = pa.Table.from_pydict({
       "column_a": pa.array([1, 2, 1, 3, 2]),
       "column_b": pa.array(["x", "y", "z", "x", "y"])
   })
   
   group_col = "column_a"
   
   # 1. Sort indices by the grouping column
   sort_indices = pc.sort_indices(table, sort_keys=[(group_col, "ascending")])
   sorted_table = table.take(sort_indices)
   
   # 2. Identify boundaries where column_a changes
   col = sorted_table[group_col]
   # Compare each value with its predecessor; True marks the first row of a new group.
   # Note: a comparison involving a null yields null, which the loop below treats as
   # "no change", so this assumes the grouping column contains no nulls.
   changes = pc.not_equal(col[1:], col[:-1])
   
   # Convert boolean array to offsets
   # Start at 0
   offsets = [0]
   # Add 1 to shift indices so they match starts of groups
   for i, changed in enumerate(changes.to_pylist(), start=1):
       if changed:
           offsets.append(i)
   # Final boundary is the length of the table
   offsets.append(len(col))
   
   # 3. Slice groups
   for start, end in zip(offsets, offsets[1:]):
       subtable = sorted_table.slice(start, end - start)
       key = subtable[group_col][0].as_py()
       ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

