0x26res opened a new issue, #34238:
URL: https://github.com/apache/arrow/issues/34238

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   
   - I have a table with ~100k records and 4 chunks of various sizes `[32768, 
32768, 29599, 16692]`
   - key1 has high cardinality (~30k distinct values)
   - key2 has low cardinality (10 distinct values)
   
   
   | key1       | key2    |     value |
   |:-----------|:--------|----------:|
   | KEY1_12226 | KEY1_08 | 0.348599  |
   | KEY1_10214 | KEY1_08 | 0.954173  |
   | KEY1_26821 | KEY1_09 | 0.416615  |
   | KEY1_24557 | KEY1_06 | 0.883226  |
   | KEY1_27823 | KEY1_08 | 0.0127225 |
   
   I'm trying to do a groupby on (key1, key2) to get the sum of value. This works fine in general, but when my preprocessing misaligns the chunks across the table's columns, it segfaults.
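   For context, here is a minimal sketch (with a tiny stand-in table, not the original data) of what "misaligned chunks" means: replacing a single column with its combined chunks gives that column a different chunk layout than the rest of the table.

   ```python
   import pyarrow as pa

   # Build a table from two record batches, so every column starts out
   # with two aligned chunks.
   batches = [
       pa.record_batch([pa.array(["a", "b"]), pa.array([1.0, 2.0])], ["key", "value"]),
       pa.record_batch([pa.array(["a", "c"]), pa.array([3.0, 4.0])], ["key", "value"]),
   ]
   table = pa.Table.from_batches(batches)
   print(table["key"].num_chunks, table["value"].num_chunks)  # 2 2

   # Combining chunks of just one column collapses it to a single chunk
   # while the other column keeps its two -- the layouts are now misaligned.
   table = table.set_column(
       table.schema.get_field_index("value"),
       "value",
       table["value"].combine_chunks(),
   )
   print(table["key"].num_chunks, table["value"].num_chunks)  # 2 1
   ```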
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   KEYS_1 = [f"KEY1_{i:05d}" for i in range(30_000)]
   KEYS_2 = [f"KEY1_{i:02d}" for i in range(10)]
   SIDE = ["LEFT", "RIGHT"]
   
   
   def generate_table(sizes):
       batches = [
           pa.record_batch(
               [
                   np.random.choice(KEYS_1, size),
                   np.random.choice(KEYS_2, size),
                   np.random.rand(size),
               ],
               ["key1", "key2", "value"],
           )
           for size in sizes
       ]
       return pa.Table.from_batches(batches)
   
   
   table = generate_table([32768, 32768, 29599, 16692])
   
   # This works well:
   pa.TableGroupBy(table, ["key1", "key2"]).aggregate(
       [
           ["value", "sum"],
       ]
   )
   
   # This misaligns the chunks
   table = table.set_column(
       table.schema.get_field_index("value"),
       "value",
       table["value"].combine_chunks(),
   )
   
   print("HERE")
   pa.TableGroupBy(table, ["key1", "key2"]).aggregate(
       [
           ["value", "sum"],
       ]
   )  # segfault :-(
   print("NEVER THERE")
   
   ```
   
   It took me a while to get to the bottom of the problem. The size of the chunks and the cardinality of the keys seem to play an important role in whether it fails or not.
   
   My short-term workaround is to call `combine_chunks()` on the whole table before the groupby.
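   As a sketch of that workaround (again on a small stand-in table rather than the original data): `Table.combine_chunks()` rewrites every column into a single chunk, so all columns share the same layout again and the aggregation runs without crashing.

   ```python
   import pyarrow as pa

   # Stand-in table with deliberately misaligned chunks, as in the repro.
   batches = [
       pa.record_batch([pa.array(["a", "b"]), pa.array([1.0, 2.0])], ["key", "value"]),
       pa.record_batch([pa.array(["a", "c"]), pa.array([3.0, 4.0])], ["key", "value"]),
   ]
   table = pa.Table.from_batches(batches)
   table = table.set_column(
       table.schema.get_field_index("value"),
       "value",
       table["value"].combine_chunks(),
   )

   # Workaround: combine chunks of the *whole* table, so every column is
   # a single aligned chunk, then run the groupby.
   table = table.combine_chunks()
   result = pa.TableGroupBy(table, ["key"]).aggregate([("value", "sum")])
   sums = dict(zip(result["key"].to_pylist(), result["value_sum"].to_pylist()))
   print(sums)
   ```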
   
   ### Component(s)
   
   Python

