amoeba commented on issue #36295: URL: https://github.com/apache/arrow/issues/36295#issuecomment-2374573245
Hey @zanmato1984, I think you're right that this is now fixed on main. I can reproduce the issue with PyArrow 17 but with PyArrow built from main I no longer get the data corruption.

```
>>> pyarrow.__version__
'18.0.0.dev372+gfc2e018ef1.d20240925'
```

And when I run the script with `COLUMN_COUNT = 5`:

```
❯ python repro.py
-------------------- ORIGINAL --------------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,99999995,99999996,99999997,99999998,99999999]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
4: [[4,4,4,4,4,...,4,4,4,4,4]]
-------------- GROUP_BY / AGGREGATE --------------
pyarrow.Table
index: uint64
0: uint64
1: uint64
2: uint64
3: uint64
4: uint64
----
index: [[0,1,2,3,4,...,98566139,98566140,98566141,98566142,98566143]]
0: [[0,0,0,0,0,...,0,0,0,0,0]]
1: [[1,1,1,1,1,...,1,1,1,1,1]]
2: [[2,2,2,2,2,...,2,2,2,2,2]]
3: [[3,3,3,3,3,...,3,3,3,3,3]]
4: [[4,4,4,4,4,...,4,4,4,4,4]]
```
