ryan-williams opened a new issue, #39399: URL: https://github.com/apache/arrow/issues/39399
### Describe the bug, including details regarding any error messages, version, and platform.

Full repro: [runsascoded/parquet-diff-test](https://github.com/runsascoded/parquet-diff-test)

[This GitHub Action](https://github.com/runsascoded/parquet-diff-test/actions/runs/7366550240) generated a 0-row, 1-col Parquet file using each {engine, codec, OS}:
- {`pyarrow`, `fastparquet`}
- {`snappy`, `gzip`, `brotli`, `lz4`, `zstd`}
- {macOS, Windows, Ubuntu}

All `fastparquet` Parquets are identical across OSes (for each codec), but **`pyarrow` files differ, especially on macOS.**

## pyarrow
- All macOS Parquets differ (by ≈600 bytes in the middle, beginning around `0x2b4`)
- Windows/Ubuntu differ under `gzip` (by one header byte)

|        | Ubuntu | Windows | macOS |
|-------:|-------:|--------:|------:|
| brotli | ✅ | ✅ | ❌ |
| gzip   | ⚠️ | ⚠️ | ❌ |
| lz4    | ✅ | ✅ | ❌ |
| snappy | ✅ | ✅ | ❌ |
| zstd   | ✅ | ✅ | ❌ |

Full diffs:
- [`ubuntu..macos`](https://github.com/runsascoded/parquet-diff-test/compare/ubuntu..macos)
- [`ubuntu..windows`](https://github.com/runsascoded/parquet-diff-test/compare/ubuntu..windows)

Examples:

<details>
<summary><code>git diff ubuntu..macos -- out/pyarrow/snappy/xxd.txt</code></summary>

```diff
 00000280: 7741 4141 4145 4141 6741 4367 4141 414e  wAAAAEAAgACgAAAN
 00000290: 7742 4141 4145 4141 4141 4151 4141 4141  wBAAAEAAAAAQAAAA
 000002a0: 7741 4141 4149 4141 7741 4241 4149 4141  wAAAAIAAwABAAIAA
-000002b0: 6741 4141 4149 4141 4141 4541 4141 4141  gAAAAIAAAAEAAAAA
-000002c0: 5941 4141 4277 5957 356b 5958 4d41 414b  YAAABwYW5kYXMAAK
-000002d0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
-000002e0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
-000002f0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
-00000300: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
-00000310: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
-00000320: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
-00000330: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
-00000340: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
-00000350: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
-00000360: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
-00000370: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
-00000380: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
-00000390: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-000003a0: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
-000003b0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
-000003c0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
-000003d0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
-000003e0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
-000003f0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
-00000400: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
-00000410: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
-00000420: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
-00000430: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
-00000440: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-00000450: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
-00000460: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
-00000470: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
-00000480: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
-00000490: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
-000004a0: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
-000004b0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
-000004c0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
-000004d0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
-000004e0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
-000004f0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
-00000500: 3478 4c6a 5169 6651 4141 4151 4141 4142  4xLjQifQAAAQAAAB
+000002b0: 6741 4141 4330 4151 4141 4241 4141 414b  gAAAC0AQAABAAAAK
+000002c0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
+000002d0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
+000002e0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
+000002f0: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
+00000300: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
+00000310: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
+00000320: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
+00000330: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
+00000340: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
+00000350: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
+00000360: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
+00000370: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
+00000380: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000390: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
+000003a0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
+000003b0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
+000003c0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
+000003d0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
+000003e0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
+000003f0: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
+00000400: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
+00000410: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
+00000420: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
+00000430: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000440: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
+00000450: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
+00000460: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
+00000470: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
+00000480: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
+00000490: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
+000004a0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
+000004b0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
+000004c0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
+000004d0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
+000004e0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
+000004f0: 3478 4c6a 5169 6651 4141 4267 4141 4148  4xLjQifQAABgAAAH
+00000500: 4268 626d 5268 6377 4141 4151 4141 4142  BhbmRhcwAAAQAAAB
 00000510: 5141 4141 4151 4142 5141 4341 4147 4141  QAAAAQABQACAAGAA
 00000520: 6341 4441 4141 4142 4141 4541 4141 4141  cADAAAABAAEAAAAA
 00000530: 4141 4151 4951 4141 4141 4841 4141 4141  AAAQIQAAAAHAAAAA
```
</details>

<details>
<summary><code>git diff ubuntu..windows -- out/pyarrow/gzip/xxd.txt</code></summary>

```diff
 00000000: 5041 5231 1504 1500 1528 4c15 0015 0012  PAR1.....(L.....
-00000010: 0000 1f8b 0800 0000 0000 0003 0300 0000  ................
+00000010: 0000 1f8b 0800 0000 0000 000a 0300 0000  ................
 00000020: 0000 0000 0000 264c 1c15 0419 2500 0619  ......&L....%...
 00000030: 1801 6115 0416 0016 1c16 4426 0026 0829  ..a.......D&.&.)
 00000040: 1c15 0415 0015 0200 0000 1504 192c 3500  .............,5.
```
</details>

I've not found a tool that can tell me why the files differ (having tried [`parquet-tools`](https://github.com/hangxie/parquet-tools), [`parquet2json`](https://github.com/jupiter/parquet2json), and pyarrow's `ParquetFile.metadata`).

### fastparquet
In contrast, `fastparquet` Parquets match across all OS's:

|        | Ubuntu | Windows | macOS |
|-------:|-------:|--------:|------:|
| brotli | ✅ | ✅ | ✅ |
| gzip   | ✅ | ✅ | ✅ |
| lz4    | ✅ | ✅ | ✅ |
| snappy | ✅ | ✅ | ✅ |
| zstd   | ✅ | ✅ | ✅ |

## Questions
1. Why are the files different on macOS vs. Ubuntu?
2. Is there a way to generate identical files across OS's, with `pyarrow`?
3. Are there tools that can help parse/display the differences?

### Component(s)

Parquet, Python
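
As a partial answer to question 3, here is a minimal sketch that inspects the two diffs above using only the Python standard library. It rests on several assumptions that are not confirmed anywhere in this report: the helper names `gzip_os_byte` and `decode_b64_fragment` are hypothetical; the differing gzip byte is read as the OS field of the gzip member header (RFC 1952 places it 9 bytes after the `1f 8b` magic); the text in the snappy diff is treated as base64 (plausibly the footer's `ARROW:schema` entry, but that is a guess); and the `skip=2` realignment was found by trial, as the offset at which the decoded bytes become readable JSON.

```python
import base64

# First two xxd rows of the gzip file, copied from the ubuntu..windows diff.
ROW0 = "5041 5231 1504 1500 1528 4c15 0015 0012 "
UBUNTU_GZIP = bytes.fromhex(ROW0 + "0000 1f8b 0800 0000 0000 0003 0300 0000")
WINDOWS_GZIP = bytes.fromhex(ROW0 + "0000 1f8b 0800 0000 0000 000a 0300 0000")

# ASCII columns copied from the ubuntu..macos snappy diff.
UBUNTU_HEAD = "gAAAAIAAAAEAAAAA" "YAAABwYW5kYXMAAK" "YBAAB7ImluZGV4X2"  # 0x2b0-0x2d0
MACOS_HEAD = "gAAAC0AQAABAAAAK" "YBAAB7ImluZGV4X2"                      # 0x2b0-0x2c0
MACOS_TAIL = "4xLjQifQAABgAAAH" "BhbmRhcwAAAQAAAB"                      # 0x4f0-0x500


def gzip_os_byte(buf: bytes) -> int:
    """Return the OS field of the first gzip member found in `buf`.

    RFC 1952 lays the header out as ID1 ID2 CM FLG MTIME(4) XFL OS, so the
    OS field sits 9 bytes after the magic. Searching for the magic like this
    is naive (the two bytes could occur by chance) and is only a sketch.
    """
    start = buf.index(b"\x1f\x8b")
    return buf[start + 9]


def decode_b64_fragment(text: str, skip: int = 2) -> bytes:
    """Decode a base64 fragment cut out of the middle of a longer string.

    `skip` drops leading characters so decoding starts on a 4-character
    group boundary; any incomplete trailing group is dropped as well.
    """
    body = text[skip:]
    body = body[: len(body) - len(body) % 4]
    return base64.b64decode(body)


if __name__ == "__main__":
    # RFC 1952 assigns 3 to Unix. Reading 10 as "zlib built for Windows" is
    # an assumption, not something stated in this report.
    print("gzip OS byte, Ubuntu :", gzip_os_byte(UBUNTU_GZIP))
    print("gzip OS byte, Windows:", gzip_os_byte(WINDOWS_GZIP))

    # Where does the "pandas" key sit relative to its JSON value?
    print("Ubuntu head:", decode_b64_fragment(UBUNTU_HEAD))
    print("macOS head :", decode_b64_fragment(MACOS_HEAD))
    print("macOS tail :", decode_b64_fragment(MACOS_TAIL))
```

If this reading holds, the Windows/Ubuntu `gzip` difference is confined to one header byte that records the writing platform, and in the rows shown the macOS file carries the same `pandas` key and JSON value as the Ubuntu file, only with the key string written after the value instead of before it. Neither observation explains why the ordering differs.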
