geserdugarov commented on issue #12133: URL: https://github.com/apache/hudi/issues/12133#issuecomment-2437223439
@ad1happy2go I've set up a local Spark 3.5.3 cluster and prepared a test that reproduces this bug using PySpark. The script is available here: https://github.com/geserdugarov/test-hudi-issues/blob/main/HUDI-8394/write-COW-get-MOR.py

After

```SQL
INSERT INTO cow_or_mor VALUES (5, 10);
INSERT INTO cow_or_mor VALUES (9, 30);
```

for

```SQL
SELECT * FROM cow_or_mor;
```

I got:

```Text
('5', '', '00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet', 5, 10)
```

We see only one row. The second one, with `id=9`, is missing because it was written to a log file, even though the table is set to COW:

```Bash
tree -a /tmp/write-COW-get-MOR
# .
# ├── 00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet
# ├── .00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet.crc
# ├── .00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28
# ├── ..00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28.crc
```
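The same check can be done without `tree` by scanning the table directory for log files. This is only a minimal sketch, not part of the linked script: the helper name `find_log_files` is hypothetical, the regular expression is an assumption based solely on the log file name in the listing above, and the `.hoodie` directory is skipped on the assumption that Hudi's own metadata there may legitimately contain log files.

```python
import os
import re

# Assumption: log files are named ".<fileId>_<instant>.log.<version>_<writeToken>",
# as in the listing above. The ".crc" checksum siblings do not match this pattern.
LOG_FILE_RE = re.compile(r"^\..+_\d+\.log\.\d+_[\w-]+$")


def find_log_files(table_path):
    """Return sorted names of files under table_path that look like Hudi log files."""
    found = []
    for _dirpath, dirnames, filenames in os.walk(table_path):
        # Assumption: skip Hudi's internal ".hoodie" directory.
        dirnames[:] = [d for d in dirnames if d != ".hoodie"]
        found.extend(name for name in filenames if LOG_FILE_RE.match(name))
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the listing above; an empty result is expected for a COW table.
    print(find_log_files("/tmp/write-COW-get-MOR"))
```

For a table that really behaves as COW, the printed list should be empty; any entry points to the behaviour described in this comment.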
