Re: [I] [SUPPORT] MOR table behavior for Spark Bulk insert to COW [hudi]

via GitHub Fri, 25 Oct 2024 01:40:44 -0700


geserdugarov commented on issue #12133:
URL: https://github.com/apache/hudi/issues/12133#issuecomment-2437223439


   @ad1happy2go I set up a local Spark 3.5.3 cluster and wrote a PySpark test 
that reproduces this bug. The script is available here:
   
https://github.com/geserdugarov/test-hudi-issues/blob/main/HUDI-8394/write-COW-get-MOR.py
   
   After
   ```SQL
   INSERT INTO cow_or_mor VALUES (5, 10);
   INSERT INTO cow_or_mor VALUES (9, 30);
   ```
   for
   ```SQL
   SELECT * FROM cow_or_mor;
   ```
   I got:
   ```Text
   ('5', '', '00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet', 5, 10)
   ```
   Only one row is returned. The second row, with `id=9`, is missing because it 
was written to a log file, even though the table is configured as COW:
   ```Bash
   tree -a /tmp/write-COW-get-MOR
   # .
   # ├── 00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet
   # ├── .00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet.crc
   # ├── .00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28
   # ├── ..00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28.crc
   ```
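
The same check can be scripted instead of reading the `tree` output by eye. The following is a minimal sketch, not part of the linked reproduction script: the helper names `find_log_files` and `list_table_log_files` are hypothetical, and the file-name pattern is inferred only from the listing above (a leading dot, then `.log.<version>` followed by a write token), so it may not cover every Hudi release.

```python
import os
import re

# Pattern inferred from the listing above (an assumption, not taken from Hudi):
# ".<fileId>_<instant>.log.<version>_<writeToken>"
LOG_FILE_PATTERN = re.compile(r"^\..+\.log\.\d+(_.+)?$")


def find_log_files(names):
    """Return the names that look like Hudi log files, skipping .crc checksums."""
    return [
        name
        for name in names
        if LOG_FILE_PATTERN.match(name) and not name.endswith(".crc")
    ]


def list_table_log_files(table_path):
    """Walk a table directory and collect log-like files outside .hoodie."""
    found = []
    for root, dirs, files in os.walk(table_path):
        # Skip the metadata directory; only data files are of interest here.
        dirs[:] = [d for d in dirs if d != ".hoodie"]
        found.extend(os.path.join(root, name) for name in find_log_files(files))
    return sorted(found)


if __name__ == "__main__":
    # A non-empty result for a COW table would match the behavior reported here.
    for path in list_table_log_files("/tmp/write-COW-get-MOR"):
        print(path)
```

For a COW table this list would be expected to stay empty after plain inserts, so any entry it prints points at the behavior described in this comment.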


