yizhenglu opened a new issue, #10716:
URL: https://github.com/apache/hudi/issues/10716

   **_Tips before filing an issue_**
   
   **Describe the problem you faced**
   
   Background:
   Currently I have around 100 MB of data each day (batch process). I use the delete operation with a broadcast join in Spark to delete the unused data (the data is updated the next day). To avoid a global index scan, I insert the delta data into a new partition.
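
   The daily delete pass described above can be sketched roughly as follows. This is a minimal sketch, not the actual job: the option keys are standard Hudi write configs, but the non-global `SIMPLE` index choice, the column names, and the paths are assumptions.

```python
# Minimal sketch of the daily delete pass (assumed names).
# Only the option keys are standard Hudi configs; everything else is hypothetical.
delete_opts = {
    # issue deletes instead of upserts/inserts
    "hoodie.datasource.write.operation": "delete",
    # a non-global index keeps key lookups partition-scoped,
    # avoiding the global index scan mentioned above
    "hoodie.index.type": "SIMPLE",
}

# With a live SparkSession, the stale keys found via a broadcast join
# would be written back roughly like this (shown as comments, since it
# needs a running cluster):
#
#   from pyspark.sql.functions import broadcast
#   stale = existing_df.join(broadcast(daily_df), "record_key", "left_semi")
#   (stale.write.format("hudi")
#         .options(**delete_opts)
#         .mode("append")
#         .save("s3://bucket/table"))
```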
   
   **Parquet size Issue:**
   Setting hoodie.shuffle.parallelism=2 does not change the **parquet file size** when using **insert**: it still generates 40 parquet files for one (daily) partition, with a total size of 180 MB.
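
   As far as I understand, the shuffle parallelism is not the knob that controls output file size: for insert/upsert, Hudi sizes parquet files via its max-file-size and small-file-limit configs. A hedged sketch of the relevant options (key names are standard Hudi configs; the values are illustrative):

```python
# File-sizing configs that govern parquet sizes for insert/upsert
# (key names are standard Hudi configs; values are illustrative).
sizing_opts = {
    # target max size per parquet file in bytes (Hudi's default is ~120 MB)
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    # files below this size are considered "small" and padded by new inserts
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),
    # parallelism alone does not determine file sizes
    "hoodie.insert.shuffle.parallelism": "2",
}
```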
   
   **Athena Issue:**
   Due to the large number of small parquet files, I enabled inline clustering with max commits = 1 as a test.
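
   A sketch of that inline-clustering setup (key names are standard Hudi clustering configs; the byte targets are illustrative, matching the 120 MB / 64 MB goals in the expected behavior below):

```python
# Inline clustering configs (key names are standard Hudi configs;
# size values are illustrative).
clustering_opts = {
    "hoodie.clustering.inline": "true",
    # trigger clustering after every commit, as in the test above
    "hoodie.clustering.inline.max.commits": "1",
    # aim for ~120 MB clustered output files
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(120 * 1024 * 1024),
    # treat files under ~64 MB as candidates for clustering
    "hoodie.clustering.plan.strategy.small.file.limit": str(64 * 1024 * 1024),
}
```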
   
   Athena then raises the following error:
   Generic_INTERNAL_ERROR: Can not read value at 0 in block -1 in 
S3/XXXXXX/XXX/XXXX/XXXXX/date=20XX-XX-XX.parquet.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Hudi Options like below:
   The table DDL (_rt, _ro) was created automatically when the hive sync tool was enabled.
   
![00a427c6f6f4d8e2a73b07ccf75bdf5](https://github.com/apache/hudi/assets/46934296/ca8fb5d0-2905-4292-8ae1-3309b23250a8)
   Record keys are strings; partition keys are dates in yyyy-mm-dd format.
   2.
   
![e3471b52ec345815027c1dd2a759cce](https://github.com/apache/hudi/assets/46934296/8d96a9ce-fc89-41a7-9d3c-4be5c826fe4d)
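
   Since the screenshots aren't legible as text, here is a rough sketch of the table setup described in step 1: a merge-on-read table (inferred from the auto-created _rt/_ro tables) with hive sync enabled. Key names are standard Hudi configs; the column names are hypothetical.

```python
# Rough sketch of the table setup (key names are standard Hudi configs;
# column names are hypothetical, inferred from the description above).
table_opts = {
    # _rt/_ro tables are created by hive sync for merge-on-read tables
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # record keys are strings; partition keys are yyyy-mm-dd dates
    "hoodie.datasource.write.recordkey.field": "record_key",
    "hoodie.datasource.write.partitionpath.field": "date",
    # hive sync tool creates/updates the table DDL automatically
    "hoodie.datasource.hive_sync.enable": "true",
}
```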
   
   
   **Expected behavior**
   
   I would like each partition to end up with either one parquet file of around 120 MB, or two parquet files of around 64 MB each.
   
   
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   
   * Spark version : 3.4.1
   
   * Hive version : 0.13.1
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : NO
   
   
   
   
   

