RajasekarSribalan opened a new issue #1939:
URL: https://github.com/apache/hudi/issues/1939


   **Describe the problem you faced**
   
   1. As per the Hudi documentation, the parquet file size should be governed by the `limitFileSize` configuration. We use the default file size (120MB), but we see many parquet files being created with sizes exceeding 1GB:
   
   384.8 M  1.1 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_77-585-137386_20200809071021.parquet
   413.2 M  1.2 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_79-655-153282_20200809073207.parquet
   497.5 M  1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_12-585-137353_20200809071021.parquet
   525.1 M  1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_13-655-153247_20200809073207.parquet
   405.4 M  1.2 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_43-655-153260_20200809073207.parquet
   381.2 M  1.1 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_45-585-137368_20200809071021.parquet
   543.2 M  1.6 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_21-585-137359_20200809071021.parquet
   568.6 M  1.7 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_24-655-153249_20200809073207.parquet
   941.8 M  2.8 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_97-585-137394_20200809071021.parquet
   1012.1 M  3.0 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_99-655-153289_20200809073207.parquet
   481.7 M  1.4 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_91-585-137390_20200809071021.parquet
   522.4 M  1.5 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_93-655-153285_20200809073207.parquet
   776.8 M  2.3 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_64-655-153271_20200809073207.parquet
   743.1 M  2.2 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_65-585-137378_20200809071021.parquet
   352.4 M  1.0 G  /user/xxxx/db290d35-277e-4c96-86eb-5036784ac437-0_76-655-153297_20200809073207.parquet
   437.4 M  1.3 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-585-137397_20200809071021.parquet
   462.1 M  1.4 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-655-153292_20200809073207.parquet
   710.7 M  2.1 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_61-655-153270_20200809073207.parquet
   671.8 M  2.0 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_63-585-137374_20200809071021.parquet
   406.3 M  1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-585-137358_20200809071021.parquet
   426.3 M  1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-655-153248_20200809073207.parquet
   382.9 M  1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_42-655-153276_20200809073207.parquet
   363.0 M  1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_43-585-137366_20200809071021.parquet
   
   We use a COW table and stream the data from Kafka into the Hudi table via Spark.
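
   For reference, these are the write configs we understand control file sizing on COW tables (property names taken from the Hudi docs; the values shown are the documented defaults, stated here as assumptions rather than verified settings on our cluster):

   ```
   # Hudi file-sizing configs for COW writes (values are documented defaults,
   # listed as assumptions - we have not overridden them)
   hoodie.parquet.max.file.size=125829120        # target max file size, ~120MB
   hoodie.parquet.small.file.limit=104857600     # files under ~100MB are bin-packing candidates
   hoodie.copyonwrite.record.size.estimate=1024  # bytes-per-record estimate used before commit stats exist
   ```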
   
   2. Hudi jobs are failing with "Reason: Container killed by YARN for exceeding 
memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead." even though we give higher executor memory. 
We always hit this during the Hudi parquet write/rewrite stage. We gave only 1 
core with 30GB per executor and still get the error; increasing executor memory 
does not solve it. Please advise: how can we solve this issue? Is it because 
we have parquet files with huge sizes?
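   
   For completeness, this is the shape of the memory settings we have been trying (values illustrative, not our exact submit command). Since the YARN message points at off-heap overhead rather than heap, we assume `memoryOverhead` is the knob to raise alongside, or instead of, executor memory:
   
   ```
   # Illustrative spark-submit memory settings (not our exact command).
   # In Spark 2.2, spark.yarn.executor.memoryOverhead is specified in MB.
   spark-submit \
     --conf spark.executor.memory=24g \
     --conf spark.yarn.executor.memoryOverhead=6144 \
     --conf spark.executor.cores=1 \
     ...
   ```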
   
   **Expected behavior**
   
   Parquet size should be less than or equal to 120MB
   
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * Spark version : 2.2.0
   
   * Hive version : 1.X
   
   * Hadoop version : 2.7
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
    
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
