RajasekarSribalan opened a new issue #1939:
URL: https://github.com/apache/hudi/issues/1939
**Describe the problem you faced**

1. As per the Hudi documentation, the parquet file size is decided by the `limitFileSize` configuration. We use the default file size (120 MB), but we see many parquet files being created with sizes exceeding 1 GB (first column is file size, second is size including HDFS replication):
```
384.8 M   1.1 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_77-585-137386_20200809071021.parquet
413.2 M   1.2 G  /user/xxxx/a2d996f2-5846-48a2-ab93-aa3630041962-0_79-655-153282_20200809073207.parquet
497.5 M   1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_12-585-137353_20200809071021.parquet
525.1 M   1.5 G  /user/xxxx/c03cafef-ebbf-4880-90b1-9ebf0311a333-0_13-655-153247_20200809073207.parquet
405.4 M   1.2 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_43-655-153260_20200809073207.parquet
381.2 M   1.1 G  /user/xxxx/c077189b-f9f2-41f6-b6d2-47f3dd85079c-0_45-585-137368_20200809071021.parquet
543.2 M   1.6 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_21-585-137359_20200809071021.parquet
568.6 M   1.7 G  /user/xxxx/c45dcb29-81a0-4844-974c-0672ac347f61-0_24-655-153249_20200809073207.parquet
941.8 M   2.8 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_97-585-137394_20200809071021.parquet
1012.1 M  3.0 G  /user/xxxx/c5762f35-4e71-426c-9de9-e1a4ebda484e-0_99-655-153289_20200809073207.parquet
481.7 M   1.4 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_91-585-137390_20200809071021.parquet
522.4 M   1.5 G  /user/xxxx/c69df710-f28b-4f29-b489-c8733a171d83-0_93-655-153285_20200809073207.parquet
776.8 M   2.3 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_64-655-153271_20200809073207.parquet
743.1 M   2.2 G  /user/xxxx/d5b13ebe-6f45-4dc4-bb0c-5fe43aa77cf1-0_65-585-137378_20200809071021.parquet
352.4 M   1.0 G  /user/xxxx/db290d35-277e-4c96-86eb-5036784ac437-0_76-655-153297_20200809073207.parquet
437.4 M   1.3 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-585-137397_20200809071021.parquet
462.1 M   1.4 G  /user/xxxx/de3d21ac-3f4a-4cde-8cc1-c71b1b45f356-0_104-655-153292_20200809073207.parquet
710.7 M   2.1 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_61-655-153270_20200809073207.parquet
671.8 M   2.0 G  /user/xxxx/e72f8dc8-0aa3-49bf-9c3b-1b550ee0b25e-0_63-585-137374_20200809071021.parquet
406.3 M   1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-585-137358_20200809071021.parquet
426.3 M   1.2 G  /user/xxxx/e8e9d6b2-7680-4069-93c3-422887d6f33f-0_14-655-153248_20200809073207.parquet
382.9 M   1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_42-655-153276_20200809073207.parquet
363.0 M   1.1 G  /user/xxxx/fb0442b4-6739-47c5-a462-1da4c4482d24-0_43-585-137366_20200809071021.parque
```
We use a COW table and stream data from Kafka into the Hudi table via Spark.
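For reference, here is a minimal sketch of where that file-size limit would be passed, assuming a pipeline like ours that writes micro-batches through the Hudi Spark DataSource API. The table name, path, and key/ordering fields are hypothetical, and key names follow the current Hudi docs (`hoodie.parquet.max.file.size` is the storage-config key behind `limitFileSize`; older releases may spell some datasource keys differently):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: write one micro-batch of the Kafka stream to a COW table,
// pinning the target base-file size to the 120 MB default.
def writeBatch(batch: DataFrame, targetPath: String): Unit = {
  batch.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "events")                         // hypothetical table name
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")       // hypothetical key field
    .option("hoodie.datasource.write.precombine.field", "ts")      // hypothetical ordering field
    // limitFileSize: target max size of a base parquet file, in bytes
    .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)
    // files below this size are candidates to receive new inserts
    .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString)
    .mode(SaveMode.Append)
    .save(targetPath)
}
```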
2. Hudi jobs are failing with "Reason: Container killed by YARN for exceeding memory limits. 30.3 GB of 30 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead." even though we give higher executor memory. We always hit this during the Hudi parquet write/rewrite stage. We gave only 1 core with 30 GB per executor and still get the error; increasing the executor memory does not solve it. Please advise: how can we solve this issue? Is it because our parquet files are so large?
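For context, here is a hedged illustration of the split that error message suggests: rather than giving the whole 30 GB container to the JVM heap, part of it is reserved as YARN overhead to absorb the off-heap buffers the parquet writer allocates. The values below are illustrative, not a tuning recommendation:

```scala
import org.apache.spark.SparkConf

// Illustrative split of a ~30 GB YARN container: executor heap plus
// memoryOverhead must fit within what YARN grants the container.
val conf = new SparkConf()
  .set("spark.executor.cores", "1")
  .set("spark.executor.memory", "24g")
  .set("spark.yarn.executor.memoryOverhead", "6144") // in MB on Spark 2.x
```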
**Expected behavior**

Parquet file sizes should be less than or equal to 120 MB.
**Environment Description**
* Hudi version : 0.5.2
* Spark version : 2.2.0
* Hive version : 1.X
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : No