srinikandi opened a new issue #5047:
URL: https://github.com/apache/hudi/issues/5047
Hi
I have been using Apache Hudi Connector for Glue (hudi 0.90 version) and
facing small file creation problem while inserting data into a hudi table. The
input file is about 17 GB with 313 parquet part files. Each averaging around 70
mb. When I try to insert the data into Hudi table with overwrite option, this
ends up creating some 7000 plus parquet part files, each with 6.5 MB. I did
utilize the small file size and max file size parameters while writing.
here is the config that I used . The intention was to create file sizes
between 60 - 80 MB.
common_config = {
"hoodie.datasource.write.hive_style_partitioning": "true",
"className": "org.apache.hudi",
"hoodie.datasource.hive_sync.use_jdbc": "false",
"hoodie.datasource.write.recordkey.field": "C1,C2,C3,C4",
"hoodie.datasource.write.precombine.field": "",
"hoodie.table.name": "TEST_TABLE",
"hoodie.datasource.hive_sync.database": "TEST_DATABASE_RAW",
"hoodie.datasource.hive_sync.table": "TEST_TABLE",
"hoodie.datasource.hive_sync.enable": "true",
"hoodie.datasource.write.partitionpath.field": "",
"hoodie.datasource.hive_sync.support_timestamp": "true",
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.parquet.small.file.limit": "62914560",
"hoodie.parquet.max.file.limit": "83886080",
"hoodie.parquet.block.size": "62914560",
"hoodie.insert.shuffle.parallelism": "5",
"hoodie.datasource.write.operation": "insert"
}
full_ref_files is list with all of the parquet part file from the input
folder.
full_load_df = spark.read.parquet(*full_ref_files)
full_load_df.write.format("hudi").options(**conf).mode("overwrite").save(raw_table_path)
Below are the spark logs, where the last line shows that it create some 1000
plus partitions while writing to Hudi table.
Any insights are deeply appreciated.
2022-03-15T06:42:23.502-05:00 | [Stage 1 (javaToPython at
NativeMethodAccessorImpl.java:0):> (0 + 113) / 116]
-- | --
| 2022-03-15T06:42:25.396-05:00 | [Stage 1 (javaToPython at
NativeMethodAccessorImpl.java:0):> (9 + 107) / 116]
| 2022-03-15T06:42:26.894-05:00 | [Stage 1 (javaToPython at
NativeMethodAccessorImpl.java:0):==> (100 + 16) / 116]
| 2022-03-15T06:42:28.379-05:00 | [Stage 3 (collect at
/tmp/stage-to-raw-etl-glue-job-poc-1.py:105):()(83 + 33) / 116]
| 2022-03-15T06:43:06.306-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(0 + 116) / 117]
| 2022-03-15T06:43:12.079-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(1 + 116) / 117]
| 2022-03-15T06:43:19.568-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117]
| 2022-03-15T06:43:22.290-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(3 + 114) / 117]
| 2022-03-15T06:43:23.907-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117]
| 2022-03-15T06:43:25.782-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(5 + 112) / 117]
| 2022-03-15T06:43:27.302-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(10 + 107) / 117]
| 2022-03-15T06:43:31.700-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(11 + 106) / 117]
| 2022-03-15T06:43:35.656-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(12 + 105) / 117]
| 2022-03-15T06:44:05.906-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117]
| 2022-03-15T06:44:15.512-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(13 + 104) / 117]
| 2022-03-15T06:44:20.853-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(16 + 101) / 117]
| 2022-03-15T06:44:53.649-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117]
| 2022-03-15T06:45:08.909-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(17 + 100) / 117]
| 2022-03-15T06:45:14.894-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(19 + 98) / 117]
| 2022-03-15T06:45:26.452-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(20 + 97) / 117]
| 2022-03-15T06:45:36.321-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117]
| 2022-03-15T06:45:38.712-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(21 + 96) / 117]
| 2022-03-15T06:45:41.935-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(23 + 94) / 117]
| 2022-03-15T06:45:43.177-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(25 + 92) / 117]
| 2022-03-15T06:45:48.221-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(26 + 91) / 117]
| 2022-03-15T06:46:00.568-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117]
| 2022-03-15T06:46:06.605-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(28 + 89) / 117]
| 2022-03-15T06:46:10.268-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(29 + 88) / 117]
| 2022-03-15T06:46:12.260-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(31 + 86) / 117]
| 2022-03-15T06:46:15.381-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(33 + 84) / 117]
| 2022-03-15T06:46:18.751-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(34 + 83) / 117]
| 2022-03-15T06:46:20.892-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117]
| 2022-03-15T06:46:22.348-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(36 + 81) / 117]
| 2022-03-15T06:46:23.410-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(38 + 79) / 117]
| 2022-03-15T06:46:25.525-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(40 + 77) / 117]
| 2022-03-15T06:46:27.795-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(45 + 72) / 117]
| 2022-03-15T06:46:29.397-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(46 + 71) / 117]
| 2022-03-15T06:46:30.493-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(49 + 68) / 117]
| 2022-03-15T06:46:31.580-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(50 + 67) / 117]
| 2022-03-15T06:46:33.033-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(53 + 64) / 117]
| 2022-03-15T06:46:34.046-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(59 + 58) / 117]
| 2022-03-15T06:46:35.666-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(64 + 53) / 117]
| 2022-03-15T06:46:36.910-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(66 + 51) / 117]
| 2022-03-15T06:46:38.815-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(68 + 49) / 117]
| 2022-03-15T06:46:40.080-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(69 + 48) / 117]
| 2022-03-15T06:46:41.088-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(72 + 45) / 117]
| 2022-03-15T06:46:42.917-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(76 + 41) / 117]
| 2022-03-15T06:46:44.805-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(80 + 37) / 117]
| 2022-03-15T06:46:46.980-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(84 + 33) / 117]
| 2022-03-15T06:46:48.120-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(86 + 31) / 117]
| 2022-03-15T06:46:49.728-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(89 + 28) / 117]
| 2022-03-15T06:46:50.893-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(90 + 27) / 117]
| 2022-03-15T06:46:52.211-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(93 + 24) / 117]
| 2022-03-15T06:46:53.544-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(97 + 20) / 117]
| 2022-03-15T06:46:54.871-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(102 + 15) / 117]
| 2022-03-15T06:46:55.892-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(105 + 12) / 117]
| 2022-03-15T06:46:58.431-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(107 + 10) / 117]
| 2022-03-15T06:46:59.667-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(110 + 7) / 117]
| 2022-03-15T06:47:02.750-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(113 + 4) / 117]
| 2022-03-15T06:47:14.859-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(114 + 3) / 117]
| 2022-03-15T06:47:17.427-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(116 + 1) / 117]
| 2022-03-15T06:47:21.640-05:00 | [Stage 6 (countByKey at
BaseSparkCommitActionExecutor.java:175):()(117 + 0) / 117]
| 2022-03-15T06:47:27.011-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (0 + 117) / 117]
| 2022-03-15T06:47:28.283-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (2 + 115) / 117]
| 2022-03-15T06:47:29.508-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (6 + 111) / 117]
| 2022-03-15T06:47:30.576-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):()(10 + 107) / 117]
| 2022-03-15T06:47:32.991-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):()(15 + 102) / 117]
| 2022-03-15T06:47:34.455-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (18 + 99) / 117]
| 2022-03-15T06:47:35.560-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (23 + 94) / 117]
| 2022-03-15T06:47:36.575-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (29 + 88) / 117]
| 2022-03-15T06:47:37.689-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (41 + 76) / 117]
| 2022-03-15T06:47:38.788-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (64 + 53) / 117]
| 2022-03-15T06:47:39.865-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (89 + 28) / 117]
| 2022-03-15T06:47:40.927-05:00 | [Stage 9 (mapToPair at
BaseSparkCommitActionExecutor.java:209):> (110 + 7) / 117]
| 2022-03-15T06:47:58.272-05:00 | [Stage 12 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (1 + 3) / 4]
| 2022-03-15T06:48:04.654-05:00 | [Stage 14 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (1 + 19) / 20]
| 2022-03-15T06:48:06.493-05:00 | [Stage 14 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (3 + 17) / 20]
| 2022-03-15T06:48:16.497-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (0 + 100) / 100]
| 2022-03-15T06:48:17.592-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (35 + 65) / 100]
| 2022-03-15T06:48:19.544-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (37 + 63) / 100]
| 2022-03-15T06:48:20.588-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (51 + 49) / 100]
| 2022-03-15T06:48:21.593-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):======> (77 + 23) / 100]
| 2022-03-15T06:48:23.827-05:00 | [Stage 16 (isEmpty at
HoodieSparkSqlWriter.scala:609):=========> (93 + 7) / 100]
| 2022-03-15T06:48:37.952-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (0 + 117) / 500]
| 2022-03-15T06:48:38.995-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (68 + 117) / 500]
| 2022-03-15T06:48:40.006-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (110 + 117) / 500]
| 2022-03-15T06:48:51.385-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (116 + 117) / 500]
| 2022-03-15T06:48:52.411-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=> (155 + 117) / 500]
| 2022-03-15T06:48:53.421-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (214 + 117) / 500]
| 2022-03-15T06:48:55.455-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (231 + 116) / 500]
| 2022-03-15T06:49:04.055-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 500]
| 2022-03-15T06:49:05.162-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (237 + 117) / 500]
| 2022-03-15T06:49:06.169-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):===> (287 + 117) / 500]
| 2022-03-15T06:49:07.176-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (320 + 117) / 500]
| 2022-03-15T06:49:08.304-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (345 + 117) / 500]
| 2022-03-15T06:49:17.407-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 500]
| 2022-03-15T06:49:18.449-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (349 + 117) / 500]
| 2022-03-15T06:49:19.450-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=====> (386 + 114) / 500]
| 2022-03-15T06:49:20.457-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):======> (421 + 79) / 500]
| 2022-03-15T06:49:21.528-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (452 + 48) / 500]
| 2022-03-15T06:49:24.903-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (465 + 35) / 500]
| 2022-03-15T06:49:26.307-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (466 + 34) / 500]
| 2022-03-15T06:49:27.724-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (470 + 30) / 500]
| 2022-03-15T06:49:31.513-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (481 + 19) / 500]
| 2022-03-15T06:49:32.746-05:00 | [Stage 18 (isEmpty at
HoodieSparkSqlWriter.scala:609):========> (496 + 4) / 500]
| 2022-03-15T06:49:46.207-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (0 + 117) / 473]
| 2022-03-15T06:49:47.211-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (61 + 117) / 473]
| 2022-03-15T06:49:48.305-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (114 + 117) / 473]
| 2022-03-15T06:49:59.447-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):> (116 + 117) / 473]
| 2022-03-15T06:50:00.457-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):=> (158 + 117) / 473]
| 2022-03-15T06:50:01.463-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (205 + 117) / 473]
| 2022-03-15T06:50:02.652-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (231 + 117) / 473]
| 2022-03-15T06:50:12.642-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):==> (232 + 117) / 473]
| 2022-03-15T06:50:13.642-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):===> (261 + 117) / 473]
| 2022-03-15T06:50:14.654-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (300 + 117) / 473]
| 2022-03-15T06:50:15.668-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (335 + 117) / 473]
| 2022-03-15T06:50:17.172-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (347 + 117) / 473]
| 2022-03-15T06:50:25.789-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):====> (348 + 116) / 473]
| 2022-03-15T06:50:26.840-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):=====> (372 + 101) / 473]
| 2022-03-15T06:50:27.869-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):======> (406 + 67) / 473]
| 2022-03-15T06:50:28.903-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):=======> (445 + 28) / 473]
| 2022-03-15T06:50:31.957-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):========> (465 + 8) / 473]
| 2022-03-15T06:50:33.009-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):========> (468 + 5) / 473]
| 2022-03-15T06:50:34.267-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):========> (470 + 3) / 473]
| 2022-03-15T06:50:35.294-05:00 | [Stage 20 (isEmpty at
HoodieSparkSqlWriter.scala:609):=========>(473 + 0) / 473]
| 2022-03-15T06:50:36.433-05:00 | [Stage 22 (collect at
SparkRDDWriteClient.java:123):=====> (773 + 115) / 1098]
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]