vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-585991857

There must be something else going on. I just used my own benchmark jobs to generate a pattern where the records are fully overwritten in a second (and a third) batch, and it actually finishes fine.

```
hudi:hoodie_benchmark->connect --path file:///tmp/hudi-benchmark/output/org.apache.hudi
35394 [Spring Shell] INFO org.apache.hudi.common.table.HoodieTableMetaClient - Loading HoodieTableMetaClient from file:///tmp/hudi-benchmark/output/org.apache.hudi
35415 [Spring Shell] INFO org.apache.hudi.common.util.FSUtils - Hadoop Configuration: fs.defaultFS: [file:///], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [org.apache.hadoop.fs.LocalFileSystem@6851d345]
35416 [Spring Shell] INFO org.apache.hudi.common.table.HoodieTableConfig - Loading table properties from file:/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/hoodie.properties
35416 [Spring Shell] INFO org.apache.hudi.common.table.HoodieTableMetaClient - Finished Loading Table of type COPY_ON_WRITE(version=1) from file:///tmp/hudi-benchmark/output/org.apache.hudi
Metadata for table hoodie_benchmark loaded
hudi:hoodie_benchmark->commits show
36774 [Spring Shell] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200213134159__clean__COMPLETED], [20200213134159__commit__COMPLETED], [20200213134410__clean__COMPLETED], [20200213134410__commit__COMPLETED], [20200213134548__clean__COMPLETED], [20200213134548__commit__COMPLETED]]
╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
║ 20200213134548 │ 384.8 MB            │ 0                 │ 34                  │ 3                        │ 4080024               │ 1211376                      │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20200213134410 │ 379.9 MB            │ 0                 │ 34                  │ 3                        │ 4040016               │ 1199234                      │ 0            ║
╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
║ 20200213134159 │ 374.8 MB            │ 34                │ 0                   │ 3                        │ 4000008               │ 0                            │ 0            ║
╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
hudi:hoodie_benchmark->
```

And the times below, in ms:

```
grep -n -e totalCreateTime -e totalUpsertTime /tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/*.commit
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134159.commit:697:  "totalCreateTime" : 195060,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134159.commit:698:  "totalUpsertTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134410.commit:697:  "totalCreateTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134410.commit:698:  "totalUpsertTime" : 193693,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134548.commit:697:  "totalCreateTime" : 0,
/tmp/hudi-benchmark/output/org.apache.hudi/.hoodie/20200213134548.commit:698:  "totalUpsertTime" : 182277,
```

Can we drill into your dataset? Are you generating tons of files due to granular partitioning? Can you share the Spark UI and the Hudi CLI output like above?
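
If it helps when collecting the same numbers from your side, here is a minimal Python sketch of the grep above. It assumes only that the timing keys appear in the `.commit` text as `"key" : number`, as in the grep output; the names `extract_timings` and `to_minutes` are hypothetical helpers, not part of Hudi.

```python
import re

# Matches lines such as:  "totalCreateTime" : 195060,
# Assumption: the keys appear in this textual form, as in the grep output.
TIMING_RE = re.compile(r'"(totalCreateTime|totalUpsertTime)"\s*:\s*(\d+)')


def extract_timings(commit_text):
    """Return {key: milliseconds} for the timing keys found in the text.

    If a key occurs more than once, the last occurrence wins.
    """
    return {key: int(ms) for key, ms in TIMING_RE.findall(commit_text)}


def to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60000.0


# Sample text mirroring the two lines reported for 20200213134410.commit.
sample = '''
  "totalCreateTime" : 0,
  "totalUpsertTime" : 193693,
'''

timings = extract_timings(sample)
for key, ms in sorted(timings.items()):
    print(f"{key}: {ms} ms = {to_minutes(ms):.2f} min")
```

To run it on a real table, read each `.commit` file under `.hoodie` as text and pass the contents to `extract_timings`. In the benchmark above, each batch took roughly three minutes.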
