ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812856042
@satishkotha Thanks for the tips! In fact, I was also able to reproduce this issue locally on my machine: the `insert_overwrite_table` issue against which @jsbali has raised tickets https://issues.apache.org/jira/browse/HUDI-1739 and https://issues.apache.org/jira/browse/HUDI-1740. For tracking, and maybe to help you test the fix in the future, I am pasting the script I used to reproduce the issue here. 😅

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import java.util.UUID
import java.sql.Timestamp

val tableName = "hudi_date_mor"
val basePath = "<absolute_path_to_your_hudi_folder>" // <-- fill out this value to point to your local folder (absolute path)

val writeConfigs = Map(
  "hoodie.cleaner.incremental.mode" -> "true",
  "hoodie.insert.shuffle.parallelism" -> "20",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.clean.automatic" -> "false",
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
  "hoodie.keep.max.commits" -> "3",
  "hoodie.cleaner.commits.retained" -> "1",
  "hoodie.keep.min.commits" -> "2",
  "hoodie.compact.inline.max.delta.commits" -> "1"
)

val dateSMap: Map[Int, String] = Map(
  0 -> "2020-07-01",
  1 -> "2020-08-01",
  2 -> "2020-09-01"
)

val dateMap: Map[Int, Timestamp] = Map(
  0 -> Timestamp.valueOf("2010-07-01 11:00:15"),
  1 -> Timestamp.valueOf("2010-08-01 11:00:15"),
  2 -> Timestamp.valueOf("2010-09-01 11:00:15")
)

var seq = Seq(
  (0, "value", dateMap(0), dateSMap(0), UUID.randomUUID.toString)
)
for (i <- 501 to 1000) {
  seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3), UUID.randomUUID.toString)
}

val df = seq.toDF("id", "string_column", "timestamp_column",
"date_string", "uuid") ``` Run the spark shell(the one taken from hudi quick start page and I am using spark version `spark-3.0.1-bin-hadoop2.7`): ``` ./spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' ``` Copy the above script in there and hit `df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)` 4 times and on the fifth time, it throws the anticipated `Caused by: java.lang.IllegalArgumentException: Positive number of partitions required` issue. Now, the thing is, we _still_ have to _manually_ delete the first commit file, which contains the empty `partitionToReplaceFileIds`; otherwise, it would still keep throwing the `Positive number of partitions required issue. error.` The `"hoodie.embed.timeline.server" -> "false"` _does_ help as it forces the write to refresh its timeline so we wouldn't see the second error again, which is ``` java.io.FileNotFoundException: <path_to_hoodie_folder>/.hoodie/20210403201659.replacecommit does not exist ``` However, it appears `"hoodie.embed.timeline.server" -> "false"` to be not _quite_ necessary since the _6th_ time we write, the writer is automatically being refreshed with the _newest_ timeline and it will put all `*replacecommit` files back to a status of integrity again. If we fix the empty `partitionToReplaceFileIds` issue, we might not need to dig into the `replacecommit does not exist` issue anymore since it is caused by the workaround of _manually_ deleting the empty commit file. It would fix everything from the start. However, I would still be curious to learn about _why_ we would need a `reset` of the timeline server within the `close` action upon the `HoodieTableFileSystemView`. It appears unnecessary to me and could be removed if there is no strong reason behind it. The `reset` within `close` was originally introduced in #600 after a bit of digging in that code. 
I hope that helps you narrow down the scope a little bit. Maybe @bvaradar could explain it if the memory is still fresh, since that PR is from about 2 years ago. 😅 Thanks.
