ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812856042
@satishkotha Thanks for the tips! In fact, I was also able to reproduce this issue locally on my machine: the `insert_overwrite_table` issue against which @jsbali has raised tickets https://issues.apache.org/jira/browse/HUDI-1739 and https://issues.apache.org/jira/browse/HUDI-1740. For tracking, and maybe to help you test the fix in the future, I am pasting the script I used to reproduce the issue here. 😅

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import java.util.UUID
import java.sql.Timestamp

val tableName = "hudi_date_mor"
val basePath = "<absolute_path_to_your_hudi_folder>" // <-- fill out this value to point to your local folder (absolute path)

val writeConfigs = Map(
  "hoodie.cleaner.incremental.mode" -> "true",
  "hoodie.insert.shuffle.parallelism" -> "20",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.clean.automatic" -> "false",
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
  "hoodie.keep.max.commits" -> "3",
  "hoodie.cleaner.commits.retained" -> "1",
  "hoodie.keep.min.commits" -> "2",
  "hoodie.compact.inline.max.delta.commits" -> "1"
)

val dateSMap: Map[Int, String] = Map(
  0 -> "2020-07-01",
  1 -> "2020-08-01",
  2 -> "2020-09-01"
)

val dateMap: Map[Int, Timestamp] = Map(
  0 -> Timestamp.valueOf("2010-07-01 11:00:15"),
  1 -> Timestamp.valueOf("2010-08-01 11:00:15"),
  2 -> Timestamp.valueOf("2010-09-01 11:00:15")
)

var seq = Seq(
  (0, "value", dateMap(0), dateSMap(0), UUID.randomUUID.toString)
)
for (i <- 501 to 1000) {
  seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3), UUID.randomUUID.toString)
}

val df = seq.toDF("id", "string_column", "timestamp_column",
"date_string", "uuid") ``` Run the spark shell(the one taken from hudi quick start page and I am using spark version `spark-3.0.1-bin-hadoop2.7`): ``` ./spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' ``` Copy the above script in there and hit `df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)` 4 times and on the fifth time, it throws the anticipated `Caused by: java.lang.IllegalArgumentException: Positive number of partitions required` issue. Now, the thing is, we _still_ have to _manually_ delete the first commit file, which contains the empty `partitionToReplaceFileIds`; otherwise, it would still keep throwing the `Positive number of partitions required issue. error.` The `"hoodie.embed.timeline.server" -> "false"` _does_ help as it forces the write to refresh its timeline so we wouldn't see the second error again, which is ``` java.io.FileNotFoundException: <path_to_hoodie_folder>/.hoodie/20210403201659.replacecommit does not exist ``` However, it appears `"hoodie.embed.timeline.server" -> "false"` to be not _quite_ necessary since the _6th_ time we write, the writer is automatically being refreshed with the _newest_ timeline and it will put all `*replacecommit` files back to a status of integrity again. If we fix the empty `partitionToReplaceFileIds` issue, we might not need to dig into the `replacecommit does not exist` issue anymore since it is caused by the workaround of _manually_ deleting the empty commit file. It would fix everything from the start. However, I would still be curious to learn about _why_ we would need a `reset` of the timeline server within the `close` action upon the `HoodieTableFileSystemView`. It appears unnecessary to me and could be removed if there is no strong reason behind it. The `reset` within `close` was originally introduced in #600 after a bit of digging in that code. 
I hope that helps you narrow down the scope a little bit. Maybe @bvaradar could explain it if the memory is still fresh, since that PR is from about 2 years ago. 😅 Thanks.
