lamber-ken edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611116577

Hi @tverdokhlebd, it works fine in my local env, just some warnings, no OOM:

```
20/04/09 02:07:32 WARN BlockManager: Persisting block rdd_48_33 to disk instead.
[Stage 16:==================================================>   (36 + 4) / 40]
20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_37 in memory! (computed 58.0 MB so far)
20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_37 to disk instead.
[Stage 16:===================================================>  (37 + 3) / 40]
20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_38 in memory! (computed 39.2 MB so far)
20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_38 to disk instead.
20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_39 in memory! (computed 11.5 MB so far)
20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_39 to disk instead.
20/04/09 02:10:27 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

scala> spark.read.format("org.apache.hudi").load(basePath + "/2020-03-19/*").count();
20/04/09 02:22:07 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
```

### Upsert command

```
import org.apache.spark.sql.functions._

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"

var inputDF = spark.read.format("csv").option("header", "true").load("file:///work/hudi-debug/2.csv")

val hudiOptions = Map[String, String](
  "hoodie.insert.shuffle.parallelism" -> "10",
  "hoodie.upsert.shuffle.parallelism" -> "10",
  "hoodie.delete.shuffle.parallelism" -> "10",
  "hoodie.bulkinsert.shuffle.parallelism" -> "10",
  "hoodie.datasource.write.recordkey.field" -> "tds_cid",
  "hoodie.datasource.write.partitionpath.field" -> "hit_date",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.precombine.field" -> "hit_timestamp",
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.memory.merge.max.size" -> "2004857600000"
)

inputDF.write.format("org.apache.hudi").
  options(hudiOptions).
  mode("Append").
  save(basePath)

spark.read.format("org.apache.hudi").load(basePath + "/2020-03-19/*").count();
```
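As an aside, a minimal sketch (plain Scala, no Spark or Hudi dependencies) of what the `hoodie.datasource.write.precombine.field` option above does conceptually: when an upsert batch contains duplicate record keys (`tds_cid`), the row with the largest precombine value (`hit_timestamp`) wins. The `Row` case class and `precombine` helper here are hypothetical illustrations, not Hudi's actual implementation.

```scala
// Hypothetical illustration of precombine semantics during upsert:
// group rows by record key, keep the row with the max precombine value.
case class Row(tds_cid: String, hit_timestamp: Long, payload: String)

def precombine(rows: Seq[Row]): Seq[Row] =
  rows.groupBy(_.tds_cid)         // group by record key
      .values
      .map(_.maxBy(_.hit_timestamp)) // latest hit_timestamp wins
      .toSeq

val deduped = precombine(Seq(
  Row("c1", 100L, "old"),
  Row("c1", 200L, "new"),
  Row("c2",  50L, "only")
))
// deduped retains ("c1", 200L, "new") and ("c2", 50L, "only")
```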
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected]
With regards,
Apache Git Services
