lamber-ken edited a comment on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611116577
 
 
   hi @tverdokhlebd, it works fine in my local env, just some warnings, no OOM
   
   ```
   20/04/09 02:07:32 WARN BlockManager: Persisting block rdd_48_33 to disk instead.
   [Stage 16:==================================================>     (36 + 4) / 40]20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_37 in memory! (computed 58.0 MB so far)
   20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_37 to disk instead.
   [Stage 16:===================================================>    (37 + 3) / 40]20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_38 in memory! (computed 39.2 MB so far)
   20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_38 to disk instead.
   20/04/09 02:07:38 WARN MemoryStore: Not enough space to cache rdd_48_39 in memory! (computed 11.5 MB so far)
   20/04/09 02:07:38 WARN BlockManager: Persisting block rdd_48_39 to disk instead.
   20/04/09 02:10:27 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
   
   scala> spark.read.format("org.apache.hudi").load(basePath + "/2020-03-19/*").count();
   20/04/09 02:22:07 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
   ```
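   
   The `MemoryStore` / `BlockManager` warnings above only mean that cached RDD blocks spilled to disk; the job still completes. If you want to check what the session was actually launched with, a minimal sketch from inside spark-shell (these are standard Spark conf keys; the printed defaults below are assumptions for an unconfigured session):
   
   ```
   // Inspect the memory settings of the running spark-shell session.
   // If a key was not set at launch, Spark falls back to its default (1g for both).
   println(spark.conf.getOption("spark.driver.memory").getOrElse("default (1g)"))
   println(spark.conf.getOption("spark.executor.memory").getOrElse("default (1g)"))
   ```
   
   Raising these at launch (e.g. `--driver-memory` / `--executor-memory`) reduces the spill warnings, but spilling itself is not an error.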
   
   ### Upsert command
   ```
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_table"
   
   val inputDF = spark.read.format("csv").option("header", "true").load("file:///work/hudi-debug/2.csv")
   
   val hudiOptions = Map[String,String](
     "hoodie.insert.shuffle.parallelism" -> "10",
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.delete.shuffle.parallelism" -> "10",
     "hoodie.bulkinsert.shuffle.parallelism" -> "10",
     "hoodie.datasource.write.recordkey.field" -> "tds_cid",
     "hoodie.datasource.write.partitionpath.field" -> "hit_date", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "hit_timestamp",
     "hoodie.datasource.write.operation" -> "upsert",
     "hoodie.memory.merge.max.size" -> "2004857600000"
   )
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/2020-03-19/*").count();
   ```
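   
   Per the `DefaultSource` warning in the log, snapshot queries on MERGE_ON_READ tables were not yet supported through the data source at this point; the suggested workaround is to query the Hive-registered table through Spark SQL. A minimal sketch, assuming the table has been synced to Hive under the name `hudi_mor_table` (hive sync is not enabled in the options above, so this is an assumption):
   
   ```
   // Assumes Hive sync has registered the table as hudi_mor_table
   // (e.g. via hoodie.datasource.hive_sync.enable -> "true"; not set in hudiOptions above).
   spark.sql("SELECT COUNT(*) FROM hudi_mor_table").show()
   ```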
   
