lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611288468
 
 
   Hi @tverdokhlebd, the whole upsert cost about 30min
   ```
   dcadmin-imac:hudi-debug dcadmin$ export 
SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   dcadmin-imac:hudi-debug dcadmin$ ${SPARK_HOME}/bin/spark-shell \
   >     --driver-memory 6G \
   >     --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
   >     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   Ivy Default Cache set to: /Users/dcadmin/.ivy2/cache
   The jars for the packages stored in: /Users/dcadmin/.ivy2/jars
   :: loading settings :: url = 
jar:file:/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
   org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
   org.apache.spark#spark-avro_2.11 added as a dependency
   :: resolving dependencies :: 
org.apache.spark#spark-submit-parent-204b7ab2-3e3f-4853-9220-5e038fc99fc2;1.0
        confs: [default]
        found org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating in central
        found org.apache.spark#spark-avro_2.11;2.4.4 in central
        found org.spark-project.spark#unused;1.0.0 in spark-list
   :: resolution report :: resolve 258ms :: artifacts dl 5ms
        :: modules in use:
        org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating from central in 
[default]
        org.apache.spark#spark-avro_2.11;2.4.4 from central in [default]
        org.spark-project.spark#unused;1.0.0 from spark-list in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
   :: retrieving :: 
org.apache.spark#spark-submit-parent-204b7ab2-3e3f-4853-9220-5e038fc99fc2
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/5ms)
   20/04/09 09:23:38 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   Spark context Web UI available at http://10.101.52.18:4040
   Spark context available as 'sc' (master = local[*], app id = 
local-1586395425370).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
         /_/
            
   Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.8.0_211)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> 
   
   scala> import org.apache.spark.sql.functions._
   import org.apache.spark.sql.functions._
   
   scala> 
   
   scala> val tableName = "hudi_mor_table"
   tableName: String = hudi_mor_table
   
   scala> val basePath = "file:///tmp/hudi_mor_table"
   basePath: String = file:///tmp/hudi_mor_table
   
   scala> 
   
   scala> var inputDF = spark.read.format("csv").option("header", 
"true").load("file:///work/hudi-debug/2.csv")
   inputDF: org.apache.spark.sql.DataFrame = [tds_cid: string, hit_timestamp: 
string ... 46 more fields]
   
   scala> 
   
   scala> val hudiOptions = Map[String,String](
        |   "hoodie.insert.shuffle.parallelism" -> "10",
        |   "hoodie.upsert.shuffle.parallelism" -> "10",
        |   "hoodie.delete.shuffle.parallelism" -> "10",
        |   "hoodie.bulkinsert.shuffle.parallelism" -> "10",
        |   "hoodie.datasource.write.recordkey.field" -> "tds_cid",
        |   "hoodie.datasource.write.partitionpath.field" -> "hit_date", 
        |   "hoodie.table.name" -> tableName,
        |   "hoodie.datasource.write.precombine.field" -> "hit_timestamp",
        |   "hoodie.datasource.write.operation" -> "upsert",
        |   "hoodie.memory.merge.max.size" -> "2004857600000",
        |   "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
        |   "hoodie.cleaner.fileversions.retained" -> "1"
        | )
   hudiOptions: scala.collection.immutable.Map[String,String] = 
Map(hoodie.insert.shuffle.parallelism -> 10, 
hoodie.datasource.write.precombine.field -> hit_timestamp, 
hoodie.cleaner.fileversions.retained -> 1, hoodie.delete.shuffle.parallelism -> 
10, hoodie.datasource.write.operation -> upsert, 
hoodie.datasource.write.recordkey.field -> tds_cid, hoodie.table.name -> 
hudi_mor_table, hoodie.memory.merge.max.size -> 2004857600000, 
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS, 
hoodie.upsert.shuffle.parallelism -> 10, 
hoodie.datasource.write.partitionpath.field -> hit_date, 
hoodie.bulkinsert.shuffle.parallelism -> 10)
   
   scala> 
   
   scala> inputDF.write.format("org.apache.hudi").
        |   options(hudiOptions).
        |   mode("Append").
        |   save(basePath)
   20/04/09 09:23:53 WARN Utils: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.debug.maxToStringFields' in SparkEnv.conf.
   [Stage 3:>                                                         (0 + 4) / 
10]20/04/09 09:24:55 WARN MemoryStore: Not enough space to cache rdd_16_1 in 
memory! (computed 87.6 MB so far)
   20/04/09 09:24:55 WARN BlockManager: Persisting block rdd_16_1 to disk 
instead.
   20/04/09 09:24:56 WARN MemoryStore: Not enough space to cache rdd_16_2 in 
memory! (computed 196.9 MB so far)
   20/04/09 09:24:56 WARN BlockManager: Persisting block rdd_16_2 to disk 
instead.
   20/04/09 09:24:56 WARN MemoryStore: Not enough space to cache rdd_16_0 in 
memory! (computed 130.8 MB so far)
   20/04/09 09:24:56 WARN BlockManager: Persisting block rdd_16_0 to disk 
instead.
   20/04/09 09:24:56 WARN MemoryStore: Not enough space to cache rdd_16_3 in 
memory! (computed 196.3 MB so far)
   20/04/09 09:24:56 WARN BlockManager: Persisting block rdd_16_3 to disk 
instead.
   [Stage 3:=======================>                                  (4 + 4) / 
10]20/04/09 09:25:12 WARN MemoryStore: Not enough space to cache rdd_16_6 in 
memory! (computed 58.5 MB so far)
   20/04/09 09:25:12 WARN BlockManager: Persisting block rdd_16_6 to disk 
instead.
   20/04/09 09:25:12 WARN MemoryStore: Not enough space to cache rdd_16_4 in 
memory! (computed 58.3 MB so far)
   20/04/09 09:25:12 WARN BlockManager: Persisting block rdd_16_4 to disk 
instead.
   20/04/09 09:25:12 WARN MemoryStore: Not enough space to cache rdd_16_5 in 
memory! (computed 87.6 MB so far)
   20/04/09 09:25:12 WARN BlockManager: Persisting block rdd_16_5 to disk 
instead.
   20/04/09 09:25:12 WARN MemoryStore: Failed to reserve initial memory 
threshold of 1024.0 KB for computing block rdd_16_7 in memory.
   20/04/09 09:25:12 WARN MemoryStore: Not enough space to cache rdd_16_7 in 
memory! (computed 0.0 B so far)
   20/04/09 09:25:12 WARN BlockManager: Persisting block rdd_16_7 to disk 
instead.
   [Stage 3:==============================================>           (8 + 2) / 
10]20/04/09 09:25:26 WARN MemoryStore: Not enough space to cache rdd_16_9 in 
memory! (computed 58.4 MB so far)
   20/04/09 09:25:26 WARN BlockManager: Persisting block rdd_16_9 to disk 
instead.
   20/04/09 09:25:26 WARN MemoryStore: Not enough space to cache rdd_16_8 in 
memory! (computed 87.6 MB so far)
   20/04/09 09:25:26 WARN BlockManager: Persisting block rdd_16_8 to disk 
instead.
   [Stage 14:=======================================>                (28 + 4) / 
40]20/04/09 09:44:14 WARN MemoryStore: Not enough space to cache rdd_44_30 in 
memory! (computed 39.3 MB so far)
   20/04/09 09:44:14 WARN BlockManager: Persisting block rdd_44_30 to disk 
instead.
   20/04/09 09:44:14 WARN MemoryStore: Not enough space to cache rdd_44_29 in 
memory! (computed 58.5 MB so far)
   20/04/09 09:44:14 WARN BlockManager: Persisting block rdd_44_29 to disk 
instead.
   20/04/09 09:44:14 WARN MemoryStore: Not enough space to cache rdd_44_28 in 
memory! (computed 58.6 MB so far)
   20/04/09 09:44:14 WARN BlockManager: Persisting block rdd_44_28 to disk 
instead.
   20/04/09 09:44:14 WARN MemoryStore: Not enough space to cache rdd_44_31 in 
memory! (computed 17.3 MB so far)
   20/04/09 09:44:14 WARN BlockManager: Persisting block rdd_44_31 to disk 
instead.
   [Stage 14:============================================>           (32 + 4) / 
40]20/04/09 09:44:19 WARN MemoryStore: Not enough space to cache rdd_44_33 in 
memory! (computed 26.0 MB so far)
   20/04/09 09:44:19 WARN BlockManager: Persisting block rdd_44_33 to disk 
instead.
   20/04/09 09:44:19 WARN MemoryStore: Not enough space to cache rdd_44_32 in 
memory! (computed 25.8 MB so far)
   20/04/09 09:44:19 WARN BlockManager: Persisting block rdd_44_32 to disk 
instead.
   20/04/09 09:44:19 WARN MemoryStore: Not enough space to cache rdd_44_35 in 
memory! (computed 5.2 MB so far)
   20/04/09 09:44:19 WARN BlockManager: Persisting block rdd_44_35 to disk 
instead.
   20/04/09 09:44:19 WARN MemoryStore: Not enough space to cache rdd_44_34 in 
memory! (computed 5.1 MB so far)
   20/04/09 09:44:19 WARN BlockManager: Persisting block rdd_44_34 to disk 
instead.
   [Stage 14:==================================================>     (36 + 4) / 
40]20/04/09 09:44:26 WARN MemoryStore: Not enough space to cache rdd_44_38 in 
memory! (computed 1032.0 KB so far)
   20/04/09 09:44:26 WARN BlockManager: Persisting block rdd_44_38 to disk 
instead.
   20/04/09 09:44:26 WARN MemoryStore: Not enough space to cache rdd_44_36 in 
memory! (computed 3.4 MB so far)
   20/04/09 09:44:26 WARN MemoryStore: Not enough space to cache rdd_44_37 in 
memory! (computed 3.4 MB so far)
   20/04/09 09:44:26 WARN BlockManager: Persisting block rdd_44_37 to disk 
instead.
   20/04/09 09:44:26 WARN BlockManager: Persisting block rdd_44_36 to disk 
instead.
   20/04/09 09:44:26 WARN MemoryStore: Not enough space to cache rdd_44_39 in 
memory! (computed 7.6 MB so far)
   20/04/09 09:44:26 WARN BlockManager: Persisting block rdd_44_39 to disk 
instead.
   20/04/09 09:47:19 WARN DefaultSource: Snapshot view not supported yet via 
data source, for MERGE_ON_READ tables. Please query the Hive table registered 
using Spark SQL.
   
   scala> 
   
   scala> spark.read.format("org.apache.hudi").load(basePath + 
"/2020-03-19/*").count();
   20/04/09 09:47:19 WARN DefaultSource: Snapshot view not supported yet via 
data source, for MERGE_ON_READ tables. Please query the Hive table registered 
using Spark SQL.
   res1: Long = 5045301                                                         
   
   
   scala> 
   
   scala> 
   ```
   
   
![image](https://user-images.githubusercontent.com/20113411/78850756-8ffb4880-7a4a-11ea-973d-5cb15a0c6c95.png)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to