lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611626685

I ran those operations with `local[2]` and 6GB of driver memory, and they still worked fine. I ran them in a local environment, not in Docker. Can you try running them in a Linux environment?

```
dcadmin-imac:flink-1.6.3.sdk dcadmin$ ${SPARK_HOME}/bin/spark-shell --master 'local[2]' --driver-memory 6G --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Ivy Default Cache set to: /Users/dcadmin/.ivy2/cache
The jars for the packages stored in: /Users/dcadmin/.ivy2/jars
:: loading settings :: url = jar:file:/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e1b9ce1d-8bb9-4da6-a29f-c5287bfad216;1.0
	confs: [default]
	found org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating in central
	found org.apache.spark#spark-avro_2.11;2.4.4 in central
	found org.spark-project.spark#unused;1.0.0 in spark-list
:: resolution report :: resolve 226ms :: artifacts dl 5ms
	:: modules in use:
	org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating from central in [default]
	org.apache.spark#spark-avro_2.11;2.4.4 from central in [default]
	org.spark-project.spark#unused;1.0.0 from spark-list in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e1b9ce1d-8bb9-4da6-a29f-c5287bfad216
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/5ms)
20/04/09 23:59:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/04/09 23:59:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://10.101.52.18:4041
Spark context available as 'sc' (master = local[2], app id = local-1586447958208).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val tableName = "hudi_mor_table"
tableName: String = hudi_mor_table

scala> val basePath = "file:///tmp/hudi_mor_table"
basePath: String = file:///tmp/hudi_mor_table

scala> var inputDF = spark.read.format("csv").option("header", "true").load("file:///work/hudi-debug/2.csv")
inputDF: org.apache.spark.sql.DataFrame = [tds_cid: string, hit_timestamp: string ... 46 more fields]
scala> val hudiOptions = Map[String,String](
     |   "hoodie.insert.shuffle.parallelism" -> "10",
     |   "hoodie.upsert.shuffle.parallelism" -> "10",
     |   "hoodie.delete.shuffle.parallelism" -> "10",
     |   "hoodie.bulkinsert.shuffle.parallelism" -> "10",
     |   "hoodie.datasource.write.recordkey.field" -> "tds_cid",
     |   "hoodie.datasource.write.partitionpath.field" -> "hit_date",
     |   "hoodie.table.name" -> tableName,
     |   "hoodie.datasource.write.precombine.field" -> "hit_timestamp",
     |   "hoodie.datasource.write.operation" -> "upsert",
     |   "hoodie.memory.merge.max.size" -> "2004857600000",
     |   "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
     |   "hoodie.cleaner.fileversions.retained" -> "1"
     | )
hudiOptions: scala.collection.immutable.Map[String,String] = Map(hoodie.insert.shuffle.parallelism -> 10, hoodie.datasource.write.precombine.field -> hit_timestamp, hoodie.cleaner.fileversions.retained -> 1, hoodie.delete.shuffle.parallelism -> 10, hoodie.datasource.write.operation -> upsert, hoodie.datasource.write.recordkey.field -> tds_cid, hoodie.table.name -> hudi_mor_table, hoodie.memory.merge.max.size -> 2004857600000, hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS, hoodie.upsert.shuffle.parallelism -> 10, hoodie.datasource.write.partitionpath.field -> hit_date, hoodie.bulkinsert.shuffle.parallelism -> 10)

scala> inputDF.write.format("org.apache.hudi").
     |   options(hudiOptions).
     |   mode("Append").
     |   save(basePath)
20/04/10 00:01:07 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
[Stage 3:==================================>             (6 + 2) / 10]20/04/10 00:02:58 WARN MemoryStore: Not enough space to cache rdd_16_6 in memory! (computed 58.5 MB so far)
20/04/10 00:02:58 WARN BlockManager: Persisting block rdd_16_6 to disk instead.
20/04/10 00:02:58 WARN MemoryStore: Not enough space to cache rdd_16_7 in memory! (computed 58.1 MB so far)
20/04/10 00:02:58 WARN BlockManager: Persisting block rdd_16_7 to disk instead.
[Stage 3:==============================================> (8 + 2) / 10]20/04/10 00:03:12 WARN MemoryStore: Not enough space to cache rdd_16_8 in memory! (computed 17.3 MB so far)
20/04/10 00:03:12 WARN BlockManager: Persisting block rdd_16_8 to disk instead.
20/04/10 00:03:12 WARN MemoryStore: Not enough space to cache rdd_16_9 in memory! (computed 131.4 MB so far)
20/04/10 00:03:12 WARN BlockManager: Persisting block rdd_16_9 to disk instead.
[Stage 14:============================================> (32 + 2) / 40]20/04/10 00:30:38 WARN MemoryStore: Not enough space to cache rdd_44_32 in memory! (computed 58.3 MB so far)
20/04/10 00:30:38 WARN BlockManager: Persisting block rdd_44_32 to disk instead.
20/04/10 00:30:38 WARN MemoryStore: Not enough space to cache rdd_44_33 in memory! (computed 58.2 MB so far)
20/04/10 00:30:38 WARN BlockManager: Persisting block rdd_44_33 to disk instead.
[Stage 14:=================================================> (35 + 2) / 40]20/04/10 00:30:42 WARN MemoryStore: Not enough space to cache rdd_44_35 in memory! (computed 11.6 MB so far)
20/04/10 00:30:42 WARN BlockManager: Persisting block rdd_44_35 to disk instead.
20/04/10 00:30:43 WARN MemoryStore: Not enough space to cache rdd_44_36 in memory! (computed 11.6 MB so far)
20/04/10 00:30:43 WARN BlockManager: Persisting block rdd_44_36 to disk instead.
[Stage 14:=====================================================> (38 + 2) / 40]20/04/10 00:30:51 WARN MemoryStore: Not enough space to cache rdd_44_38 in memory! (computed 58.8 MB so far)
20/04/10 00:30:51 WARN BlockManager: Persisting block rdd_44_38 to disk instead.
20/04/10 00:30:51 WARN MemoryStore: Not enough space to cache rdd_44_39 in memory! (computed 11.4 MB so far)
20/04/10 00:30:51 WARN BlockManager: Persisting block rdd_44_39 to disk instead.
20/04/10 00:33:53 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

scala> spark.read.format("org.apache.hudi").load(basePath + "/2020-03-19/*").count();
20/04/10 00:34:10 WARN DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
res1: Long = 5087127
```
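One side note on the `DefaultSource` warnings above: in 0.5.1 the data source read path only serves the read-optimized view of a MERGE_ON_READ table, so the count at the end reflects base files only. To get the merged (snapshot) view you would query the Hive-registered table through Spark SQL, as the warning says. A minimal sketch, assuming Hive sync was enabled for the write and registered a real-time view named `hudi_mor_table_rt` (the `_rt` suffix is the usual convention, but the exact table name depends on your Hudi version and hive-sync settings):

```
// Hedged sketch, not part of the run above: count rows in the merged (real-time)
// view via the Hive-registered table. "hudi_mor_table_rt" is an assumed name.
val rtCount = spark.sql("SELECT COUNT(*) FROM hudi_mor_table_rt").first().getLong(0)
println(s"snapshot (real-time) row count: $rtCount")
```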