lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-611626685
 
 
   I ran those operations with `local[2]` and 6 GB of driver memory, and they still worked fine.
   
   I ran them in my local environment, not in a Docker environment. Can you try running them in a Linux environment?
   
   ```
   dcadmin-imac:flink-1.6.3.sdk dcadmin$ ${SPARK_HOME}/bin/spark-shell     
--master 'local[2]'     --driver-memory 6G     --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   Ivy Default Cache set to: /Users/dcadmin/.ivy2/cache
   The jars for the packages stored in: /Users/dcadmin/.ivy2/jars
   :: loading settings :: url = 
jar:file:/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
   org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
   org.apache.spark#spark-avro_2.11 added as a dependency
   :: resolving dependencies :: 
org.apache.spark#spark-submit-parent-e1b9ce1d-8bb9-4da6-a29f-c5287bfad216;1.0
        confs: [default]
        found org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating in central
        found org.apache.spark#spark-avro_2.11;2.4.4 in central
        found org.spark-project.spark#unused;1.0.0 in spark-list
   :: resolution report :: resolve 226ms :: artifacts dl 5ms
        :: modules in use:
        org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating from central in 
[default]
        org.apache.spark#spark-avro_2.11;2.4.4 from central in [default]
        org.spark-project.spark#unused;1.0.0 from spark-list in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
   :: retrieving :: 
org.apache.spark#spark-submit-parent-e1b9ce1d-8bb9-4da6-a29f-c5287bfad216
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/5ms)
   20/04/09 23:59:12 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   20/04/09 23:59:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
   Spark context Web UI available at http://10.101.52.18:4041
   Spark context available as 'sc' (master = local[2], app id = 
local-1586447958208).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
         /_/
            
   Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
1.8.0_211)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> 
   
   scala> import org.apache.spark.sql.functions._
   import org.apache.spark.sql.functions._
   
   scala> 
   
   scala> val tableName = "hudi_mor_table"
   tableName: String = hudi_mor_table
   
   scala> val basePath = "file:///tmp/hudi_mor_table"
   basePath: String = file:///tmp/hudi_mor_table
   
   scala> 
   
   scala> var inputDF = spark.read.format("csv").option("header", 
"true").load("file:///work/hudi-debug/2.csv")
   inputDF: org.apache.spark.sql.DataFrame = [tds_cid: string, hit_timestamp: 
string ... 46 more fields]
   
   scala> 
   
   scala> val hudiOptions = Map[String,String](
        |   "hoodie.insert.shuffle.parallelism" -> "10",
        |   "hoodie.upsert.shuffle.parallelism" -> "10",
        |   "hoodie.delete.shuffle.parallelism" -> "10",
        |   "hoodie.bulkinsert.shuffle.parallelism" -> "10",
        |   "hoodie.datasource.write.recordkey.field" -> "tds_cid",
        |   "hoodie.datasource.write.partitionpath.field" -> "hit_date", 
        |   "hoodie.table.name" -> tableName,
        |   "hoodie.datasource.write.precombine.field" -> "hit_timestamp",
        |   "hoodie.datasource.write.operation" -> "upsert",
        |   "hoodie.memory.merge.max.size" -> "2004857600000",
        |   "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
        |   "hoodie.cleaner.fileversions.retained" -> "1"
        | )
   hudiOptions: scala.collection.immutable.Map[String,String] = 
Map(hoodie.insert.shuffle.parallelism -> 10, 
hoodie.datasource.write.precombine.field -> hit_timestamp, 
hoodie.cleaner.fileversions.retained -> 1, hoodie.delete.shuffle.parallelism -> 
10, hoodie.datasource.write.operation -> upsert, 
hoodie.datasource.write.recordkey.field -> tds_cid, hoodie.table.name -> 
hudi_mor_table, hoodie.memory.merge.max.size -> 2004857600000, 
hoodie.cleaner.policy -> KEEP_LATEST_FILE_VERSIONS, 
hoodie.upsert.shuffle.parallelism -> 10, 
hoodie.datasource.write.partitionpath.field -> hit_date, 
hoodie.bulkinsert.shuffle.parallelism -> 10)
   
   scala> 
   
   scala> inputDF.write.format("org.apache.hudi").
        |   options(hudiOptions).
        |   mode("Append").
        |   save(basePath)
   20/04/10 00:01:07 WARN Utils: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.debug.maxToStringFields' in SparkEnv.conf.
   [Stage 3:==================================>                       (6 + 2) / 
10]20/04/10 00:02:58 WARN MemoryStore: Not enough space to cache rdd_16_6 in 
memory! (computed 58.5 MB so far)
   20/04/10 00:02:58 WARN BlockManager: Persisting block rdd_16_6 to disk 
instead.
   20/04/10 00:02:58 WARN MemoryStore: Not enough space to cache rdd_16_7 in 
memory! (computed 58.1 MB so far)
   20/04/10 00:02:58 WARN BlockManager: Persisting block rdd_16_7 to disk 
instead.
   [Stage 3:==============================================>           (8 + 2) / 
10]20/04/10 00:03:12 WARN MemoryStore: Not enough space to cache rdd_16_8 in 
memory! (computed 17.3 MB so far)
   20/04/10 00:03:12 WARN BlockManager: Persisting block rdd_16_8 to disk 
instead.
   20/04/10 00:03:12 WARN MemoryStore: Not enough space to cache rdd_16_9 in 
memory! (computed 131.4 MB so far)
   20/04/10 00:03:12 WARN BlockManager: Persisting block rdd_16_9 to disk 
instead.
   [Stage 14:============================================>           (32 + 2) / 
40]20/04/10 00:30:38 WARN MemoryStore: Not enough space to cache rdd_44_32 in 
memory! (computed 58.3 MB so far)
   20/04/10 00:30:38 WARN BlockManager: Persisting block rdd_44_32 to disk 
instead.
   20/04/10 00:30:38 WARN MemoryStore: Not enough space to cache rdd_44_33 in 
memory! (computed 58.2 MB so far)
   20/04/10 00:30:38 WARN BlockManager: Persisting block rdd_44_33 to disk 
instead.
   [Stage 14:=================================================>      (35 + 2) / 
40]20/04/10 00:30:42 WARN MemoryStore: Not enough space to cache rdd_44_35 in 
memory! (computed 11.6 MB so far)
   20/04/10 00:30:42 WARN BlockManager: Persisting block rdd_44_35 to disk 
instead.
   20/04/10 00:30:43 WARN MemoryStore: Not enough space to cache rdd_44_36 in 
memory! (computed 11.6 MB so far)
   20/04/10 00:30:43 WARN BlockManager: Persisting block rdd_44_36 to disk 
instead.
   [Stage 14:=====================================================>  (38 + 2) / 
40]20/04/10 00:30:51 WARN MemoryStore: Not enough space to cache rdd_44_38 in 
memory! (computed 58.8 MB so far)
   20/04/10 00:30:51 WARN BlockManager: Persisting block rdd_44_38 to disk 
instead.
   20/04/10 00:30:51 WARN MemoryStore: Not enough space to cache rdd_44_39 in 
memory! (computed 11.4 MB so far)
   20/04/10 00:30:51 WARN BlockManager: Persisting block rdd_44_39 to disk 
instead.
   20/04/10 00:33:53 WARN DefaultSource: Snapshot view not supported yet via 
data source, for MERGE_ON_READ tables. Please query the Hive table registered 
using Spark SQL.
   
   scala> spark.read.format("org.apache.hudi").load(basePath + 
"/2020-03-19/*").count();
   20/04/10 00:34:10 WARN DefaultSource: Snapshot view not supported yet via 
data source, for MERGE_ON_READ tables. Please query the Hive table registered 
using Spark SQL.
   res1: Long = 5087127  
   ```
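   
   In case it helps you reproduce this on your side, below is the same session condensed into a copy-pasteable snippet for a spark-shell started with the command above. It is only a sketch of my run: the CSV path, the `tds_cid` record key, the `hit_date` partition field, and the `hit_timestamp` precombine field come from my test data, so swap in your own paths and columns.
   
   ```
   // Condensed version of the session above; run inside spark-shell
   // (launched with --master 'local[2]' --driver-memory 6G and the Hudi/Avro packages).
   // NOTE: the input path and the key/partition/precombine columns are from my test CSV.
   val tableName = "hudi_mor_table"
   val basePath  = "file:///tmp/hudi_mor_table"
   
   // Read the raw CSV with a header row.
   val inputDF = spark.read.format("csv")
     .option("header", "true")
     .load("file:///work/hudi-debug/2.csv")
   
   // Same write options as in the log: upsert into a table keyed by tds_cid,
   // partitioned by hit_date, deduplicated on hit_timestamp, keeping one file version.
   val hudiOptions = Map[String, String](
     "hoodie.insert.shuffle.parallelism"           -> "10",
     "hoodie.upsert.shuffle.parallelism"           -> "10",
     "hoodie.delete.shuffle.parallelism"           -> "10",
     "hoodie.bulkinsert.shuffle.parallelism"       -> "10",
     "hoodie.datasource.write.recordkey.field"     -> "tds_cid",
     "hoodie.datasource.write.partitionpath.field" -> "hit_date",
     "hoodie.table.name"                           -> tableName,
     "hoodie.datasource.write.precombine.field"    -> "hit_timestamp",
     "hoodie.datasource.write.operation"           -> "upsert",
     "hoodie.memory.merge.max.size"                -> "2004857600000",
     "hoodie.cleaner.policy"                       -> "KEEP_LATEST_FILE_VERSIONS",
     "hoodie.cleaner.fileversions.retained"        -> "1"
   )
   
   // Upsert, then read one partition back to sanity-check the row count.
   inputDF.write.format("org.apache.hudi")
     .options(hudiOptions)
     .mode("Append")
     .save(basePath)
   
   val rows = spark.read.format("org.apache.hudi")
     .load(basePath + "/2020-03-19/*")
     .count()
   println(s"rows in partition 2020-03-19: $rows")
   ```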
