Re: [I] [SUPPORT] Process killed with no additional info when loading large parquet files in Spark [hudi]

via GitHub Mon, 19 Feb 2024 10:17:59 -0800


alberttwong commented on issue #10697:
URL: https://github.com/apache/hudi/issues/10697#issuecomment-1952981834


   Upgrading from Hudi 0.11 to 0.14.1:
   
   ```
   [root@spark-hudi bin]# spark-shell \
     --packages org.apache.hudi:hudi-spark$SPARK_VERSION-bundle_2.12:0.14.1 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
     --conf 'spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar' \
     --driver-memory 4G
   WARNING: An illegal reflective access operation has occurred
   WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/spark-3.2.1-bin-hadoop3.2/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
   WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
   WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
   WARNING: All illegal access operations will be denied in a future release
   :: loading settings :: url = jar:file:/spark-3.2.1-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
   Ivy Default Cache set to: /root/.ivy2/cache
   The jars for the packages stored in: /root/.ivy2/jars
   org.apache.hudi#hudi-spark3.2-bundle_2.12 added as a dependency
   :: resolving dependencies :: org.apache.spark#spark-submit-parent-9b4a8c4b-e4e2-4b55-b29b-cacc399b9481;1.0
           confs: [default]
           found org.apache.hudi#hudi-spark3.2-bundle_2.12;0.14.1 in central
   :: resolution report :: resolve 202ms :: artifacts dl 2ms
           :: modules in use:
           org.apache.hudi#hudi-spark3.2-bundle_2.12;0.14.1 from central in [default]
           ---------------------------------------------------------------------
           |                  |            modules            ||   artifacts   |
           |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
           ---------------------------------------------------------------------
           |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
           ---------------------------------------------------------------------
   :: retrieving :: org.apache.spark#spark-submit-parent-9b4a8c4b-e4e2-4b55-b29b-cacc399b9481
           confs: [default]
           0 artifacts copied, 1 already retrieved (0kB/7ms)
   24/02/19 18:15:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   24/02/19 18:15:57 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
   Spark context Web UI available at http://spark-hudi:4042
   Spark context available as 'sc' (master = local[*], app id = local-1708366558050).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
         /_/
            
   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.16.1)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> import org.apache.spark.sql.functions._
   import org.apache.spark.sql.functions._
   
   scala> import org.apache.spark.sql.types._
   import org.apache.spark.sql.types._
   
   scala> import org.apache.spark.sql.Row
   import org.apache.spark.sql.Row
   
   scala> import org.apache.spark.sql.SaveMode._
   import org.apache.spark.sql.SaveMode._
   
   scala> import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceReadOptions._
   
   scala> import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   
   scala> import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   scala> import scala.collection.JavaConversions._
   import scala.collection.JavaConversions._
   
   scala> 
   
   scala> val df = spark.read.parquet("s3a://huditest/user_behavior_sample_data.parquet")
   df: org.apache.spark.sql.DataFrame = [UserID: bigint, ItemID: bigint ... 3 more fields]
   
   scala> 
   
   scala> val databaseName = "hudi_sample"
   databaseName: String = hudi_sample
   
   scala> val tableName = "hudi_coders_hive"
   tableName: String = hudi_coders_hive
   
   scala> val basePath = "s3a://huditest/hudi_coders"
   basePath: String = s3a://huditest/hudi_coders
   
   scala> 
   
   scala> df.write.format("hudi").
        |   option(org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME, tableName).
        |   option(RECORDKEY_FIELD_OPT_KEY, "UserID").
        |   option(PRECOMBINE_FIELD_OPT_KEY, "UserID").
        |   option("hoodie.datasource.hive_sync.enable", "true").
        |   option("hoodie.datasource.hive_sync.mode", "hms").
        |   option("hoodie.datasource.hive_sync.database", databaseName).
        |   option("hoodie.datasource.hive_sync.table", tableName).
        |   option("hoodie.datasource.hive_sync.metastore.uris", "thrift://hive-metastore:9083").
        |   option("fs.defaultFS", "s3://huditest/").
        |   mode(Overwrite).
        |   save(basePath)
   warning: one deprecation; for details, enable `:setting -deprecation' or `:replay -deprecation'
   24/02/19 18:16:18 WARN HoodieSparkSqlWriterInternal: hoodie table at s3a://huditest/hudi_coders already exists. Deleting existing data & overwriting with new data.
   24/02/19 18:16:21 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to hudi_coders/.hoodie/metadata/files/.files-0000-0_00000000000000010.log.1_0-0-0. This is unsupported
   /spark/bin/spark-shell: line 47: 12322 Killed                  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
   ```
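
The `Killed` line at the end of the log is bash reporting that the `spark-submit` child was terminated by SIGKILL. On Linux that is often the kernel OOM killer, but that is an assumption here, since the log gives no further detail. A minimal sketch for narrowing it down follows; `describe_exit` is a hypothetical helper name, not part of Spark or Hudi, and the `dmesg` check is left commented out because reading the kernel log may require root.

```shell
#!/bin/sh
# Decode a shell exit status. By shell convention, a status above 128
# means the command was terminated by signal number (status - 128).
describe_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "terminated by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited normally with status $status"
    fi
}

# Usage (assumption: run right after the command that died):
#   spark-shell ...; describe_exit $?

# If the signal is KILL, the kernel log may say whether the OOM killer
# was responsible (may need root):
#   dmesg -T | grep -i -E 'out of memory|killed process'
```

A status of 137 (128 + 9) would correspond to the `Killed` message above. If the kernel log does name the JVM's PID, the host or container memory limit is the place to look rather than Hudi itself.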


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
