yihua commented on issue #6919:
URL: https://github.com/apache/hudi/issues/6919#issuecomment-1276883734

   @tommy810pp You may use multiple cores per executor for the Spark job.  In 
that case, you should ensure that each executor is allocated enough memory to 
avoid OOM.  For example, if you use 10 m5.4xlarge instances (16 cores per 
instance) in an EMR cluster, you can easily ingest hundreds of GB of data with 
the following setup:
   ```
   ./bin/spark-shell  \
        --master yarn \
        --deploy-mode client \
        --driver-memory 50g \
        --executor-memory 50g \
        --num-executors 10 \
        --executor-cores 16 \
        --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryoserializer.buffer=256m \
        --conf spark.kryoserializer.buffer.max=1024m \
        --conf spark.rdd.compress=true \
        --conf spark.memory.storageFraction=0.8 \
        --conf "spark.driver.defaultJavaOptions=-XX:+UseG1GC" \
        --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
        --conf spark.ui.proxyBase="" \
        --conf spark.eventLog.enabled=true \
        --conf spark.eventLog.dir=hdfs:///var/log/spark/apps \
        --conf spark.sql.hive.convertMetastoreParquet=false \
        --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
        --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
   ```
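
   Once the shell is up, a Hudi write from that session would look roughly like the sketch below. This is only a minimal illustration: the DataFrame `df`, the field names (`uuid`, `ts`, `partitionpath`), the table name, and the target path are placeholders to replace with your own dataset's values.
   ```
   // Minimal Hudi upsert from the spark-shell session above (Scala).
   // Assumes a DataFrame `df` containing columns `uuid`, `ts`, and `partitionpath`;
   // adjust the record key, precombine field, partition path, and base path as needed.
   import org.apache.spark.sql.SaveMode

   df.write.format("hudi").
     option("hoodie.table.name", "my_hudi_table").
     option("hoodie.datasource.write.recordkey.field", "uuid").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.datasource.write.partitionpath.field", "partitionpath").
     option("hoodie.datasource.write.operation", "upsert").
     mode(SaveMode.Append).
     save("s3://my-bucket/my_hudi_table")
   ```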

