chandu-1101 opened a new issue, #8335:
URL: https://github.com/apache/iceberg/issues/8335
### Apache Iceberg version
1.3.1 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
The issue does not occur when the Iceberg table is created without partitioning.
Iceberg version: `org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1`
EMR: `6.9.0`
Spark: `3.3.0`
1. On AWS S3 we have Parquet files partitioned by `__created_date_`
(year-month format). The dates range from 2019-02 to 2023-08, and the
partition folder sizes range from 500 KB to 1.3 GB.
2. Each row in the Parquet files is 1 kB to 6 kB.
3. When we tried to ingest this data into the Iceberg table:
   i. with partitioning --> the ingestion fails. The code is below. Tried with
executor memory of 1 G, 3 G, 5 G, and 10 G, each with 1 core, and a 1 G, 1-core driver.
   ii. without partitioning --> the ingestion succeeds with 2 G executor memory,
1 core, and a 1 G, 1-core driver (a sketch of this variant appears after this list).
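For contrast, here is a rough sketch of the unpartitioned variant from 3.ii, reusing the same session and `snapshot` temp view set up in the Code section below. The table name and location in this sketch are placeholders, not the exact ones we used:
```
// Unpartitioned CTAS sketch: same data, no PARTITIONED BY clause.
// Table name and location are placeholders for illustration.
sess.sql("""
  CREATE TABLE x11_unpartitioned
  USING iceberg
  TBLPROPERTIES ('key'='_id.oid')
  LOCATION 's3://bucket/snapshots2/x11-ice-unpartitioned/'
  AS
  SELECT * FROM snapshot
""")
```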
Spark command:
```
spark-shell --deploy-mode client --driver-memory 1g --executor-memory 10g \
  --executor-cores 1 --driver-cores 1 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=1 \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
  --name ravic \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
  --jars /home/hadoop/jars2/spark-1.0-SNAPSHOT.jar
```
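For reference, the Iceberg-related settings from the command above expressed as a programmatic SparkSession. This is a sketch only; our job actually builds the session inside the `Application` helper, and the app name and warehouse path are placeholders taken from the command:
```
import org.apache.spark.sql.SparkSession

// Same Iceberg extension and catalog wiring as the spark-shell flags above.
val spark = SparkSession.builder()
  .appName("ravic")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Route the built-in session catalog through Iceberg, backed by the Hive metastore.
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  // Additional Hadoop-type catalog named `local` (warehouse path is a placeholder).
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/home/hadoop/warehouse")
  .getOrCreate()
```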
Code:
```
import com.x.messagingv2.common.Application
import com.x.messagingv2.utils.SparkUtils
import org.apache.commons.lang3.ClassUtils.getCanonicalName
import java.util
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, hash, lit}

val sess = Application.spark()

// Read the partitioned Parquet snapshot and sort rows within each Spark
// partition by the Iceberg partition column.
val snapshotDf = sess.read.parquet("s3://bucket/snapshots2/x11-partitioned/")
val _snapshotDf = snapshotDf.sortWithinPartitions("__created_date_")
_snapshotDf.createOrReplaceTempView("snapshot")

// CTAS into an Iceberg table partitioned by __created_date_
sess.sql("""
  CREATE TABLE x11
  USING iceberg
  TBLPROPERTIES ('key'='_id.oid')
  LOCATION 's3://bucket/snapshots2/x11-ice2/'
  PARTITIONED BY (__created_date_)
  AS
  SELECT * FROM snapshot
""")
```
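The same write can also be expressed with the DataFrameWriterV2 API. The sketch below additionally repartitions by the partition column before writing, which we have not tried against this dataset; the catalog-qualified table name `local.db.x11` is illustrative:
```
import org.apache.spark.sql.functions.col

// Cluster rows by the Iceberg partition column so each write task covers as
// few partitions as possible, then create the table via DataFrameWriterV2.
// Table name is illustrative; we have not run this variant.
snapshotDf
  .repartition(col("__created_date_"))
  .sortWithinPartitions("__created_date_")
  .writeTo("local.db.x11")
  .using("iceberg")
  .partitionedBy(col("__created_date_"))
  .tableProperty("key", "_id.oid")
  .create()
```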
Error:
```
23/08/16 13:01:29 WARN HiveConf: HiveConf of name hive.server2.thrift.url
does not exist
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.parquet.datetimeRebaseModeInRead' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.datetimeRebaseModeInWrite' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.parquet.datetimeRebaseModeInWrite' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.int96RebaseModeInWrite' has been deprecated in Spark
v3.2 and may be removed in the future. Use
'spark.sql.parquet.int96RebaseModeInWrite' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark
v3.2 and may be removed in the future. Use
'spark.sql.parquet.int96RebaseModeInRead' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.adaptive.coalescePartitions.minPartitionNum' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.adaptive.coalescePartitions.minPartitionSize' instead.
23/08/16 13:01:29 INFO HiveConf: Found configuration file
file:/usr/lib/spark/conf/hive-site.xml
23/08/16 13:01:29 WARN HiveConf: HiveConf of name hive.server2.thrift.url
does not exist
23/08/16 13:01:30 INFO metastore: Trying to connect to metastore with URI
thrift://ip-172-25-26-218.x.x.local:9083
23/08/16 13:01:30 INFO metastore: Opened a connection to metastore, current
connections: 1
23/08/16 13:01:30 INFO metastore: Connected to metastore.
[Stage 2:=====================>                (348 + 5) / 910]
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 11723"...
/usr/lib/spark/bin/spark-shell: line 47: 11723 Killed
"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name
"Spark shell" "$@"
```
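For anyone trying to reproduce: the partitions are skewed (folder sizes from roughly 500 KB up to 1.3 GB). The per-partition row distribution of the source can be inspected with something like the following sketch, which uses the `snapshotDf` from the Code section above:
```
import org.apache.spark.sql.functions.col

// Rough per-partition row counts for the source data, to show how the
// __created_date_ partitions are skewed (sketch; output not attached here).
snapshotDf
  .groupBy(col("__created_date_"))
  .count()
  .orderBy(col("count").desc)
  .show(60, truncate = false)
```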