chandu-1101 opened a new issue, #8335:
URL: https://github.com/apache/iceberg/issues/8335
### Apache Iceberg version
1.3.1 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
The issue does not occur when the Iceberg table is created without partitioning.
Iceberg version: `org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1`
EMR: `6.9.0`
Spark: `3.3.0`
1. On AWS S3 we have Parquet files partitioned by `__created_date_`
(year-month format). The dates range from 2019-02 to 2023-08, and the
partition folder sizes range from 500 KB to 1.3 GB.
2. Each row in the Parquet files is 1 kB to 6 kB.
3. When we tried to ingest this data into the Iceberg table:
   i. with partitioning --> the ingestion fails. The code is below. Tried with
executor memory of 1 G, 3 G, 5 G, and 10 G, each with 1 core, and a 1 G, 1-core driver.
   ii. without partitioning --> the ingestion succeeds with 2 G executor memory,
1 core, and a 1 G, 1-core driver (a sketch of this variant appears after this list).
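For contrast, here is a rough sketch of the unpartitioned variant from 3.ii, reusing the same session and `snapshot` temp view set up in the Code section below. The table name and location in this sketch are placeholders, not the exact ones we used:
```
// Unpartitioned CTAS sketch: same data, no PARTITIONED BY clause.
// Table name and location are placeholders for illustration.
sess.sql("""
  CREATE TABLE x11_unpartitioned
  USING iceberg
  TBLPROPERTIES ('key'='_id.oid')
  LOCATION 's3://bucket/snapshots2/x11-ice-unpartitioned/'
  AS
  SELECT * FROM snapshot
""")
```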
Spark command:
```
spark-shell --deploy-mode client --driver-memory 1g --executor-memory 10g \
  --executor-cores 1 --driver-cores 1 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.minPartitionNum=1 \
  --conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED \
  --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=$PWD/warehouse \
  --conf "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" \
  --name ravic \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
  --jars /home/hadoop/jars2/spark-1.0-SNAPSHOT.jar
```
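For reference, the Iceberg-related settings from the command above expressed as a programmatic SparkSession. This is a sketch only; our job actually builds the session inside the `Application` helper, and the app name and warehouse path are placeholders taken from the command:
```
import org.apache.spark.sql.SparkSession

// Same Iceberg extension and catalog wiring as the spark-shell flags above.
val spark = SparkSession.builder()
  .appName("ravic")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Route the built-in session catalog through Iceberg, backed by the Hive metastore.
  .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  // Additional Hadoop-type catalog named `local` (warehouse path is a placeholder).
  .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.local.type", "hadoop")
  .config("spark.sql.catalog.local.warehouse", "/home/hadoop/warehouse")
  .getOrCreate()
```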
Code:
```
import com.x.messagingv2.common.Application
import com.x.messagingv2.utils.SparkUtils
import org.apache.commons.lang3.ClassUtils.getCanonicalName
import java.util
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, hash, lit}

val sess = Application.spark()

// Read the partitioned Parquet snapshot and sort rows within each Spark
// partition by the Iceberg partition column.
val snapshotDf = sess.read.parquet("s3://bucket/snapshots2/x11-partitioned/")
val _snapshotDf = snapshotDf.sortWithinPartitions("__created_date_")
_snapshotDf.createOrReplaceTempView("snapshot")

// CTAS into an Iceberg table partitioned by __created_date_
sess.sql("""
  CREATE TABLE x11
  USING iceberg
  TBLPROPERTIES ('key'='_id.oid')
  LOCATION 's3://bucket/snapshots2/x11-ice2/'
  PARTITIONED BY (__created_date_)
  AS
  SELECT * FROM snapshot
""")
```
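The same write can also be expressed with the DataFrameWriterV2 API. The sketch below additionally repartitions by the partition column before writing, which we have not tried against this dataset; the catalog-qualified table name `local.db.x11` is illustrative:
```
import org.apache.spark.sql.functions.col

// Cluster rows by the Iceberg partition column so each write task covers as
// few partitions as possible, then create the table via DataFrameWriterV2.
// Table name is illustrative; we have not run this variant.
snapshotDf
  .repartition(col("__created_date_"))
  .sortWithinPartitions("__created_date_")
  .writeTo("local.db.x11")
  .using("iceberg")
  .partitionedBy(col("__created_date_"))
  .tableProperty("key", "_id.oid")
  .create()
```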
Error:
```
23/08/16 13:01:29 WARN HiveConf: HiveConf of name hive.server2.thrift.url
does not exist
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.parquet.datetimeRebaseModeInRead' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.datetimeRebaseModeInWrite' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.parquet.datetimeRebaseModeInWrite' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.int96RebaseModeInWrite' has been deprecated in Spark
v3.2 and may be removed in the future. Use
'spark.sql.parquet.int96RebaseModeInWrite' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.legacy.parquet.int96RebaseModeInRead' has been deprecated in Spark
v3.2 and may be removed in the future. Use
'spark.sql.parquet.int96RebaseModeInRead' instead.
23/08/16 13:01:29 WARN SQLConf: The SQL config
'spark.sql.adaptive.coalescePartitions.minPartitionNum' has been deprecated in
Spark v3.2 and may be removed in the future. Use
'spark.sql.adaptive.coalescePartitions.minPartitionSize' instead.
23/08/16 13:01:29 INFO HiveConf: Found configuration file
file:/usr/lib/spark/conf/hive-site.xml
23/08/16 13:01:29 WARN HiveConf: HiveConf of name hive.server2.thrift.url
does not exist
23/08/16 13:01:30 INFO metastore: Trying to connect to metastore with URI
thrift://ip-172-25-26-218.x.x.local:9083
23/08/16 13:01:30 INFO metastore: Opened a connection to metastore, current
connections: 1
23/08/16 13:01:30 INFO metastore: Connected to metastore.
[Stage 2:=====================>                (348 + 5) / 910]
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 11723"...
/usr/lib/spark/bin/spark-shell: line 47: 11723 Killed
"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name
"Spark shell" "$@"
```
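For anyone trying to reproduce: the partitions are skewed (folder sizes from roughly 500 KB up to 1.3 GB). The per-partition row distribution of the source can be inspected with something like the following sketch, which uses the `snapshotDf` from the Code section above:
```
import org.apache.spark.sql.functions.col

// Rough per-partition row counts for the source data, to show how the
// __created_date_ partitions are skewed (sketch; output not attached here).
snapshotDf
  .groupBy(col("__created_date_"))
  .count()
  .orderBy(col("count").desc)
  .show(60, truncate = false)
```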