rajasinghr opened a new issue, #7832:
URL: https://github.com/apache/iceberg/issues/7832
### Query engine

AWS Glue Spark

Cluster info:
- Environment: AWS Glue
- Worker type: G.1X (4 vCPU, 32 GB RAM)
- Number of workers: 10

Spark config:
```python
("spark.eventLog.compress", "true"),
("spark.eventLog.rolling.enabled", "true"),
(f"spark.sql.catalog.{iceberg_catalog}", "org.apache.iceberg.spark.SparkCatalog"),
(f"spark.sql.catalog.{iceberg_catalog}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"),
(f"spark.sql.catalog.{iceberg_catalog}.glue.skip-archive", "true"),
(f"spark.sql.catalog.{iceberg_catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"),
(f"spark.sql.catalog.{iceberg_catalog}.warehouse", iceberg_warehouse),
("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
("spark.sql.iceberg.handle-timestamp-without-timezone", "true"),
("spark.sql.iceberg.planning.preserve-data-grouping", "true"),
("spark.sql.join.preferSortMergeJoin", "false"),
("spark.sql.sources.partitionOverwriteMode", "dynamic"),
("spark.sql.sources.v2.bucketing.enabled", "true"),
("spark.sql.parquet.enableVectorizedReader", "true"),
("spark.sql.autoBroadcastJoinThreshold", 134217728),
("spark.sql.shuffle.partitions", 255)
```
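
These key/value pairs are applied when the session is built; roughly like the minimal sketch below, where `iceberg_catalog` and `iceberg_warehouse` are placeholder values standing in for the job arguments:

```python
from pyspark.sql import SparkSession

# Placeholder values; the real job derives these from its arguments.
iceberg_catalog = "glue_catalog"
iceberg_warehouse = "s3://example-bucket/warehouse/"

spark_configs = [
    ("spark.eventLog.compress", "true"),
    (f"spark.sql.catalog.{iceberg_catalog}", "org.apache.iceberg.spark.SparkCatalog"),
    (f"spark.sql.catalog.{iceberg_catalog}.warehouse", iceberg_warehouse),
    # ... remaining pairs from the list above ...
]

builder = SparkSession.builder.appName("iceberg-merge")
for key, value in spark_configs:
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```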
Table properties:
```
'write.metadata.delete-after-commit.enabled'='true',
'history.expire.max-snapshot-age-ms'='604800000',
'history.expire.min-snapshots-to-keep'='10',
'commit.manifest.min-count-to-merge'='50',
'format'='parquet',
'format-version'='2',
'write.distribution-mode'='hash'
```
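
For reference, properties like these can be set from Spark SQL; a sketch, where `glue_catalog.db.events` is a placeholder for the actual table name:

```python
# Sketch: setting the table properties above via Spark SQL.
# `glue_catalog.db.events` is a placeholder for the actual table;
# remaining properties from the list above are omitted for brevity.
spark.sql("""
    ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        'write.metadata.delete-after-commit.enabled'='true',
        'history.expire.max-snapshot-age-ms'='604800000',
        'history.expire.min-snapshots-to-keep'='10',
        'commit.manifest.min-count-to-merge'='50',
        'write.distribution-mode'='hash'
    )
""")
```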
### Question

I am running a MERGE between an Iceberg table with ~50M rows and a temp view with ~30k rows; the statement has the general shape sketched below. The merge takes around 40 minutes to complete. To improve its performance I have been trying to enable storage-partitioned joins, but with the settings above the job still plans a shuffled hash join and does a full batch scan of all 50M rows on every run.

How can I avoid the full table scan and read only the affected partitions during the join? And more generally, what can be done to speed up the MERGE command?
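
For concreteness, the statement looks roughly like this (table, view, and column names are placeholders; `event_date` stands in for the table's partition column):

```python
# Placeholder sketch of the MERGE; real table/view/column names differ.
# Including the partition column (`event_date`) in the ON clause is what
# I expected to let the scan prune partitions instead of reading the
# whole 50M-row table.
spark.sql("""
    MERGE INTO glue_catalog.db.events AS t
    USING updates_view AS s
    ON t.id = s.id AND t.event_date = s.event_date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```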
Any help would be appreciated. Thanks!