rajasinghr opened a new issue, #7832:
URL: https://github.com/apache/iceberg/issues/7832
### Query engine

AWS Glue Spark

Cluster info:
- Environment: AWS Glue
- Worker type: G.1X (4 vCPU, 32 GB RAM)
- Number of workers: 10

Spark config:
```python
("spark.eventLog.compress", "true"),
("spark.eventLog.rolling.enabled", "true"),
(f"spark.sql.catalog.{iceberg_catalog}", "org.apache.iceberg.spark.SparkCatalog"),
(f"spark.sql.catalog.{iceberg_catalog}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"),
(f"spark.sql.catalog.{iceberg_catalog}.glue.skip-archive", "true"),
(f"spark.sql.catalog.{iceberg_catalog}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"),
(f"spark.sql.catalog.{iceberg_catalog}.warehouse", iceberg_warehouse),
("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
("spark.sql.iceberg.handle-timestamp-without-timezone", "true"),
("spark.sql.iceberg.planning.preserve-data-grouping", "true"),
("spark.sql.join.preferSortMergeJoin", "false"),
("spark.sql.sources.partitionOverwriteMode", "dynamic"),
("spark.sql.sources.v2.bucketing.enabled", "true"),
("spark.sql.parquet.enableVectorizedReader", "true"),
("spark.sql.autoBroadcastJoinThreshold", 134217728),
("spark.sql.shuffle.partitions", 255)
```
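
These key/value pairs are applied when the session is built; roughly like the minimal sketch below, where `iceberg_catalog` and `iceberg_warehouse` are placeholder values standing in for the job arguments:

```python
from pyspark.sql import SparkSession

# Placeholder values; the real job derives these from its arguments.
iceberg_catalog = "glue_catalog"
iceberg_warehouse = "s3://example-bucket/warehouse/"

spark_configs = [
    ("spark.eventLog.compress", "true"),
    (f"spark.sql.catalog.{iceberg_catalog}", "org.apache.iceberg.spark.SparkCatalog"),
    (f"spark.sql.catalog.{iceberg_catalog}.warehouse", iceberg_warehouse),
    # ... remaining pairs from the list above ...
]

builder = SparkSession.builder.appName("iceberg-merge")
for key, value in spark_configs:
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```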
Table properties:
```
'write.metadata.delete-after-commit.enabled'='true',
'history.expire.max-snapshot-age-ms'='604800000',
'history.expire.min-snapshots-to-keep'='10',
'commit.manifest.min-count-to-merge'='50',
'format'='parquet',
'format-version'='2',
'write.distribution-mode'='hash'
```
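
For reference, properties like these can be set from Spark SQL; a sketch, where `glue_catalog.db.events` is a placeholder for the actual table name:

```python
# Sketch: setting the table properties above via Spark SQL.
# `glue_catalog.db.events` is a placeholder for the actual table;
# remaining properties from the list above are omitted for brevity.
spark.sql("""
    ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
        'write.metadata.delete-after-commit.enabled'='true',
        'history.expire.max-snapshot-age-ms'='604800000',
        'history.expire.min-snapshots-to-keep'='10',
        'commit.manifest.min-count-to-merge'='50',
        'write.distribution-mode'='hash'
    )
""")
```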
### Question

I am running a MERGE between an Iceberg table with ~50M rows and a temp view with ~30k rows; the statement has the general shape sketched below. The merge takes around 40 minutes to complete. To improve its performance I have been trying to enable storage-partitioned joins, but with the settings above the job still plans a shuffled hash join and does a full batch scan of all 50M rows on every run.

How can I avoid the full table scan and read only the affected partitions during the join? And more generally, what can be done to speed up the MERGE command?
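
For concreteness, the statement looks roughly like this (table, view, and column names are placeholders; `event_date` stands in for the table's partition column):

```python
# Placeholder sketch of the MERGE; real table/view/column names differ.
# Including the partition column (`event_date`) in the ON clause is what
# I expected to let the scan prune partitions instead of reading the
# whole 50M-row table.
spark.sql("""
    MERGE INTO glue_catalog.db.events AS t
    USING updates_view AS s
    ON t.id = s.id AND t.event_date = s.event_date
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```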
Any help would be appreciated. Thanks!