pablo-statsig opened a new issue, #7528: URL: https://github.com/apache/incubator-gluten/issues/7528
### Backend

VL (Velox)

### Bug description

Getting the following error when running a job that reads from and writes to an Iceberg table on GCS through Dataproc. It happens every time, with the same error each run. I'm not sure exactly which part is failing, and I'm not sure what other details I can provide to make this easier to debug; let me know if there is anything I can add. I tried this on the 1.2 branch and on the main branch, with the same error both times.

```
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000706fa72ad550, pid=11773, tid=11872
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.20.1+1 (11.0.20.1+1) (build 11.0.20.1+1)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.20.1+1 (11.0.20.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libstdc++.so.6+0xad550]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/2e4a9830-2f94-4f2d-aeb0-7b28280a7434/core.11773)
#
# An error report file with more information is saved as:
# /tmp/2e4a9830-2f94-4f2d-aeb0-7b28280a7434/hs_err_pid11773.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
```

### Spark version

None

### Spark configurations

```
capacity-scheduler:yarn.scheduler.capacity.resource-calculator org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.default.ordering-policy fair
core:fs.gs.block.size 134217728
core:fs.gs.hierarchical.namespace.folders.enable true
core:fs.gs.metadata.cache.enable false
core:hadoop.ssl.enabled.protocols TLSv1,TLSv1.1,TLSv1.2
dataproc:dataproc.logging.stackdriver.job.driver.enable true
dataproc:pip.packages google-cloud-bigquery-biglake==0.4.8
distcp:mapreduce.map.java.opts -Xmx768m
distcp:mapreduce.map.memory.mb 1024
distcp:mapreduce.reduce.java.opts -Xmx768m
distcp:mapreduce.reduce.memory.mb 1024
hadoop-env:HADOOP_DATANODE_OPTS -Xmx512m
hdfs:dfs.datanode.address 0.0.0.0:9866
hdfs:dfs.datanode.http.address 0.0.0.0:9864
hdfs:dfs.datanode.https.address 0.0.0.0:9865
hdfs:dfs.datanode.ipc.address 0.0.0.0:9867
hdfs:dfs.namenode.handler.count 100
hdfs:dfs.namenode.http-address 0.0.0.0:9870
hdfs:dfs.namenode.https-address 0.0.0.0:9871
hdfs:dfs.namenode.lifeline.rpc-address test-maestro-rollup-offerup-0-m:8050
hdfs:dfs.namenode.secondary.http-address 0.0.0.0:9868
hdfs:dfs.namenode.secondary.https-address 0.0.0.0:9869
hdfs:dfs.namenode.service.handler.count 50
hdfs:dfs.namenode.servicerpc-address test-maestro-rollup-offerup-0-m:8051
mapred-env:HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2048
mapred:mapreduce.job.maps 957
mapred:mapreduce.job.reduce.slowstart.completedmaps 0.95
mapred:mapreduce.job.reduces 319
mapred:mapreduce.jobhistory.recovery.store.class org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService
mapred:mapreduce.map.cpu.vcores 1
mapred:mapreduce.map.java.opts -Xmx6293m
mapred:mapreduce.map.memory.mb 7867
mapred:mapreduce.reduce.cpu.vcores 1
mapred:mapreduce.reduce.java.opts -Xmx6293m
mapred:mapreduce.reduce.memory.mb 7867
mapred:mapreduce.task.io.sort.mb 256
mapred:yarn.app.mapreduce.am.command-opts -Xmx6293m
mapred:yarn.app.mapreduce.am.resource.cpu-vcores 1
mapred:yarn.app.mapreduce.am.resource.mb 7867
spark-env:SPARK_DAEMON_MEMORY 2048m
spark:spark.cleaner.periodicGC.interval 15min
spark:spark.decommission.maxRatio 0.4
spark:spark.driver.maxResultSize 4g
spark:spark.driver.memory 4g
spark:spark.dynamicAllocation.cachedExecutorIdleTimeout 120s
spark:spark.dynamicAllocation.enabled false
spark:spark.dynamicAllocation.executorAllocationRatio 0.14
spark:spark.dynamicAllocation.executorIdleTimeout 120s
spark:spark.eventLog.rolling.enabled true
spark:spark.executor.cores 16
spark:spark.executor.instances 20
spark:spark.executor.memory 96000m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS 1
spark:spark.gluten.loadLibFromJar true
spark:spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark:spark.history.fs.gs.outputstream.type FLUSHABLE_COMPOSITE
spark:spark.history.fs.update.interval 20s
spark:spark.jars gs://statsig-spark-lib-multi-region/iceberg-spark-runtime-3.5_2.12-1.5.0.jar,gs://statsig-spark-lib-multi-region/biglake-catalog-iceberg1.5.0-0.1.1-with-dependencies.jar,gs://statsig-spark-lib-multi-region/gluten-velox-bundle-spark3.5_2.12-ubuntu_22.04_x86_64-1.3.0-SNAPSHOT.jar,gs://statsig-spark-lib-multi-region/gluten-thirdparty-lib-ubuntu-22.04-x86_64.jar
spark:spark.memory.offHeap.enabled true
spark:spark.memory.offHeap.size 19g
spark:spark.plugins org.apache.gluten.GlutenPlugin
spark:spark.plugins.defaultList com.google.cloud.dataproc.DataprocSparkPlugin
spark:spark.scheduler.mode FAIR
spark:spark.shuffle.io.maxRetries 8
spark:spark.shuffle.service.enabled false
spark:spark.sql.broadcastTimeout 600
spark:spark.sql.cbo.enabled true
spark:spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark:spark.sql.optimizer.runtime.bloomFilter.join.pattern.enabled true
spark:spark.stage.maxConsecutiveAttempts 8
spark:spark.task.maxFailures 8
spark:spark.ui.port 0
spark:spark.yarn.am.memory 640m
yarn-env:YARN_NODEMANAGER_HEAPSIZE 4000
yarn-env:YARN_RESOURCEMANAGER_HEAPSIZE 2048
yarn-env:YARN_TIMELINESERVER_HEAPSIZE 2048
yarn:yarn.nodemanager.address 0.0.0.0:8026
yarn:yarn.nodemanager.resource.cpu-vcores 16
yarn:yarn.nodemanager.resource.memory-mb 125872
yarn:yarn.resourcemanager.decommissioning-nodes-watcher.decommission-if-no-shuffle-data true
yarn:yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs 86400
yarn:yarn.scheduler.maximum-allocation-mb 125872
yarn:yarn.scheduler.minimum-allocation-mb 1
```
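For anyone trying to reproduce this outside Dataproc, here is a minimal sketch of the session setup implied by the configuration dump above. Only the config keys and values shown in the dump are taken from the report; the app name and the catalog/table names are hypothetical placeholders, and the Iceberg catalog wiring (not included in the dump) is omitted:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the Gluten/Iceberg jars from the report are on the
// classpath (e.g. via --jars at submit time). Config keys/values are copied
// from the dump above; everything else here is a placeholder.
val spark = SparkSession.builder()
  .appName("gluten-iceberg-repro") // hypothetical name
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.gluten.loadLibFromJar", "true")
  .config("spark.memory.offHeap.enabled", "true") // Gluten/Velox runs on off-heap memory
  .config("spark.memory.offHeap.size", "19g")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .getOrCreate()

// Hypothetical read/write mirroring the failing job shape: read one Iceberg
// table on GCS, then append to another. Table names are placeholders.
val df = spark.table("my_catalog.offer.stg_rollup_experiment_metrics")
df.writeTo("my_catalog.offer.some_target_table").append()
```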
### System information

_No response_

### Relevant logs

```bash
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds IS NOT NULL
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds = 20006
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds IS NOT NULL
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds = 20006
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: metric_hash_bucket IN (0, 1, 2, 3, 4, 5)
24/10/14 20:39:08 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics snapshot 6125503530616774233 created at 2024-10-14T14:22:59.924+00:00 with filter (((ds IS NOT NULL AND ds = (5-digit-int)) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)) OR metric_type STARTS WITH (hash-1a3f5a1f)))
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1 partition(s) for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0 snapshot 6525676217521908493 created at 2024-10-14T20:38:26.059+00:00 with filter ((((exposure_hash IS NOT NULL AND metric_hash IS NOT NULL) AND metric_rollup IS NOT NULL) AND company_id IS NOT NULL) AND total_calculation IS NOT NULL)
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1140 partition(s) for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0 snapshot 6525676217521908493 created at 2024-10-14T20:38:26.059+00:00 with filter ((exposure_hash_no_group IS NOT NULL AND metric_hash IS NOT NULL) AND company_id IS NOT NULL)
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1140 partition(s) for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics snapshot 6125503530616774233 created at 2024-10-14T14:22:59.924+00:00 with filter (((((ds IS NOT NULL AND metric_type IS NOT NULL) AND ds = (5-digit-int)) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IS NOT NULL AND metric_type NOT IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)))) AND NOT (metric_type STARTS WITH (hash-1a3f5a1f)))
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1 partition(s) for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0 snapshot 8676148312147584490 created at 2024-10-11T14:45:14.652+00:00 with filter ((((((((((ds IS NOT NULL AND company_id IS NOT NULL) AND metric_type IS NOT NULL) AND ds = (5-digit-int)) AND company_id = (hash-0744313f)) AND ((val IS NOT NULL AND total_calculation = (hash-7dc548bb)) OR (denominator IS NOT NULL AND total_calculation = (hash-641e85bb)))) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IS NOT NULL AND metric_type NOT IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)))) AND NOT (metric_type STARTS WITH (hash-1a3f5a1f))) AND metric_rollup IS NOT NULL) AND total_calculation IS NOT NULL)
24/10/14 20:39:10 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0
24/10/14 20:39:10 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 3600 partition(s) for table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:13 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:13 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:17 WARN DAGScheduler: Broadcasting large task binary with size 1342.3 KiB
24/10/14 20:39:32 INFO RequestTracker: Detected high latency for [url=https://storage.googleapis.com/upload/storage/v1/b/dataproc-temp-us-east4-916951520157-nbvls5dr/o?ifGenerationMatch=0&name=3e18e8c9-fa20-4014-85bc-8f84149cac4f/spark-job-history/eventlog_v2_application_1728938071821_0001/_GHFS_SYNC_TMP_FILE_events_1_application_1728938071821_0001.11.0cc1edbb-37d0-4fc8-99ff-9b4001430b2e&uploadType=resumable&upload_id=AHmUCY0V93SbWXKJPaWbUov30tRPsagIb0FyImXigiW6L86XyLqeZf73pSYsEGje_tiSdZ6euuTF5bK9Jek-0J902Bp4v3SQFGSvzDjcEHRVI-MS_w; invocationId=gccl-invocation-id/3e9e18b2-3656-43d9-8aae-a497c2338eae]. durationMs=249; method=PUT [CONTEXT ratelimit_period="10 SECONDS" ]
24/10/14 20:39:32 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-east4-916951520157-nbvls5dr/3e18e8c9-fa20-4014-85bc-8f84149cac4f/spark-job-history/eventlog_v2_application_1728938071821_0001/events_1_application_1728938071821_0001 [CONTEXT ratelimit_period="1 MINUTES [skipped: 77]" ]
```
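One isolation step that might help narrow this down (my suggestion, not something from the report): rerun the same job with Gluten's columnar execution switched off via `spark.gluten.enabled`, which I understand to be Gluten's session-level on/off switch. If the SIGSEGV disappears, the crash is in the native (Velox) code path rather than in vanilla Spark:

```scala
// Hedged isolation sketch: disable Gluten offload for the session, then rerun
// the failing read/write. Assumes spark.gluten.enabled is honored at runtime
// in this build; if not, pass it as --conf at submit time instead.
spark.conf.set("spark.gluten.enabled", "false")
// ... rerun the failing Iceberg read/write job here in the same session.
```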
