pablo-statsig opened a new issue, #7528: URL: https://github.com/apache/incubator-gluten/issues/7528
### Backend

VL (Velox)

### Bug description

Getting the following error when running a job that reads from and writes to an Iceberg table on GCS through Dataproc. It happens every time, with the same error each run. I'm not sure exactly which part is failing, and I'm not sure what other details I can provide to make this easier to debug; let me know if there is anything I can add. I tried this on the 1.2 branch and on the main branch, with the same error both times.

```
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000706fa72ad550, pid=11773, tid=11872
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.20.1+1 (11.0.20.1+1) (build 11.0.20.1+1)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.20.1+1 (11.0.20.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libstdc++.so.6+0xad550]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /tmp/2e4a9830-2f94-4f2d-aeb0-7b28280a7434/core.11773)
#
# An error report file with more information is saved as:
# /tmp/2e4a9830-2f94-4f2d-aeb0-7b28280a7434/hs_err_pid11773.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
```

### Spark version

None

### Spark configurations

```
capacity-scheduler:yarn.scheduler.capacity.resource-calculator org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
capacity-scheduler:yarn.scheduler.capacity.root.default.ordering-policy fair
core:fs.gs.block.size 134217728
core:fs.gs.hierarchical.namespace.folders.enable true
core:fs.gs.metadata.cache.enable false
core:hadoop.ssl.enabled.protocols TLSv1,TLSv1.1,TLSv1.2
dataproc:dataproc.logging.stackdriver.job.driver.enable true
dataproc:pip.packages google-cloud-bigquery-biglake==0.4.8
distcp:mapreduce.map.java.opts -Xmx768m
distcp:mapreduce.map.memory.mb 1024
distcp:mapreduce.reduce.java.opts -Xmx768m
distcp:mapreduce.reduce.memory.mb 1024
hadoop-env:HADOOP_DATANODE_OPTS -Xmx512m
hdfs:dfs.datanode.address 0.0.0.0:9866
hdfs:dfs.datanode.http.address 0.0.0.0:9864
hdfs:dfs.datanode.https.address 0.0.0.0:9865
hdfs:dfs.datanode.ipc.address 0.0.0.0:9867
hdfs:dfs.namenode.handler.count 100
hdfs:dfs.namenode.http-address 0.0.0.0:9870
hdfs:dfs.namenode.https-address 0.0.0.0:9871
hdfs:dfs.namenode.lifeline.rpc-address test-maestro-rollup-offerup-0-m:8050
hdfs:dfs.namenode.secondary.http-address 0.0.0.0:9868
hdfs:dfs.namenode.secondary.https-address 0.0.0.0:9869
hdfs:dfs.namenode.service.handler.count 50
hdfs:dfs.namenode.servicerpc-address test-maestro-rollup-offerup-0-m:8051
mapred-env:HADOOP_JOB_HISTORYSERVER_HEAPSIZE 2048
mapred:mapreduce.job.maps 957
mapred:mapreduce.job.reduce.slowstart.completedmaps 0.95
mapred:mapreduce.job.reduces 319
mapred:mapreduce.jobhistory.recovery.store.class org.apache.hadoop.mapreduce.v2.hs.HistoryServerLeveldbStateStoreService
mapred:mapreduce.map.cpu.vcores 1
mapred:mapreduce.map.java.opts -Xmx6293m
mapred:mapreduce.map.memory.mb 7867
mapred:mapreduce.reduce.cpu.vcores 1
mapred:mapreduce.reduce.java.opts -Xmx6293m
mapred:mapreduce.reduce.memory.mb 7867
mapred:mapreduce.task.io.sort.mb 256
mapred:yarn.app.mapreduce.am.command-opts -Xmx6293m
mapred:yarn.app.mapreduce.am.resource.cpu-vcores 1
mapred:yarn.app.mapreduce.am.resource.mb 7867
spark-env:SPARK_DAEMON_MEMORY 2048m
spark:spark.cleaner.periodicGC.interval 15min
spark:spark.decommission.maxRatio 0.4
spark:spark.driver.maxResultSize 4g
spark:spark.driver.memory 4g
spark:spark.dynamicAllocation.cachedExecutorIdleTimeout 120s
spark:spark.dynamicAllocation.enabled false
spark:spark.dynamicAllocation.executorAllocationRatio 0.14
spark:spark.dynamicAllocation.executorIdleTimeout 120s
spark:spark.eventLog.rolling.enabled true
spark:spark.executor.cores 16
spark:spark.executor.instances 20
spark:spark.executor.memory 96000m
spark:spark.executorEnv.OPENBLAS_NUM_THREADS 1
spark:spark.gluten.loadLibFromJar true
spark:spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark:spark.history.fs.gs.outputstream.type FLUSHABLE_COMPOSITE
spark:spark.history.fs.update.interval 20s
spark:spark.jars gs://statsig-spark-lib-multi-region/iceberg-spark-runtime-3.5_2.12-1.5.0.jar,gs://statsig-spark-lib-multi-region/biglake-catalog-iceberg1.5.0-0.1.1-with-dependencies.jar,gs://statsig-spark-lib-multi-region/gluten-velox-bundle-spark3.5_2.12-ubuntu_22.04_x86_64-1.3.0-SNAPSHOT.jar,gs://statsig-spark-lib-multi-region/gluten-thirdparty-lib-ubuntu-22.04-x86_64.jar
spark:spark.memory.offHeap.enabled true
spark:spark.memory.offHeap.size 19g
spark:spark.plugins org.apache.gluten.GlutenPlugin
spark:spark.plugins.defaultList com.google.cloud.dataproc.DataprocSparkPlugin
spark:spark.scheduler.mode FAIR
spark:spark.shuffle.io.maxRetries 8
spark:spark.shuffle.service.enabled false
spark:spark.sql.broadcastTimeout 600
spark:spark.sql.cbo.enabled true
spark:spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark:spark.sql.optimizer.runtime.bloomFilter.join.pattern.enabled true
spark:spark.stage.maxConsecutiveAttempts 8
spark:spark.task.maxFailures 8
spark:spark.ui.port 0
spark:spark.yarn.am.memory 640m
yarn-env:YARN_NODEMANAGER_HEAPSIZE 4000
yarn-env:YARN_RESOURCEMANAGER_HEAPSIZE 2048
yarn-env:YARN_TIMELINESERVER_HEAPSIZE 2048
yarn:yarn.nodemanager.address 0.0.0.0:8026
yarn:yarn.nodemanager.resource.cpu-vcores 16
yarn:yarn.nodemanager.resource.memory-mb 125872
yarn:yarn.resourcemanager.decommissioning-nodes-watcher.decommission-if-no-shuffle-data true
yarn:yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs 86400
yarn:yarn.scheduler.maximum-allocation-mb 125872
yarn:yarn.scheduler.minimum-allocation-mb 1
```
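For anyone trying to reproduce this outside Dataproc, here is a minimal sketch of the session setup implied by the configuration dump above. Only the config keys and values shown in the dump are taken from the report; the app name and the catalog/table names are hypothetical placeholders, and the Iceberg catalog wiring (not included in the dump) is omitted:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the Gluten/Iceberg jars from the report are on the
// classpath (e.g. via --jars at submit time). Config keys/values are copied
// from the dump above; everything else here is a placeholder.
val spark = SparkSession.builder()
  .appName("gluten-iceberg-repro") // hypothetical name
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.gluten.loadLibFromJar", "true")
  .config("spark.memory.offHeap.enabled", "true") // Gluten/Velox runs on off-heap memory
  .config("spark.memory.offHeap.size", "19g")
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .getOrCreate()

// Hypothetical read/write mirroring the failing job shape: read one Iceberg
// table on GCS, then append to another. Table names are placeholders.
val df = spark.table("my_catalog.offer.stg_rollup_experiment_metrics")
df.writeTo("my_catalog.offer.some_target_table").append()
```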
### System information

_No response_

### Relevant logs

```bash
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds IS NOT NULL
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds = 20006
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds IS NOT NULL
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: ds = 20006
24/10/14 20:39:08 INFO SparkScanBuilder: Evaluating completely on Iceberg side: metric_hash_bucket IN (0, 1, 2, 3, 4, 5)
24/10/14 20:39:08 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics snapshot 6125503530616774233 created at 2024-10-14T14:22:59.924+00:00 with filter (((ds IS NOT NULL AND ds = (5-digit-int)) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)) OR metric_type STARTS WITH (hash-1a3f5a1f)))
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1 partition(s) for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0 snapshot 6525676217521908493 created at 2024-10-14T20:38:26.059+00:00 with filter ((((exposure_hash IS NOT NULL AND metric_hash IS NOT NULL) AND metric_rollup IS NOT NULL) AND company_id IS NOT NULL) AND total_calculation IS NOT NULL)
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1140 partition(s) for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0 snapshot 6525676217521908493 created at 2024-10-14T20:38:26.059+00:00 with filter ((exposure_hash_no_group IS NOT NULL AND metric_hash IS NOT NULL) AND company_id IS NOT NULL)
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1140 partition(s) for table icefireonni_prod_us_east4.offer.temp_rollup_merged_aggregated_rollup_20241010_0
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics snapshot 6125503530616774233 created at 2024-10-14T14:22:59.924+00:00 with filter (((((ds IS NOT NULL AND metric_type IS NOT NULL) AND ds = (5-digit-int)) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IS NOT NULL AND metric_type NOT IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)))) AND NOT (metric_type STARTS WITH (hash-1a3f5a1f)))
24/10/14 20:39:09 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 1 partition(s) for table icefireonni_prod_us_east4.offer.stg_rollup_experiment_metrics
24/10/14 20:39:09 INFO SnapshotScan: Scanning table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0 snapshot 8676148312147584490 created at 2024-10-11T14:45:14.652+00:00 with filter ((((((((((ds IS NOT NULL AND company_id IS NOT NULL) AND metric_type IS NOT NULL) AND ds = (5-digit-int)) AND company_id = (hash-0744313f)) AND ((val IS NOT NULL AND total_calculation = (hash-7dc548bb)) OR (denominator IS NOT NULL AND total_calculation = (hash-641e85bb)))) AND metric_hash_bucket IN ((1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int), (1-digit-int))) AND (metric_type IS NOT NULL AND metric_type NOT IN ((hash-697c954f), (hash-234c9bed), (hash-7dc548bb), (hash-60702892)))) AND NOT (metric_type STARTS WITH (hash-1a3f5a1f))) AND metric_rollup IS NOT NULL) AND total_calculation IS NOT NULL)
24/10/14 20:39:10 INFO BaseDistributedDataScan: Planning file tasks locally for table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0
24/10/14 20:39:10 INFO SparkPartitioningAwareScan: Reporting UnknownPartitioning with 3600 partition(s) for table icefireonni_prod_us_east4.offer.dim_maestro_staging_merge_20241010_0
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:12 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:13 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:13 WARN GlutenFallbackReporter: Validation failed for plan: Exchange[QueryId=8], due to: [FallbackByBackendSettings] Validation failed on node Exchange.
24/10/14 20:39:17 WARN DAGScheduler: Broadcasting large task binary with size 1342.3 KiB
24/10/14 20:39:32 INFO RequestTracker: Detected high latency for [url=https://storage.googleapis.com/upload/storage/v1/b/dataproc-temp-us-east4-916951520157-nbvls5dr/o?ifGenerationMatch=0&name=3e18e8c9-fa20-4014-85bc-8f84149cac4f/spark-job-history/eventlog_v2_application_1728938071821_0001/_GHFS_SYNC_TMP_FILE_events_1_application_1728938071821_0001.11.0cc1edbb-37d0-4fc8-99ff-9b4001430b2e&uploadType=resumable&upload_id=AHmUCY0V93SbWXKJPaWbUov30tRPsagIb0FyImXigiW6L86XyLqeZf73pSYsEGje_tiSdZ6euuTF5bK9Jek-0J902Bp4v3SQFGSvzDjcEHRVI-MS_w; invocationId=gccl-invocation-id/3e9e18b2-3656-43d9-8aae-a497c2338eae]. durationMs=249; method=PUT [CONTEXT ratelimit_period="10 SECONDS" ]
24/10/14 20:39:32 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-us-east4-916951520157-nbvls5dr/3e18e8c9-fa20-4014-85bc-8f84149cac4f/spark-job-history/eventlog_v2_application_1728938071821_0001/events_1_application_1728938071821_0001 [CONTEXT ratelimit_period="1 MINUTES [skipped: 77]" ]
```
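One isolation step that might help narrow this down (my suggestion, not something from the report): rerun the same job with Gluten's columnar execution switched off via `spark.gluten.enabled`, which I understand to be Gluten's session-level on/off switch. If the SIGSEGV disappears, the crash is in the native (Velox) code path rather than in vanilla Spark:

```scala
// Hedged isolation sketch: disable Gluten offload for the session, then rerun
// the failing read/write. Assumes spark.gluten.enabled is honored at runtime
// in this build; if not, pass it as --conf at submit time instead.
spark.conf.set("spark.gluten.enabled", "false")
// ... rerun the failing Iceberg read/write job here in the same session.
```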
