vitaliikalmykov opened a new issue, #15602:
URL: https://github.com/apache/iceberg/issues/15602

   ### Apache Iceberg version
   
   1.10.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Hello.
   A Spark job fails during a MERGE query when SPJ (storage-partitioned join) is enabled.
   Merge query:
   ```python
   spark.sql("""
   merge into lines_actions t
   using staging.lines_actions s
   on t.created >= '2026-03-10'
      and t.filid = s.filid
      and t.actiontype = s.actiontype
      and t.actionid = s.actionid
   when matched and t.__source_ts_ms < s.__source_ts_ms
   then update set
      created = s.created, discpercent = s.discpercent, discount = s.discount,
      varchardata = s.varchardata, __deleted = s.__deleted, __op = s.__op,
      __table = s.__table, __source_ts_ms = s.__source_ts_ms,
      __topic = s.__topic, __schema_id = s.__schema_id
   when not matched
   then insert
      (filid, created, actiontype, actionid, discpercent, discount, varchardata,
       __deleted, __op, __table, __source_ts_ms, __topic, __schema_id)
   values
      (s.filid, s.created, s.actiontype, s.actionid, s.discpercent, s.discount,
       s.varchardata, s.__deleted, s.__op, s.__table, s.__source_ts_ms,
       s.__topic, s.__schema_id)
   """)
   ```
   Table DDL:
   ```sql
   CREATE TABLE lines_actions (
      filid integer,
      created timestamp,
      actiontype integer,
      actionid integer,
      discpercent decimal(9, 3),
      discount decimal(9, 2),
      varchardata varchar,
      __deleted varchar,
      __op varchar,
      __table varchar,
      __source_ts_ms bigint,
      __topic varchar,
      __schema_id integer
   )
   WITH (
      partitioning = ARRAY['month(created)'],
      sorted_by = ARRAY['created ASC NULLS FIRST']
   );
   ```
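
   The DDL above is Trino-style (`WITH (partitioning = ...)`). Since the query engine in this report is Spark, a rough Spark SQL equivalent — an untested sketch, assuming the Iceberg Spark runtime and SQL extensions are configured — would be:

   ```sql
   -- Sketch of a Spark SQL equivalent of the Trino DDL above (untested).
   CREATE TABLE lines_actions (
      filid INT,
      created TIMESTAMP,
      actiontype INT,
      actionid INT,
      discpercent DECIMAL(9, 3),
      discount DECIMAL(9, 2),
      varchardata STRING,
      __deleted STRING,
      __op STRING,
      __table STRING,
      __source_ts_ms BIGINT,
      __topic STRING,
      __schema_id INT
   )
   USING iceberg
   PARTITIONED BY (months(created));

   -- The sort order is set separately in Spark (requires Iceberg SQL extensions):
   ALTER TABLE lines_actions WRITE ORDERED BY created ASC NULLS FIRST;
   ```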
   
   SPJ config:
   ```
   "spark.sql.sources.v2.bucketing.enabled": "true",
   "spark.sql.sources.v2.bucketing.pushPartValues.enabled": "true",
   "spark.sql.iceberg.planning.preserve-data-grouping": "true",
   "spark.sql.requireAllClusterKeysForCoPartition": "false",
   "spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled": "true",
   "spark.sql.join.preferSortMergeJoin": "false",
   "spark.sql.autoBroadcastJoinThreshold": "-1",
   "spark.sql.adaptive.enabled": "false"
   ```
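
The same settings collected programmatically — a sketch only (key names and values copied from this report), rendered as `--conf` flags that could be passed to `spark-submit` when reproducing:

```python
# SPJ-related settings from this report, rendered as spark-submit --conf flags.
# Key names and values are copied verbatim from the issue; nothing Spark-specific
# is executed here.
spj_confs = {
    "spark.sql.sources.v2.bucketing.enabled": "true",
    "spark.sql.sources.v2.bucketing.pushPartValues.enabled": "true",
    "spark.sql.iceberg.planning.preserve-data-grouping": "true",
    "spark.sql.requireAllClusterKeysForCoPartition": "false",
    "spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled": "true",
    "spark.sql.join.preferSortMergeJoin": "false",
    "spark.sql.autoBroadcastJoinThreshold": "-1",
    "spark.sql.adaptive.enabled": "false",
}

conf_args = [f"--conf {key}={value}" for key, value in spj_confs.items()]
print(" \\\n  ".join(conf_args))
```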
   
   Error message:
   ```
   java.lang.UnsupportedOperationException: Can't retrieve values from an empty struct
     at org.apache.iceberg.EmptyStructLike.get(EmptyStructLike.java:40)
     at org.apache.iceberg.spark.source.StructInternalRow.isNullAt(StructInternalRow.java:104)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering.compare(Unknown Source)
     at org.apache.spark.sql.catalyst.util.InternalRowComparableWrapper.equals(InternalRowComparableWrapper.scala:51)
     at scala.runtime.BoxesRunTime.equals2(BoxesRunTime.java:137)
     at scala.runtime.BoxesRunTime.equals(BoxesRunTime.java:123)
     at scala.collection.immutable.Map$Map1.getOrElse(Map.scala:168)
     at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.$anonfun$inputRDD$14(BatchScanExec.scala:218)
     at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
     at scala.collection.AbstractTraversable.map(Traversable.scala:108)
     at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD$lzycompute(BatchScanExec.scala:215)
     at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.inputRDD(BatchScanExec.scala:143)
     at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExecBase.doExecuteColumnar(DataSourceV2ScanExecBase.scala:214)
     at org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanExecBase.doExecuteColumnar$(DataSourceV2ScanExecBase.scala:212)
     at org.apache.spark.sql.execution.datasources.v2.BatchScanExec.doExecuteColumnar(BatchScanExec.scala:41)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:255)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:305)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:280)
     at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:251)
     at org.apache.spark.sql.execution.InputAdapter.doExecuteColumnar(WholeStageCodegenExec.scala:678)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:255)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:305)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:280)
     at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:251)
     at org.apache.spark.sql.execution.ColumnarToRowExec.inputRDDs(Columnar.scala:432)
     at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:313)
     at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:992)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:228)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:305)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:280)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:224)
     at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:434)
     at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:534)
     at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.org$apache$spark$sql$execution$exchange$BroadcastExchangeExec$$doComputeRelation(BroadcastExchangeExec.scala:198)
     at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:191)
     at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anon$1.doCompute(BroadcastExchangeExec.scala:184)
     at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$compute$1(AsyncDriverOperation.scala:75)
     at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
     at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:391)
     at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:383)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withExecutionId$1(SQLExecution.scala:366)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:412)
     at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:363)
     at org.apache.spark.sql.execution.AsyncDriverOperation.compute(AsyncDriverOperation.scala:69)
     at org.apache.spark.sql.execution.AsyncDriverOperation.$anonfun$computeFuture$1(AsyncDriverOperation.scala:55)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$2(SQLExecution.scala:435)
     at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:430)
     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
     at java.base/java.lang.Thread.run(Thread.java:840)
     at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.checkNoFailures(AdaptiveExecutor.scala:175)
     at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.doRun(AdaptiveExecutor.scala:97)
     at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.tryRunningAndGetFuture(AdaptiveExecutor.scala:75)
     at org.apache.spark.sql.execution.adaptive.AdaptiveExecutor.execute(AdaptiveExecutor.scala:59)
     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:304)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:996)
     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:303)
     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:610)
     at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecute(AdaptiveSparkPlanExec.scala:596)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:228)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:305)
     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:152)
     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:280)
     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:224)
     at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:1326)
     at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:1324)
     at org.apache.spark.sql.execution.datasources.v2.ReplaceDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:1236)
     at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:1302)
     at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:1301)
     at org.apache.spark.sql.execution.datasources.v2.ReplaceDataExec.run(WriteToDataSourceV2Exec.scala:1236)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
     at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
     ... 46 elided
   ```
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

