cccs-eric opened a new issue, #5719:
URL: https://github.com/apache/iceberg/issues/5719
### Apache Iceberg version
0.14.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
I have a Spark job that ran flawlessly until I upgraded my stack to Spark
3.3.0/Iceberg 0.14.0. Since the upgrade, it fails with
`org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException:
java.io.IOException: Invalid sync!` roughly 80% of the time. The Spark job does
the following:
1. It starts by reading a bunch of files from a location
2. It processes those files and writes the output to an Iceberg table
partitioned by day and a boolean attribute. The job runs daily, so to be able
to reprocess a past day, it writes to the table using `overwritePartitions`
(a sketch of this workflow follows the commands below).
3. Once writing is done, it calls
```
# partition_date is the day being processed
CALL {catalog}.system.rewrite_data_files(
  table => '{namespace}.{table}',
  options => map('target-file-size-bytes', '54525952'),
  where => 'timeperiod_loadedBy = "{partition_date.strftime("%Y-%m-%d")}"'
)
```
and then
```
CALL {catalog}.system.expire_snapshots(
  table => '{namespace}.{table}',
  retain_last => 1,
  older_than => TIMESTAMP '{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
  max_concurrent_deletes => 10,
  stream_results => true)
```
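For context, here is a minimal PySpark sketch of that daily workflow: the
`overwritePartitions` write from step 2 followed by the two maintenance
procedures from step 3. The catalog, namespace, table, and source path are
placeholders, not the real job's names:
```
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder names -- the real catalog/namespace/table/source differ.
catalog = "my_catalog"
namespace = "my_namespace"
table = "my_table"
partition_date = datetime(2022, 9, 6)  # the day being (re)processed

day_str = partition_date.strftime("%Y-%m-%d")
now_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Step 2: write the day's output, replacing only the partitions it touches.
# `df` stands in for the processed output of step 1.
df = spark.read.parquet("/path/to/processed/output")  # placeholder source
df.writeTo(f"{catalog}.{namespace}.{table}").overwritePartitions()

# Step 3a: compact the day's files toward the ~52 MB target size.
spark.sql(f"""
    CALL {catalog}.system.rewrite_data_files(
      table => '{namespace}.{table}',
      options => map('target-file-size-bytes', '54525952'),
      where => 'timeperiod_loadedBy = "{day_str}"'
    )
""")

# Step 3b: expire everything but the latest snapshot.
spark.sql(f"""
    CALL {catalog}.system.expire_snapshots(
      table => '{namespace}.{table}',
      retain_last => 1,
      older_than => TIMESTAMP '{now_str}',
      max_concurrent_deletes => 10,
      stream_results => true)
""")
```
In the actual job these three steps run back-to-back once per day, and it is
the final `expire_snapshots` call that fails.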
The error happens during `expire_snapshots`, in a `broadcast exchange` stage.
I'm attaching a screenshot of a successful run for reference:
<img width="1898" alt="stages"
src="https://user-images.githubusercontent.com/56447460/188884274-3e8ee4fc-75ed-4795-a3e3-454d44de9ec2.png">
Since the job overwrites the partition, I can re-run it when it fails and it
eventually succeeds. Could this be related to a timing/concurrency issue?
Here is the stack trace:
```
22/09/06 13:14:19 WARN TaskSetManager: Lost task 370.0 in stage 7.0 (TID 51769) (10.244.104.4 executor 4): org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:224)
    at org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:192)
    at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
    at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
    at org.apache.iceberg.io.CloseableIterator$2.hasNext(CloseableIterator.java:71)
    at org.apache.iceberg.io.ClosingIterator.hasNext(ClosingIterator.java:40)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Invalid sync!
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:319)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:213)
    ... 27 more
```