cccs-eric opened a new issue, #5719:
URL: https://github.com/apache/iceberg/issues/5719
### Apache Iceberg version
0.14.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
I have a Spark job that ran flawlessly until I upgraded my stack to Spark
3.3.0/Iceberg 0.14.0. Since the upgrade, it fails with
`org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException:
java.io.IOException: Invalid sync!` roughly 80% of the time. The Spark job does
the following:
1. It starts by reading a bunch of files from a location
2. It processes those files and writes the output to an Iceberg table
partitioned by day and a boolean attribute. The job runs daily, so to be able
to reprocess a past day, it writes to the table using `overwritePartitions`
(a sketch of this workflow follows the commands below).
3. Once writing is done, it calls
```
# partition_date is the day being processed
CALL {catalog}.system.rewrite_data_files(
  table => '{namespace}.{table}',
  options => map('target-file-size-bytes', '54525952'),
  where => 'timeperiod_loadedBy = "{partition_date.strftime("%Y-%m-%d")}"'
)
```
and then
```
CALL {catalog}.system.expire_snapshots(
  table => '{namespace}.{table}',
  retain_last => 1,
  older_than => TIMESTAMP '{datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
  max_concurrent_deletes => 10,
  stream_results => true)
```
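For context, here is a minimal PySpark sketch of that daily workflow: the
`overwritePartitions` write from step 2 followed by the two maintenance
procedures from step 3. The catalog, namespace, table, and source path are
placeholders, not the real job's names:
```
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder names -- the real catalog/namespace/table/source differ.
catalog = "my_catalog"
namespace = "my_namespace"
table = "my_table"
partition_date = datetime(2022, 9, 6)  # the day being (re)processed

day_str = partition_date.strftime("%Y-%m-%d")
now_str = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Step 2: write the day's output, replacing only the partitions it touches.
# `df` stands in for the processed output of step 1.
df = spark.read.parquet("/path/to/processed/output")  # placeholder source
df.writeTo(f"{catalog}.{namespace}.{table}").overwritePartitions()

# Step 3a: compact the day's files toward the ~52 MB target size.
spark.sql(f"""
    CALL {catalog}.system.rewrite_data_files(
      table => '{namespace}.{table}',
      options => map('target-file-size-bytes', '54525952'),
      where => 'timeperiod_loadedBy = "{day_str}"'
    )
""")

# Step 3b: expire everything but the latest snapshot.
spark.sql(f"""
    CALL {catalog}.system.expire_snapshots(
      table => '{namespace}.{table}',
      retain_last => 1,
      older_than => TIMESTAMP '{now_str}',
      max_concurrent_deletes => 10,
      stream_results => true)
""")
```
In the actual job these three steps run back-to-back once per day, and it is
the final `expire_snapshots` call that fails.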
The error happens during `expire_snapshots`, in a `broadcast exchange` stage.
I'm attaching a screenshot of a successful run for reference:
<img width="1898" alt="stages"
src="https://user-images.githubusercontent.com/56447460/188884274-3e8ee4fc-75ed-4795-a3e3-454d44de9ec2.png">
Since the job overwrites the partition, I can re-run it when it fails and it
eventually succeeds. Could this be related to a timing/concurrency issue?
Here is the stack trace:
```
22/09/06 13:14:19 WARN TaskSetManager: Lost task 370.0 in stage 7.0 (TID 51769) (10.244.104.4 executor 4): org.apache.iceberg.shaded.org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:224)
    at org.apache.iceberg.avro.AvroIterable$AvroReuseIterator.hasNext(AvroIterable.java:192)
    at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
    at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
    at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:50)
    at org.apache.iceberg.io.CloseableIterable$4$1.hasNext(CloseableIterable.java:108)
    at org.apache.iceberg.io.CloseableIterator$2.hasNext(CloseableIterator.java:71)
    at org.apache.iceberg.io.ClosingIterator.hasNext(ClosingIterator.java:40)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Invalid sync!
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.nextRawBlock(DataFileStream.java:319)
    at org.apache.iceberg.shaded.org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:213)
    ... 27 more
```