palladium-coder opened a new issue, #16640:
URL: https://github.com/apache/iceberg/issues/16640
### Apache Iceberg version
1.11.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
Hello,
We are ingesting files into Azure via Iceberg + Spark. We have disabled the
Hadoop filesystem cache via the config `fs.abfs.impl.disable.cache`=`true`. During
ingestion, the job was getting stuck with the error below:
```log
[Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR
org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit
task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
```
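For readers unfamiliar with the message format: the bracketed part of the error is the `toString()` of a plain JDK `ThreadPoolExecutor`, and `Terminated` is the state it reports once shutdown has completed. The sketch below uses only the JDK, with no Hadoop or Iceberg classes, to show that state and the rejection that follows any later submit. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, JDK only; not part of Hadoop or Iceberg.
class TerminatedExecutorDemo {

    /** Shuts down an unused pool and returns its toString(), which includes the run state. */
    static String describeAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.shutdown();
        try {
            // No tasks were ever submitted, so termination completes right away.
            pool.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.toString();
    }

    /** Returns true if a submit after shutdown is rejected by the pool. */
    static boolean submitIsRejected() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.shutdown();
        try {
            pool.submit(() -> { });
            return false;
        } catch (RejectedExecutionException e) {
            // The default rejection policy throws once the pool is shut down.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeAfterShutdown());
        System.out.println("rejected = " + submitIsRejected());
    }
}
```

The zero counts in the error (`pool size = 0`, `completed tasks = 0`) match what a pool reports when it is shut down before running anything.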
Looking at the debug logs, I see that `finalize()` was called on
`AzureBlobFileSystem` (indicating the fs was garbage collected), and Parquet
then attempted to write using the finalized fs:
```log
18:18:20.127 [Finalizer] DEBUG
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem - finalize() called.
18:18:20.128 [Finalizer] DEBUG
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore - Gracefully shutting
down executor service
BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=48,
available=48, waiting=0}, activeCount=0}. Waiting max 30 SECONDS
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4:
write data pages
18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4:
write data pages content
18:18:20.210 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55:
end column
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55:
write data pages
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55:
write data pages content
18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100:
end column
18:18:20.219 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100:
end block
18:18:20.220 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100:
column indexes
18:18:20.231 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 148:
offset indexes
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171:
bloom filters
18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171:
end
18:18:20.418 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 641:
footer length = 470
18:18:20.421 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)]
ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not
submit task to executor
java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0,
active threads = 0, queued tasks = 0, completed tasks = 0]
```
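One way to read the sequence in these logs, sketched as a toy model: the file system object owns the upload thread pool, the output stream only borrows that pool, and once the owner is shut down any later write from the stream is rejected. Everything below is hypothetical (`ToyFileSystem`, `ToyStream`, and the method names are not Hadoop or Iceberg classes), and an explicit `close()` stands in for the `finalize()` call seen in the log. Whether the real ABFS stream in hadoop-azure 3.3.6 holds no strong reference back to its `AzureBlobFileSystem` is an assumption here, not something the logs prove.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Toy model only; all names are hypothetical and unrelated to the real ABFS classes.
class FsLifetimeSketch {

    /** Stand-in for a file system that owns a thread pool used for uploads. */
    static final class ToyFileSystem implements Closeable {
        private final ExecutorService uploads = Executors.newFixedThreadPool(2);

        ToyStream create() {
            // Assumption: the stream receives the pool but no reference to its owner,
            // so holding the stream does not keep the owner reachable.
            return new ToyStream(uploads);
        }

        @Override
        public void close() {
            uploads.shutdown();
        }
    }

    /** Stand-in for an output stream that hands blocks to the owner's pool. */
    static final class ToyStream {
        private final ExecutorService uploads;

        ToyStream(ExecutorService uploads) {
            this.uploads = uploads;
        }

        /** Returns true if the block was accepted, false if the pool rejected it. */
        boolean writeBlock(byte[] block) {
            try {
                uploads.submit(() -> { /* a real stream would upload the block here */ });
                return true;
            } catch (RejectedExecutionException e) {
                return false;
            }
        }
    }

    /** The owner is shut down first (standing in for finalize()), then the stream writes. */
    static boolean writeAfterOwnerClosed() {
        ToyFileSystem fs = new ToyFileSystem();
        ToyStream out = fs.create();
        fs.close();
        return out.writeBlock(new byte[16]);
    }

    /** The stream writes while the owner is still open, then the owner is closed. */
    static boolean writeWhileOwnerOpen() {
        ToyFileSystem fs = new ToyFileSystem();
        ToyStream out = fs.create();
        boolean accepted = out.writeBlock(new byte[16]);
        fs.close();
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("write after owner closed accepted: " + writeAfterOwnerClosed());
        System.out.println("write while owner open accepted:   " + writeWhileOwnerOpen());
    }
}
```

In this model the write succeeds only while the owner is still open, which is the ordering problem the timestamps above suggest: the shutdown at 18:18:20.128 precedes the failed submit at 18:18:20.421.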
A minimal code example to reproduce the issue is below. To get the above error
we had to trigger the GC often, so it has to be run with the JVM args
`-Xmx480m -XX:G1ReservePercent=50`.
```java
import org.apache.spark.sql.SparkSession;

var spark = SparkSession.builder()
    .master("local[*]")
    .appName("test")
    // Iceberg Hadoop catalog with its warehouse on ABFS
    .config("spark.sql.catalog.spark_catalog",
        "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hadoop")
    .config("spark.sql.catalog.spark_catalog.warehouse",
        "abfs://<container>@<storage>.dfs.core.windows.net")
    // setup rest of azure secrets for the storage account
    .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.log.level", "debug")
    // disable the Hadoop FileSystem cache for the abfs scheme
    .config("fs.abfs.impl.disable.cache", "true")
    .getOrCreate();

spark.sql("CREATE TABLE test_table_x (id LONG, data STRING) " +
    "USING iceberg " +
    "TBLPROPERTIES (" +
    " 'write.format.default'='parquet'" +
    ")");

// Repeat small inserts so that several data files get written under GC pressure
for (int i = 0; i < 20; i++) {
    spark.sql("INSERT INTO test_table_x VALUES (1, 'a'), (2, 'b'), (3, 'c'), " +
        "(4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h'), (9, 'i'), (10, 'j')");
}
```
The dependencies used are listed below:
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.8</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.8</version>
  </dependency>
  <dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-spark-runtime-3.5_2.12</artifactId>
    <version>1.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-azure</artifactId>
    <version>3.3.6</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.6</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client-api</artifactId>
    <version>3.3.6</version>
  </dependency>
</dependencies>
```
Java version: 17.0.18
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]