[I] [Spark] parquet ingestion to azure gets stuck when hadoop fs cache is disabled [iceberg]

via GitHub Sun, 31 May 2026 12:18:24 -0700


palladium-coder opened a new issue, #16640:
URL: https://github.com/apache/iceberg/issues/16640


   ### Apache Iceberg version
   
   1.11.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Hello
   
   We are ingesting files into Azure via Iceberg and Spark, with the Hadoop filesystem cache disabled through the config `fs.abfs.impl.disable.cache`=`true`. During ingestion, the job gets stuck with the error below:
   
   ```log
   [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR 
org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit 
task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, 
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
   ```
   
   
   Looking at the debug logs, I see that `finalize()` was called on `AzureBlobFileSystem` (indicating that the filesystem instance was garbage collected), after which Parquet attempted to write through the finalized filesystem:
   ```log
   18:18:20.127 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem - finalize() called.
   18:18:20.128 [Finalizer] DEBUG org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore - Gracefully shutting down executor service BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=48, available=48, waiting=0}, activeCount=0}. Waiting max 30 SECONDS
   18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages
   18:18:20.209 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 4: write data pages content
   18:18:20.210 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: end column
   18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages
   18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 55: write data pages content
   18:18:20.213 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end column
   18:18:20.219 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: end block
   18:18:20.220 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 100: column indexes
   18:18:20.231 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 148: offset indexes
   18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: bloom filters
   18:18:20.234 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 171: end
   18:18:20.418 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] DEBUG org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter - 641: footer length = 470
   18:18:20.421 [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] ERROR org.apache.hadoop.util.BlockingThreadPoolExecutorService - Could not submit task to executor java.util.concurrent.ThreadPoolExecutor@24ad434d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
   ``` 
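
To illustrate the failure mode the logs above seem to point at, here is a minimal sketch that uses only `java.util.concurrent`. All names in it (`FinalizedOwnerDemo`, `Owner`, `Stream`) are hypothetical stand-ins and are not Hadoop or Iceberg classes. `Owner` plays the role of the filesystem that owns a thread pool, `Stream` plays the role of an output stream that keeps using that pool, and calling `close()` explicitly stands in for the `finalize()` call seen in the logs. It is an assumption on my side that the real stream fails for the same reason; the sketch only shows the task rejection, not the hang.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class FinalizedOwnerDemo {

    // Hypothetical stand-in for a filesystem object that owns a thread pool.
    static final class Owner implements AutoCloseable {
        private final ExecutorService pool = Executors.newFixedThreadPool(2);

        // The returned stream shares the pool but holds no reference to the owner.
        Stream open() {
            return new Stream(pool);
        }

        // Stands in for finalize(): shuts the shared pool down.
        @Override
        public void close() {
            pool.shutdownNow();
        }
    }

    // Hypothetical stand-in for an output stream that uploads via the owner's pool.
    static final class Stream {
        private final ExecutorService pool;

        Stream(ExecutorService pool) {
            this.pool = pool;
        }

        // Returns true if the "upload" task ran, false if the pool rejected it.
        boolean write(byte[] data) {
            Callable<Integer> upload = () -> data.length;
            try {
                pool.submit(upload).get();
                return true;
            } catch (RejectedExecutionException e) {
                return false; // the pool was already shut down
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    // Writes once while the owner is alive and once after it has been closed.
    static boolean[] run() {
        Owner owner = new Owner();
        Stream stream = owner.open();
        boolean before = stream.write(new byte[] {1, 2, 3});
        owner.close();
        boolean after = stream.write(new byte[] {4, 5, 6});
        return new boolean[] {before, after};
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("write before close succeeded: " + result[0]);
        System.out.println("write after close succeeded: " + result[1]);
    }
}
```

In this sketch the stream stays usable only as long as something else keeps the owner alive. With the filesystem cache enabled, the cache would presumably play that role, which may explain why the problem only shows up with `fs.abfs.impl.disable.cache`=`true`.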
   
   A minimal code example to reproduce the issue follows. To hit the above error we had to trigger the GC often, so it has to be run with the JVM args `-Xmx480m -XX:G1ReservePercent=50`.
   ```java
   import org.apache.spark.sql.SparkSession;

   var spark = SparkSession.builder()
           .master("local[*]")
           .appName("test")
           .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog")
           .config("spark.sql.catalog.spark_catalog.type", "hadoop")
           .config("spark.sql.catalog.spark_catalog.warehouse", "abfs://<container>@<storage>.dfs.core.windows.net")
           // setup rest of azure secrets for the storage account
           .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
           .config("spark.log.level", "debug")
           // disable the Hadoop FileSystem cache for abfs:// URIs
           .config("fs.abfs.impl.disable.cache", "true")
           .getOrCreate();
   
   spark.sql("CREATE TABLE test_table_x (id LONG, data STRING) " +
           "USING iceberg " +
           "TBLPROPERTIES (" +
           "  'write.format.default'='parquet'" +
           ")");
   // repeat the insert so that a GC cycle can land in the middle of a write
   for (int i = 0; i < 20; i++) {
       spark.sql("INSERT INTO test_table_x VALUES (1, 'a'), (2, 'b'), (3, 'c'), " +
               "(4, 'd'), (5, 'e'), (6, 'f'), (7, 'g'), (8, 'h'), (9, 'i'), (10, 'j')");
   }
   ```
   
   The dependencies used are as below
   
   ```xml
   <dependencies>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-core_2.12</artifactId>
           <version>3.5.8</version>
       </dependency>
       <dependency>
           <groupId>org.apache.spark</groupId>
           <artifactId>spark-sql_2.12</artifactId>
           <version>3.5.8</version>
       </dependency>
       <dependency>
           <groupId>org.apache.iceberg</groupId>
           <artifactId>iceberg-spark-runtime-3.5_2.12</artifactId>
           <version>1.11.0</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-azure</artifactId>
           <version>3.3.6</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-common</artifactId>
           <version>3.3.6</version>
       </dependency>
       <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-client-api</artifactId>
           <version>3.3.6</version>
       </dependency>
   </dependencies>
   ```
   
   Java version: 17.0.18
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

