Re: [I] Spark Iceberg streaming - checkpoint leverages S3fileIo signer path instead of hadoop's S3AFileSystem [iceberg]

via GitHub Wed, 10 Dec 2025 00:00:34 -0800


nastra commented on issue #14762:
URL: https://github.com/apache/iceberg/issues/14762#issuecomment-3635843477


   @koombal can you please share your entire catalog configuration? Also, 
what does `ckptDir` point to in `.option("checkpointLocation", ckptDir)`?
   
   > @nastra any idea why scala uses the tables FileIO while with python 
doesn't?
   
   I don't know. I've only been testing this through our 
[TestStructuredStreamingRead3](https://github.com/apache/iceberg/blob/24ca356fb4ecd48d593949fd25c852c21bc87d53/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreamingRead3.java#L73)
 tests, and all of those use a temp location for checkpoints. For testing, I 
applied this diff to change the FileIO that is used:
   ```
   +import org.apache.iceberg.inmemory.InMemoryFileIO;
    import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
    import org.junit.jupiter.api.extension.ExtendWith;
   
   @@ -52,6 +53,7 @@ public abstract class CatalogTestBase extends TestBaseWithCatalog {
            ImmutableMap.builder()
                .putAll(SparkCatalogConfig.REST.properties())
                .put(CatalogProperties.URI, restCatalog.properties().get(CatalogProperties.URI))
   +            .put(CatalogProperties.FILE_IO_IMPL, InMemoryFileIO.class.getName())
                .build()
          }
        };
   ```
   
   With that change, the checkpoint is definitely using the table's FileIO, 
because the tests then fail with:
   ```
   No in-memory file found for location: file:/var/folders/q3/m55sggg13cz2dpssg7l_q9300000gp/T/junit-395966452732806867/writer-checkpoint-folder/writer-checkpoint/sources/0/offsets/0
   org.apache.iceberg.exceptions.NotFoundException: No in-memory file found for location: file:/var/folders/q3/m55sggg13cz2dpssg7l_q9300000gp/T/junit-395966452732806867/writer-checkpoint-folder/writer-checkpoint/sources/0/offsets/0
        at app//org.apache.iceberg.inmemory.InMemoryFileIO.newInputFile(InMemoryFileIO.java:48)
        at app//org.apache.iceberg.spark.source.SparkMicroBatchStream$InitialOffsetStore.initialOffset(SparkMicroBatchStream.java:558)
        at app//org.apache.iceberg.spark.source.SparkMicroBatchStream.<init>(SparkMicroBatchStream.java:116)
        at app//org.apache.iceberg.spark.source.SparkScan.toMicroBatchStream(SparkScan.java:170)
   ```
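
The reasoning behind that conclusion can be illustrated with a small, dependency-free model. This is only a sketch: `SimpleFileIO`, `InMemoryIO`, `readInitialOffset`, and the example checkpoint path are hypothetical stand-ins and not Iceberg's actual classes or code. It shows that an in-memory store only knows about files written through it, so a "no in-memory file found" failure for a `file:` checkpoint path implies the lookup went through the configured FileIO rather than a filesystem client.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CheckpointFileIoSketch {

  /** Hypothetical stand-in for a table-level file IO abstraction (not Iceberg's FileIO). */
  interface SimpleFileIO {
    byte[] read(String location);

    void write(String location, byte[] content);
  }

  /** Stores files in a map, so it can only resolve locations written through it. */
  static final class InMemoryIO implements SimpleFileIO {
    private final Map<String, byte[]> files = new HashMap<>();

    @Override
    public byte[] read(String location) {
      byte[] content = files.get(location);
      if (content == null) {
        throw new IllegalStateException("No in-memory file found for location: " + location);
      }
      return content;
    }

    @Override
    public void write(String location, byte[] content) {
      files.put(location, content);
    }
  }

  /** Models a streaming source that looks up its initial offset via the IO it was handed. */
  static String readInitialOffset(SimpleFileIO io, String checkpointLocation) {
    String offsetFile = checkpointLocation + "/offsets/0";
    return new String(io.read(offsetFile), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    SimpleFileIO tableIo = new InMemoryIO();
    // Assumed example path; nothing was ever written to it through tableIo.
    String checkpointLocation = "file:/tmp/example-checkpoint/sources/0";
    try {
      readInitialOffset(tableIo, checkpointLocation);
      System.out.println("offset file found");
    } catch (IllegalStateException e) {
      // The failure shows the lookup was routed through the in-memory IO.
      System.out.println(e.getMessage());
    }
  }
}
```

In this model, the lookup succeeds only after the offset file has been written through the same `SimpleFileIO` instance, which mirrors why swapping the catalog's FileIO implementation changes where the checkpoint offset is looked up.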


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


