mbutrovich commented on PR #4335:
URL: 
https://github.com/apache/datafusion-comet/pull/4335#issuecomment-4478448502

   Thanks for the careful citations @karuppayya, those helped a lot.
   
   I'm trying to figure out whether the broadcast and refresh machinery in this 
PR is something Comet should own, or whether a vendor implementation under 
#4309's SPI could give you the same end result with less Comet-side surface 
area. I'd love your read on that.
   
   One scope thing I want to flag up front: we also need this for raw Parquet 
on S3, and any other native S3 reader Comet grows. Iceberg-vended creds are one 
important case, but vendors with custom signers or per-path STS for plain 
Parquet hit the same wall (Hadoop S3A signers don't surface the secret, 
`AWSCredentialsProvider.getCredentials()` is parameterless). The SPI shape in 
this PR is fairly Iceberg-shaped (named `CometCredentialProvider` under 
`org.apache.comet.iceberg`, `initialize(catalogProperties)` keyed off the 
catalog property bag, `ResolveContext(tableLocation)`), and it's not obvious 
how the Parquet path fits since there's no catalog or table location. #4309 
covers both paths through one SPI under `org.apache.comet.cloud.s3` with a 
per-call `(bucket, path, mode)` context that works for either, and I'd want 
whatever we land to keep that breadth.
   
   On the REST-vended credentials point, the reason #4309 today can't reach 
`credentials.uri` and friends is that `CometScanRule.filterStorageProperties` 
strips them down to `s3.`/`gcs.`/`adls.`/`client.` prefixes before they cross 
JNI. If we relax that filter so the full FileIO property map crosses as 
`catalog_properties`, and we only narrow to storage-prefix keys at the 
iceberg-rust `FileIOBuilder.with_prop` call site, then a vendor under #4309 
sees everything `LoadTableResponse` returned. That seems to obsolete the 
parallel `allFileIOProperties` field this PR adds, with a much smaller change.
   
   On multi-catalog with shared FQCN, you're right that #4309 has the gap 
today. I think it's fixable inside the existing SPI shape by extending 
`CometS3CredentialDispatcher` to cache by `(FQCN, catalog id)` and adding 
`initialize(Map<String,String> catalogProperties)` to 
`CometS3CredentialProvider`, called once per catalog on first instantiation. 
That mirrors `S3FileIO.initialize`, which is the shape vendors already write 
against for Iceberg.
   
   What I'm less sure should live in Comet is refresh and distribution. The 
Rust-side TTL cache, the per-plan single-catalog invariant that throws on 
multi-catalog plans, and the driver-side broadcast cache feel like things 
Iceberg's own `CachedSupplier` / `VendedCredentialsProvider` and Spark's 
broadcast already cover, just inside the vendor's implementation rather than in 
Comet. Concretely, an Iceberg REST vended-credential 
`CometS3CredentialProvider` would look roughly like this:
   
   ```java
   public class IcebergRESTVendedS3Provider implements 
CometS3CredentialProvider {
     private volatile VendedCredentialsProvider provider;
   
     @Override
     public void initialize(Map<String, String> catalogProperties) {
       this.provider = VendedCredentialsProvider.create(catalogProperties);
     }
   
     @Override
     public CometS3Credentials getCredentialsForPath(
         String bucket, String path, CometS3AccessMode mode) {
       AwsSessionCredentials c = (AwsSessionCredentials) 
provider.resolveCredentials();
       long expiresAt = 
c.expirationTime().map(Instant::toEpochMilli).orElse(-1L);
       return new CometS3Credentials(
           c.accessKeyId(), c.secretAccessKey(), c.sessionToken(), expiresAt);
     }
   }
   ```
   
   Iceberg's `CachedSupplier` inside `VendedCredentialsProvider` handles 
refresh and prefetch, so neither the vendor nor Comet has to reinvent it. We 
could ship something like this as a reference implementation in Comet alongside 
the SPI so REST users aren't writing it from scratch, while keeping 
caching/refresh/distribution policy out of Comet itself.
   
   Does this shape cover what you need, or is there a scenario I'm missing 
where the vendor model in #4309 (extended with the unfiltered properties + 
`initialize(Map)`) can't reach? Want to make sure I'm not arguing against a 
real requirement.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to