rdblue commented on code in PR #15561:
URL: https://github.com/apache/iceberg/pull/15561#discussion_r2921467152
##########
core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java:
##########
@@ -85,7 +85,7 @@ class RESTTableScan extends DataTableScan {
private final Object hadoopConf;
private final FileIO tableIO;
private String planId = null;
- private FileIO fileIOForPlanId = null;
+ private FileIO scanFileIO = null;
Review Comment:
Looking more closely at the information passed around in this class, it has
a couple of unnecessary fields, and I have a few more questions:
Why does this use `Map<String, String> headers` rather than
`Supplier<Map<String, String>> headers` that the table has? Doesn't this mean
that the headers from the auth session are static? How does credential refresh
work if the authentication header is stale?
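To illustrate the difference (hypothetical names, just a sketch): a `Supplier` is re-read on each request, so refreshed credentials from the auth session are picked up, while a captured `Map` stays frozen at construction time.

```java
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: holding a Supplier instead of a Map means headers are
// resolved lazily, so an auth session refresh is visible to later requests.
class HeaderSource {
  private final Supplier<Map<String, String>> headers;

  HeaderSource(Supplier<Map<String, String>> headers) {
    this.headers = headers;
  }

  Map<String, String> currentHeaders() {
    // re-resolved per call; a plain Map field would return stale headers
    return headers.get();
  }
}
```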
Was the `TableOperations` field originally used? It isn't used now, other
than to pass it to refined scans. I think it should be removed from both the
constructor and the fields.
This passes `ResourcePaths` and `TableIdentifier` to build 2 paths that
could instead be passed in as a single path, with the second built using the
plan ID. I think this should be
reconsidered. The plan and fetch endpoints can be passed in as simple strings,
and it would simplify this to use a `Function<String, String>` to construct the
plan path with ID. Another option is to create an object similar to
`ResourcePaths` that is specific to a table (hides `TableIdentifier`) and use
that. Either one would be cleaner.
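Something like this sketch of the first option (names are just illustrative): the endpoints arrive as plain strings and a `Function<String, String>`, so the scan never sees `ResourcePaths` or `TableIdentifier`.

```java
import java.util.function.Function;

// Sketch only: the plan endpoint is a plain string, and the plan-status
// path is built by a function from the plan ID. Names are hypothetical.
class ScanEndpoints {
  final String planPath;                      // POST .../plan
  final Function<String, String> planIdPath;  // GET .../plan/{plan-id}

  ScanEndpoints(String planPath, Function<String, String> planIdPath) {
    this.planPath = planPath;
    this.planIdPath = planIdPath;
  }
}
```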
This also passes `supportedEndpoints`, which spreads out the logic for how
to handle endpoints that aren't supported. These endpoints are also checked
when needed, so this code will create a plan request and could then fail to
fetch tasks if the fetch endpoint isn't supported. It also produces error
messages that are not helpful, like "Server does not support endpoint: GET
/v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}" instead of
"Invalid status: submitted (service does not support async planning)" that
would be more helpful.
A cleaner way to handle endpoints is to verify required endpoints before
creating this scan. Both the plan and fetch endpoints should be required. Then
the optional endpoints should be exposed as booleans, like `supportsAsync` and
`supportsCancel`.
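A rough sketch of what I mean (endpoint names and the exception type here are stand-ins, not the real constants): required endpoints fail fast at construction, and the optional ones collapse into booleans.

```java
import java.util.Set;

// Sketch only: validate the required plan and fetch endpoints before the
// scan is created, then carry the optional endpoints as simple booleans.
class ScanCapabilities {
  final boolean supportsAsync;
  final boolean supportsCancel;

  ScanCapabilities(Set<String> supportedEndpoints) {
    if (!supportedEndpoints.contains("plan") || !supportedEndpoints.contains("fetch")) {
      throw new UnsupportedOperationException(
          "Server does not support scan planning (plan and fetch endpoints are required)");
    }
    this.supportsAsync = supportedEndpoints.contains("plan-status");
    this.supportsCancel = supportedEndpoints.contains("cancel-plan");
  }
}
```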
Last, this leaks catalog properties and the Hadoop conf so that
`CatalogUtil.loadFileIO` can be called with `List<Credential>`. Why not pass a
function to create the `FileIO` so that the properties and conf are contained
in the REST table operations?
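For example (sketch only; `FileIO` and `Credential` below are local stand-ins for the Iceberg types): the scan holds a factory function, and the catalog properties and Hadoop conf stay captured inside the closure that the table operations builds.

```java
import java.util.List;
import java.util.function.Function;

// Stand-ins for the real Iceberg types, for illustration only.
interface Credential {}
interface FileIO {}

// Sketch only: the scan asks a factory for a FileIO instead of calling
// CatalogUtil.loadFileIO itself, so properties and conf are not leaked here.
class ScanIO {
  private final Function<List<Credential>, FileIO> ioFactory;

  ScanIO(Function<List<Credential>, FileIO> ioFactory) {
    this.ioFactory = ioFactory;
  }

  FileIO ioFor(List<Credential> credentials) {
    // the factory closes over catalog properties and the Hadoop conf
    return ioFactory.apply(credentials);
  }
}
```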
I think this works right now and these aren't blockers (other than the
potential auth session issue), but I would really like to see this class
simplified by reducing the number of things that have to be passed to it and
removing some of the things that are done here, like handling endpoint checks.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]