amogh-jahagirdar commented on code in PR #14867:
URL: https://github.com/apache/iceberg/pull/14867#discussion_r2683382828
##########
core/src/main/java/org/apache/iceberg/rest/RESTCatalogProperties.java:
##########
@@ -37,12 +37,107 @@ private RESTCatalogProperties() {}
public static final String NAMESPACE_SEPARATOR = "namespace-separator";
- // Enable planning on the REST server side
- public static final String REST_SCAN_PLANNING_ENABLED = "rest-scan-planning-enabled";
- public static final boolean REST_SCAN_PLANNING_ENABLED_DEFAULT = false;
+ // Configure scan planning mode
+ // Can be set by server in LoadTableResponse.config() or by client in catalog properties
+ // Negotiation rules: ONLY beats PREFERRED, both PREFERRED = client wins
+ // Default when neither client nor server provides: client-preferred
+ public static final String SCAN_PLANNING_MODE = "scan-planning-mode";
+ public static final String SCAN_PLANNING_MODE_DEFAULT =
+ ScanPlanningMode.CLIENT_PREFERRED.modeName();
public enum SnapshotMode {
ALL,
REFS
}
+
+ /**
+ * Enum to represent scan planning mode configuration.
+ *
+ * <p>Can be configured by:
+ *
+ * <ul>
+ * <li>Server: Returned in LoadTableResponse.config() to advertise server preference/requirement
+ * <li>Client: Set in catalog properties to set client preference/requirement
+ * </ul>
+ *
+ * <p>When both client and server configure this property, the values are negotiated:
+ *
+ * <p>Values:
+ *
+ * <ul>
+ * <li>CLIENT_ONLY - MUST use client-side planning. Fails if paired with CATALOG_ONLY from other
Review Comment:
>I am using PyIceberg and I know I am low on resources, so it's better that I
>just do remote planning if possible when the table is big. PyIceberg can say
>it prefers that the catalog plan the scan, and the server, based on
>catalog_only / catalog_preferred, can have that negotiation.
Yeah, I guess I'm mainly coming from the perspective that if a user is
running PyIceberg in a low-resource environment, then that user would either
knowingly and explicitly configure the client property to use remote planning,
or PyIceberg would internally choose which planning it wants when it's optional
(could be something simple like always doing client planning, could be
heuristics-based; it's all up to client implementations).
It's nice that the server could use this as a dynamic mechanism to control
planning based on load, but I think there are already mechanisms for that. A
server could just throttle a client-initiated planning request, and the client
could then fall back to client-side planning, for instance. This doesn't
require additional protocol complexity to support today (I believe).
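
A minimal sketch of that throttle-and-fall-back idea, with no spec changes
involved. Everything here is hypothetical and for illustration only:
`PlanningFallback`, `PlanningThrottledException`, and `planWithFallback` are
not Iceberg APIs, and the exception merely stands in for whatever error a
client surfaces when the server throttles a planning request.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are Iceberg APIs. */
final class PlanningFallback {

  /** Assumed stand-in for the error a client sees when the server throttles planning. */
  static final class PlanningThrottledException extends RuntimeException {
    PlanningThrottledException(String message) {
      super(message);
    }
  }

  private PlanningFallback() {}

  /**
   * Tries remote (server-side) planning first. If the server throttles the
   * request, plans on the client instead. Any other failure propagates.
   */
  static <T> List<T> planWithFallback(
      Supplier<List<T>> remotePlan, Supplier<List<T>> clientPlan) {
    try {
      return remotePlan.get();
    } catch (PlanningThrottledException e) {
      // Server is shedding load: fall back to client-side planning.
      return clientPlan.get();
    }
  }
}
```

A client could wrap its planning call this way and decide for itself when to
retry remote planning.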
>Let's say I am Spark and I have big compute infra, but based on the current
>workload, say an environment with a lot of concurrent queries, I will not
>have a lot of memory available to plan this, so I would start by saying I
>prefer catalog. If instead I have a dedicated cluster, then rather than doing
>a remote plan I would do it in my JVM, and I would say client_only from the
>client side.
Yeah, same principle as the PyIceberg case imo. I feel like in these
circumstances a user would explicitly configure things, and if we need a
little more dynamism based on server/client load, we'd build that logic
directly in the client without spec'ing out preferences.
As far as I can tell, the main benefit of codifying preferences in the spec
is that it standardizes client behavior when the endpoint is optional rather
than required (i.e. we know exactly what PyIceberg, Java, Rust, etc. would do
in this situation given some combination of options in that matrix). With my
approach, there'd be deviation in client behavior across different
implementations, but I personally think that's kind of an advantage in this
case.
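
To make that matrix concrete, here is a minimal sketch of the negotiation
rules as stated in the diff comments above (ONLY beats PREFERRED, both
PREFERRED means the client wins, conflicting ONLY values fail, default is
client-preferred). The class, enum, and method names are hypothetical and are
not the PR's implementation. How a side that sets nothing is handled when the
other side does set a value is an assumption of this sketch.

```java
/** Hypothetical sketch of the negotiation matrix; not the PR's code. */
final class ScanPlanningNegotiation {

  enum Planner { CLIENT, CATALOG }

  enum Mode {
    CLIENT_ONLY(Planner.CLIENT, true),
    CLIENT_PREFERRED(Planner.CLIENT, false),
    CATALOG_PREFERRED(Planner.CATALOG, false),
    CATALOG_ONLY(Planner.CATALOG, true);

    final Planner planner;
    final boolean required;

    Mode(Planner planner, boolean required) {
      this.planner = planner;
      this.required = required;
    }
  }

  private ScanPlanningNegotiation() {}

  /** Pass null for a side that did not configure a mode. */
  static Planner negotiate(Mode client, Mode server) {
    // Default when neither side provides a value: client-preferred.
    if (client == null && server == null) {
      return Mode.CLIENT_PREFERRED.planner;
    }
    // Assumption: if only one side configures a mode, that mode is used.
    if (client == null) {
      return server.planner;
    }
    if (server == null) {
      return client.planner;
    }
    // Two conflicting hard requirements cannot be reconciled.
    if (client.required && server.required && client.planner != server.planner) {
      throw new IllegalStateException(
          "Incompatible scan planning modes: client=" + client + ", server=" + server);
    }
    // ONLY beats PREFERRED.
    if (client.required) {
      return client.planner;
    }
    if (server.required) {
      return server.planner;
    }
    // Both sides merely prefer: the client wins.
    return client.planner;
  }
}
```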
I personally don't feel like that's super useful but as I said, I'm willing
to move forward here since I guess these additional options aren't _that_
complicated for clients to implement and there's some level of benefit I can
see to standardizing behavior across clients.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]