RussellSpitzer commented on code in PR #9731:
URL: https://github.com/apache/iceberg/pull/9731#discussion_r1809392091
##########
api/src/main/java/org/apache/iceberg/actions/RewriteManifests.java:
##########
@@ -44,6 +47,43 @@ public interface RewriteManifests
*/
RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);
+ /**
+ * Rewrite manifests in a given order, based on partition field names
+ *
+ * <p>Supply an optional set of partition field names to cluster the rewritten manifests by. For
+ * example, given a table PARTITIONED BY (a, b, c, d), you may wish to rewrite and cluster
+ * manifests by ('d', 'b') only, based on your query patterns. Rewriting Manifests in this way
+ * will yield manifest_lists that point to manifest_files containing data files for common 'd' and
+ * 'b' partitions.
+ *
+ * <p>If not set, manifests will be rewritten in the order of the transforms in the table's
+ * current partition spec.
+ *
+ * @param partitionFieldClustering Exact transformed column names used for partitioning; not the
+ * raw column names that partitions are derived from. E.G. supply 'data_bucket' and not 'data'
+ * for a bucket(N, data) partition * definition
+ * @return this for method chaining
+ */
+ default RewriteManifests clusterBy(List<String> partitionFieldClustering) {
+ throw new UnsupportedOperationException(
+ this.getClass().getName() + " doesn't implement clusterBy(List<String>)");
+ }
+
+ /**
+ * Rewrite manifests in a given order, dictated by a custom Function
+ *
+ * <p>Supply a Function which will apply its own custom clustering logic based on supplied {@link
+ * org.apache.iceberg.DataFile} attributes.
+ *
+ * @param clusterStrategyFunction A Function that returns a String to be used for manifest
+ * clustering
+ * @return this method for chaining
+ */
+ default RewriteManifests clusterBy(Function<DataFile, String> clusterStrategyFunction) {
Review Comment:
@zachdisc I really don't want to widen the API if we can help it. Adding a
Function<> API is something that will be hard for us to walk back. I would
feel more comfortable if we started with partition_transforms and considered
opening things up more once community demand becomes more pronounced.

I think a lot of your use cases will work better if we allow users to sort
on any column. We can't really do that at the moment, but we have a lot of
plans for V4 metadata that will allow us to bring all column min/maxes up to
the manifest_list level.
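
To make the narrower option concrete, here is a minimal sketch of clustering by a chosen subset of partition field names, as the `clusterBy(List<String>)` Javadoc describes. It is plain Java and does not use Iceberg types: the class `ClusterKeySketch`, the methods `clusterKey` and `sortByCluster`, and the `Map<String, String>` stand-in for a data file's partition values are all hypothetical, and this is not how the PR implements it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Iceberg.
public class ClusterKeySketch {

  // Builds a clustering key from the requested partition fields, in the
  // order given, ignoring every other field in the partition tuple.
  static String clusterKey(Map<String, String> partition, List<String> fields) {
    StringBuilder key = new StringBuilder();
    for (String field : fields) {
      if (!partition.containsKey(field)) {
        throw new IllegalArgumentException("Unknown partition field: " + field);
      }
      key.append(field).append('=').append(partition.get(field)).append('/');
    }
    return key.toString();
  }

  // Returns a copy of the entries ordered by clustering key, so entries that
  // share the same values for the chosen fields end up adjacent.
  // Keys are compared as plain strings (lexicographic order).
  static List<Map<String, String>> sortByCluster(
      List<Map<String, String>> partitions, List<String> fields) {
    List<Map<String, String>> sorted = new ArrayList<>(partitions);
    sorted.sort(Comparator.comparing((Map<String, String> p) -> clusterKey(p, fields)));
    return sorted;
  }
}
```

For a table PARTITIONED BY (a, b, c, d), calling `sortByCluster(entries, List.of("d", "b"))` groups entries by 'd' first and 'b' second. Only field names are exposed, so the surface stays much smaller than an arbitrary `Function<DataFile, String>`.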
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]