aokolnychyi commented on code in PR #4708: URL: https://github.com/apache/iceberg/pull/4708#discussion_r874125403
########## api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java: ########## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.actions; + +public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> { Review Comment: Shall we add a few sentences about the purpose of this action similar to other actions? ########## api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java: ########## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.actions; + +public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> { + /** + * Emit changed data set by a snapshot id. + * + * @param snapshotId id of the snapshot to generate changed data + * @return this for method chaining + */ + GenerateChangeSet forSnapshot(long snapshotId); + + /** + * Emit changed data set for the current snapshot. + * + * @return this for method chaining + */ + GenerateChangeSet forCurrentSnapshot(); + + /** + * Emit changed data from a particular snapshot(exclusive). + * + * @param fromSnapshotId id of the start snapshot + * @return this for method chaining + */ + GenerateChangeSet afterSnapshot(long fromSnapshotId); Review Comment: This naming seems good to me but I wonder whether we should follow whatever was recently added in #4580. One thing I like about that API is using explicit inclusive/exclusive words whenever configuring the start snapshot. ``` GenerateChangeSet fromSnapshotInclusive(long fromSnapshotId); GenerateChangeSet fromSnapshotExclusive(long fromSnapshotId); GenerateChangeSet toSnapshot(long toSnapshotId); ``` Thoughts, @flyrain @stevenzwu @szehon-ho @rdblue @RussellSpitzer? ########## api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java: ########## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.actions; + +public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> { + /** + * Emit changed data set by a snapshot id. + * + * @param snapshotId id of the snapshot to generate changed data + * @return this for method chaining + */ + GenerateChangeSet forSnapshot(long snapshotId); + + /** + * Emit changed data set for the current snapshot. + * + * @return this for method chaining + */ + GenerateChangeSet forCurrentSnapshot(); + + /** + * Emit changed data from a particular snapshot(exclusive). + * + * @param fromSnapshotId id of the start snapshot + * @return this for method chaining + */ + GenerateChangeSet afterSnapshot(long fromSnapshotId); + + /** + * Emit change data set from the start snapshot (exclusive) to the end snapshot (inclusive). + * + * @param fromSnapshotId id of the start snapshot + * @param toSnapshotId id of the end snapshot + * @return this for method chaining + */ + GenerateChangeSet betweenSnapshots(long fromSnapshotId, long toSnapshotId); + + /** + * The action result that contains a dataset of changed rows. + */ + interface Result<T> { Review Comment: I personally don't think the action API should be limited to table maintenance. When we added it, the idea was to provide solutions for common problems that require a query engine. Initially, those happened to be mostly related to table maintenance. However, I wouldn't mind actions for other purposes. We all agree to share the planning part across engines via a dedicated scan API. However, we still need to build an engine-specific representation somewhere. Of course, we could have a utility but I am not a big fan of exposing utilities to users. The action API is much more user-facing and requires us to think about the proper interface and keep the compatibility. We had many issues with Spark utils in the past. One thing about using the action API for CDC that concerns me is the need to parameterize the result. Our action API was engine agnostic so far but we need to return Spark `Dataset` here. I am not sure how big of a deal it is. Thoughts, @flyrain @rdblue @RussellSpitzer @szehon-ho @stevenzwu? @rdblue @stevenzwu, I think both of you had concerns about using an action. Could you elaborate a little bit? Given that we will have a common Scan API to share the planning logic. ########## api/src/main/java/org/apache/iceberg/actions/GenerateChangeSet.java: ########## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.actions; + +public interface GenerateChangeSet extends Action<GenerateChangeSet, GenerateChangeSet.Result> { + /** + * Emit changed data set by a snapshot id. + * + * @param snapshotId id of the snapshot to generate changed data + * @return this for method chaining + */ + GenerateChangeSet forSnapshot(long snapshotId); + + /** + * Emit changed data set for the current snapshot. + * + * @return this for method chaining + */ + GenerateChangeSet forCurrentSnapshot(); Review Comment: I am not sure how frequently folks will be calling this. Is there a particular use case? I thought folks would usually have a starting point and would be interested to see changes since that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
