jackye1995 commented on code in PR #9728:
URL: https://github.com/apache/iceberg/pull/9728#discussion_r1492816128
##########
core/src/main/java/org/apache/iceberg/FileScanTaskParser.java:
##########
@@ -21,116 +21,84 @@
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonNode;
import java.io.IOException;
-import org.apache.iceberg.expressions.Expression;
-import org.apache.iceberg.expressions.ExpressionParser;
-import org.apache.iceberg.expressions.Expressions;
-import org.apache.iceberg.expressions.ResidualEvaluator;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
-import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.base.Strings;
import org.apache.iceberg.util.JsonUtil;
public class FileScanTaskParser {
- private static final String SCHEMA = "schema";
- private static final String SPEC = "spec";
- private static final String DATA_FILE = "data-file";
- private static final String START = "start";
- private static final String LENGTH = "length";
- private static final String DELETE_FILES = "delete-files";
- private static final String RESIDUAL = "residual-filter";
+ private static final String TASK_TYPE = "task-type";
+
+ private enum TaskType {
Review Comment:
I feel the Java class names are starting to get confusing, even at the
interface level. Although `DataTask` extends `FileScanTask`, there is really no
file that the `DataTask` is scanning. So although `StaticDataTask` and
`BaseFileScanTask` both implements `FileScanTask`, they are not really
conceptually similar. The actual relationship is more like `StaticDataTask`
implements `DataTask`, `BaseFileScanTask` implements `FileScanTask`, but
`DataTask` and `FileScanTask` have many things in common so we made `DataTask`
extends `FileScanTask`.
To resolve this confusing situation and potentially accommodate other types
of scan task, I am thinking if we should go one more layer above, so we keep
the file scan task spec as is, and on top of that, have a serialization spec
for **scan task**, which has types like `file-scan-task`, `data-task`. And we
then have `ScanTaskParser` that delegates to the existing `FileScanTaskParser`
or the `DataTaskParser`. What do you think?
##########
format/spec.md:
##########
@@ -1237,17 +1237,36 @@ Content file (data or delete) is serialized as a JSON
object according to the fo
| **`equality-ids`** |`JSON list of int: Field ids used to determine row
equality in equality delete files`|`[1]`|
| **`sort-order-id`** |`JSON int`|`1`|
-### File Scan Task Serialization
-
-File scan task is serialized as a JSON object according to the following table.
-
-| Metadata field |JSON representation|Example|
-|--------------------------|--- |--- |
-| **`schema`** |`JSON object`|`See above, read schemas instead`|
-| **`spec`** |`JSON object`|`See above, read partition specs
instead`|
-| **`data-file`** |`JSON object`|`See above, read content file instead`|
-| **`delete-files`** |`JSON list of objects`|`See above, read content file
instead`|
-| **`residual-filter`** |`JSON object: residual filter
expression`|`{"type":"eq","term":"id","value":1}`|
+### Task Serialization
+
+There are different task implementations, e.g. `BaseFileScanTask` and
`StaticDataTask` in Java.
+A `task-type` field is needed to distinguish different task types.
+
+| Metadata field | JSON representation | Example
|
+|-----------------|---------------------|------------------------------------------------------------------------------------------------|
+| **`task-type`** | `JSON string` | `file-scan-task`, `data-task`.
Absence of this field should be interpreted as `file-scan-task` |
+
+`file-scan-task` represents a scan task with a data file and maybe delete
files.
Review Comment:
nit: "and maybe delete files" -> "and optional delete files that should be
applied to the data file"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]