[ https://issues.apache.org/jira/browse/DRILL-8027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442587#comment-17442587 ]
ASF GitHub Bot commented on DRILL-8027:
---------------------------------------
dzamo commented on a change in pull request #2357:
URL: https://github.com/apache/drill/pull/2357#discussion_r747998477
##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/index/ExprToRex.java
##########
@@ -62,31 +61,25 @@ public static RelDataTypeField findField(String fieldName, RelDataType rowType)
     return null;
   }
-  private RexNode makeItemOperator(String[] paths, int index, RelDataType rowType) {
-    if (index == 0) { //last one, return ITEM([0]-inputRef, [1] Literal)
-      final RelDataTypeField field = findField(paths[0], rowType);
-      return field == null ? null : builder.makeInputRef(field.getType(), field.getIndex());
-    }
-    return builder.makeCall(SqlStdOperatorTable.ITEM,
-        makeItemOperator(paths, index - 1, rowType),
-        builder.makeLiteral(paths[index]));
-  }
-
   @Override
   public RexNode visitSchemaPath(SchemaPath path, Void value) throws RuntimeException {
-    PathSegment.NameSegment rootSegment = path.getRootSegment();
-    if (rootSegment.isLastPath()) {
-      final RelDataTypeField field = findField(rootSegment.getPath(), newRowType);
-      return field == null ? null : builder.makeInputRef(field.getType(), field.getIndex());
-    }
-    List<String> paths = Lists.newArrayList();
-    while (rootSegment != null) {
-      paths.add(rootSegment.getPath());
-      rootSegment = (PathSegment.NameSegment) rootSegment.getChild();
+    PathSegment pathSegment = path.getRootSegment();
+
+    RelDataTypeField field = findField(pathSegment.getNameSegment().getPath(), newRowType);
+    RexNode rexNode = field == null ? null : builder.makeInputRef(field.getType(), field.getIndex());
+    while (!pathSegment.isLastPath()) {
+      pathSegment = pathSegment.getChild();
+      RexNode ref;
+      if (pathSegment.isNamed()) {
+        ref = builder.makeLiteral(pathSegment.getNameSegment().getPath());
+      } else {
+        ref = builder.makeBigintLiteral(BigDecimal.valueOf(pathSegment.getArraySegment().getIndex()));
Review comment:
Out of curiosity, what is BigDecimal doing here? Why not just an integer?
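(For context: Calcite's `RexBuilder.makeBigintLiteral` is declared to take a `BigDecimal` — Calcite stores exact numeric literals as `BigDecimal` internally — so the call site has to wrap the int index. A minimal sketch of why that wrapping is lossless; the class and method names here are illustrative, not from the PR:)

```java
import java.math.BigDecimal;

public class BigintLiteralSketch {
    // BigDecimal.valueOf(long) converts exactly (no double involved), and
    // longValueExact() throws ArithmeticException on any loss, so the int
    // array index survives the round trip unchanged.
    public static long roundTrip(int arrayIndex) {
        BigDecimal wrapped = BigDecimal.valueOf(arrayIndex);
        return wrapped.longValueExact();
    }
}
```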
##########
File path:
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatPlugin.java
##########
@@ -56,7 +59,23 @@
  AbstractWriter getWriter(PhysicalOperator child, String location, List<String> partitionColumns) throws IOException;
-  Set<StoragePluginOptimizerRule> getOptimizerRules();
+  @Deprecated
+  default Set<? extends RelOptRule> getOptimizerRules() {
+    return Collections.emptySet();
+  }
+
+  default Set<? extends RelOptRule> getOptimizerRules(PlannerPhase phase) {
+    switch (phase) {
+      case PHYSICAL:
+        return getOptimizerRules();
+      case LOGICAL:
+      case JOIN_PLANNING:case LOGICAL_PRUNE_AND_JOIN:
Review comment:
Is this meant to be on one line?
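(For comparison, the conventional layout puts each grouped case label on its own line so the fall-through is visible at a glance. A self-contained style sketch, using a hypothetical enum and return values rather than the PR's actual rule sets:)

```java
public class PhaseRules {
    enum PlannerPhase { LOGICAL, PHYSICAL, JOIN_PLANNING, LOGICAL_PRUNE_AND_JOIN }

    // Grouped case labels each on their own line; the last three all fall
    // through to the same return, which is what the quoted diff appears
    // to intend for the non-physical phases.
    static String rulesFor(PlannerPhase phase) {
        switch (phase) {
            case PHYSICAL:
                return "physical";
            case LOGICAL:
            case JOIN_PLANNING:
            case LOGICAL_PRUNE_AND_JOIN:
                return "logical";
            default:
                return "none";
        }
    }
}
```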
##########
File path: contrib/format-iceberg/README.md
##########
@@ -0,0 +1,117 @@
+# Apache Iceberg format plugin
+
+This format plugin enabled Drill to query Apache Iceberg tables.
+
+Unlike regular format plugins, the Iceberg table is a folder with data and metadata files, but Drill checks the presence
+of the `metadata` folder to ensure that the table is Iceberg one.
+
+Drill supports reading all formats of Iceberg tables available at this moment: Parquet, Avro, and ORC.
+No need to provide actual table format, it will be discovered automatically.
+
+For details related to Apache Iceberg table format, please refer to [official docs](https://iceberg.apache.org/#).
+
+## Supported optimizations and features
+
+### Project pushdown
+
+This format plugin supports project and filter pushdown optimizations.
+
+For the case of project pushdown, only specified in the query columns will be read, even if it is a nested column. In
Review comment:
```suggestion
For the case of project pushdown, only columns specified in the query will be read, even if they are nested columns. In
```
##########
File path: contrib/format-iceberg/src/main/java/org/apache/drill/exec/store/iceberg/IcebergWork.java
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.databind.DeserializationContext;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.SerializerProvider;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.fasterxml.jackson.databind.annotation.JsonSerialize;
+import com.fasterxml.jackson.databind.deser.std.StdDeserializer;
+import com.fasterxml.jackson.databind.ser.std.StdSerializer;
+import lombok.Value;
+import lombok.extern.slf4j.Slf4j;
+import org.apache.iceberg.CombinedScanTask;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.util.Base64;
+
+@Value
+@JsonSerialize(using = IcebergWork.IcebergWorkSerializer.class)
+@JsonDeserialize(using = IcebergWork.IcebergWorkDeserializer.class)
+public class IcebergWork {
+  CombinedScanTask scanTask;
+
+  /**
+   * Special deserializer for {@link IcebergWork} class that deserializes
+   * {@code scanTask} filed from byte array string created using {@link java.io.Serializable}.
Review comment:
filed -> field
##########
File path: contrib/format-iceberg/README.md
##########
@@ -0,0 +1,99 @@
+# Apache Iceberg format plugin
+
+This format plugin enabled Drill to query Apache Iceberg tables.
+
+Unlike regular format plugins, the Iceberg table is a folder with data and metadata files, but Drill checks the presence
+of the `metadata` folder to ensure that the table is Iceberg one.
+
+Drill supports reading all formats of Iceberg tables available at this moment: Parquet, Avro, and ORC.
+No need to provide actual table format, it will be discovered automatically.
+
+For details related to Apache Iceberg table format, please refer to [official docs](https://iceberg.apache.org/#).
+
+## Supported optimizations and features
+
+### Project pushdown
+
+This format plugin supports project and filter pushdown optimizations.
+
+For the case of project pushdown, only specified in the query columns will be read, even if it is a nested column. In
+conjunction with column-oriented formats like Parquet or ORC, it allows improving reading performance significantly.
+
+### Filter pushdown
+
+For the case of filter pushdown, all expressions supported by Iceberg API will be pushed down, so only data that matches
+the filter expression will be read.
+
+### Schema provisioning
+
+This format plugin supports the schema provisioning feature. Though Iceberg provides table schema, in some cases, it
+might be useful to select data with customized schema, so it can be done using the table function:
+
+```sql
+SELECT int_field,
+       string_field
+FROM table(dfs.tmp.testAllTypes(schema => 'inline=(int_field varchar not null default `error`)'))
+```
+
+In this example, we convert int field to string and return `'error'` literals for null values.
+
+### Querying table metadata
+
+Apache Drill provides the ability to query any kind of table metadata Iceberg can return.
+
+At this point, Apache Iceberg has the following metadata kinds:
+
+* ENTRIES
+* FILES
+* HISTORY
+* SNAPSHOTS
+* MANIFESTS
+* PARTITIONS
+* ALL_DATA_FILES
+* ALL_MANIFESTS
+* ALL_ENTRIES
+
+To query specific metadata, just add the `#metadata_name` suffix to the table location, like in the following example:
+
+```sql
+SELECT *
+FROM dfs.tmp.`testAllTypes#snapshots`
+```
+
+### Querying specific table versions (snapshots)
+
+Apache Icebergs has the ability to track the table modifications and read specific version before or after modifications
+or modifications itself.
+
+This storage plugin embraces this ability and provides an easy-to-use way of triggering it.
+
+The following ways of specifying table version are supported:
+
+- `snapshotId` - id of the specific snapshot
+- `snapshotAsOfTime` - the most recent snapshot as of the given time in milliseconds
+- `fromSnapshotId` - read appended data from `fromSnapshotId` exclusive to the current snapshot inclusive
+- \[`fromSnapshotId` : `toSnapshotId`\] - read appended data from `fromSnapshotId` exclusive to `toSnapshotId` inclusive
+
+Table function can be used to specify one of the above configs in the following way:
+
+```sql
+SELECT *
+FROM table(dfs.tmp.testAllTypes(type => 'iceberg', snapshotId => 123456789));
+
+SELECT *
+FROM table(dfs.tmp.testAllTypes(type => 'iceberg', snapshotAsOfTime => 1636231332000));
+
+SELECT *
+FROM table(dfs.tmp.testAllTypes(type => 'iceberg', fromSnapshotId => 123456789));
+
+SELECT *
+FROM table(dfs.tmp.testAllTypes(type => 'iceberg', fromSnapshotId => 123456789, toSnapshotId => 987654321));
+```
+
+## Configuration
+
+Format plugin has the following configuration options:
+
+- `type` - format plugin type, should be `'iceberg'`
+- `properties` - Iceberg-specific table properties. Please refer to [Configuration](https://iceberg.apache.org/#configuration/) page for more details
Review comment:
I guess just a minimal example if Iceberg has many possible properties.
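(A minimal sketch of what such a config could look like inside a `dfs` storage plugin's `formats` section; the `read.split.target-size` property is one illustrative Iceberg table property — check the Iceberg Configuration page for the full list:)

```json
"formats": {
  "iceberg": {
    "type": "iceberg",
    "properties": {
      "read.split.target-size": "134217728"
    }
  }
}
```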
##########
File path: contrib/format-iceberg/src/main/java/org/apache/drill/exec/store/iceberg/IcebergWork.java
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.databind.DeserializationContext;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.SerializerProvider;
+import com.fasterxml.jackson.databind.annotation.JsonDeserialize;
+import com.fasterxml.jackson.databind.annotation.JsonSerialize;
+import com.fasterxml.jackson.databind.deser.std.StdDeserializer;
+import com.fasterxml.jackson.databind.ser.std.StdSerializer;
+import lombok.Value;
+import lombok.extern.slf4j.Slf4j;
+import org.apache.iceberg.CombinedScanTask;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.util.Base64;
+
+@Value
+@JsonSerialize(using = IcebergWork.IcebergWorkSerializer.class)
+@JsonDeserialize(using = IcebergWork.IcebergWorkDeserializer.class)
+public class IcebergWork {
+  CombinedScanTask scanTask;
+
+  /**
+   * Special deserializer for {@link IcebergWork} class that deserializes
+   * {@code scanTask} filed from byte array string created using {@link java.io.Serializable}.
+   */
+  @Slf4j
+  public static class IcebergWorkDeserializer extends StdDeserializer<IcebergWork> {
+
+    public IcebergWorkDeserializer() {
+      super(IcebergWork.class);
+    }
+
+    @Override
+    public IcebergWork deserialize(JsonParser p, DeserializationContext ctxt) throws IOException {
+      JsonNode node = p.getCodec().readTree(p);
+      String scanTaskString = node.get(IcebergWorkSerializer.SCAN_TASK_FIELD).asText();
+      try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(Base64.getDecoder().decode(scanTaskString)))) {
+        Object scanTask = ois.readObject();
+        return new IcebergWork((CombinedScanTask) scanTask);
+      } catch (ClassNotFoundException e) {
+        logger.error(e.getMessage(), e);
+      }
+
+      return null;
+    }
+  }
+
+  /**
+   * Special serializer for {@link IcebergWork} class that serializes
+   * {@code scanTask} filed to byte array string created using {@link java.io.Serializable}
Review comment:
filed -> field
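(The serializer/deserializer pair above round-trips the scan task through Base64-encoded Java serialization. A self-contained sketch of that pattern using only the JDK, with a plain `String` standing in for `CombinedScanTask`; the class and method names are hypothetical, not from the PR:)

```java
import java.io.*;
import java.util.Base64;

public class SerdeSketch {
    // Encode any Serializable object as a Base64 string, mirroring what
    // IcebergWorkSerializer does with the scan task before writing JSON.
    public static String encode(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(value);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Decode the Base64 string back into the original object, mirroring
    // the deserializer's ObjectInputStream read.
    public static Object decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return ois.readObject();
        }
    }
}
```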
##########
File path: contrib/format-iceberg/README.md
##########
@@ -0,0 +1,117 @@
+# Apache Iceberg format plugin
+
+This format plugin enabled Drill to query Apache Iceberg tables.
+
+Unlike regular format plugins, the Iceberg table is a folder with data and metadata files, but Drill checks the presence
+of the `metadata` folder to ensure that the table is Iceberg one.
+
+Drill supports reading all formats of Iceberg tables available at this moment: Parquet, Avro, and ORC.
+No need to provide actual table format, it will be discovered automatically.
+
+For details related to Apache Iceberg table format, please refer to [official docs](https://iceberg.apache.org/#).
+
+## Supported optimizations and features
+
+### Project pushdown
+
+This format plugin supports project and filter pushdown optimizations.
+
+For the case of project pushdown, only specified in the query columns will be read, even if it is a nested column. In
+conjunction with column-oriented formats like Parquet or ORC, it allows improving reading performance significantly.
+
+### Filter pushdown
+
+For the case of filter pushdown, all expressions supported by Iceberg API will be pushed down, so only data that matches
+the filter expression will be read.
+
+### Schema provisioning
+
+This format plugin supports the schema provisioning feature. Though Iceberg provides table schema, in some cases, it
+might be useful to select data with customized schema, so it can be done using the table function:
+
+```sql
+SELECT int_field,
+       string_field
+FROM table(dfs.tmp.testAllTypes(schema => 'inline=(int_field varchar not null default `error`)'))
+```
+
+In this example, we convert int field to string and return `'error'` literals for null values.
+
+### Querying table metadata
+
+Apache Drill provides the ability to query any kind of table metadata Iceberg can return.
+
+At this point, Apache Iceberg has the following metadata kinds:
+
+* ENTRIES
+* FILES
+* HISTORY
+* SNAPSHOTS
+* MANIFESTS
+* PARTITIONS
+* ALL_DATA_FILES
+* ALL_MANIFESTS
+* ALL_ENTRIES
+
+To query specific metadata, just add the `#metadata_name` suffix to the table location, like in the following example:
+
+```sql
+SELECT *
+FROM dfs.tmp.`testAllTypes#snapshots`
+```
+
+### Querying specific table versions (snapshots)
+
+Apache Icebergs has the ability to track the table modifications and read specific version before or after modifications
+or modifications itself.
+
+This storage plugin embraces this ability and provides an easy-to-use way of triggering it.
Review comment:
```suggestion
This format plugin embraces this ability and provides an easy-to-use way of triggering it.
```
##########
File path: contrib/format-iceberg/README.md
##########
@@ -0,0 +1,99 @@
+# Apache Iceberg format plugin
+
+This format plugin enabled Drill to query Apache Iceberg tables.
+
+Unlike regular format plugins, the Iceberg table is a folder with data and metadata files, but Drill checks the presence
+of the `metadata` folder to ensure that the table is Iceberg one.
+
+Drill supports reading all formats of Iceberg tables available at this moment: Parquet, Avro, and ORC.
+No need to provide actual table format, it will be discovered automatically.
+
+For details related to Apache Iceberg table format, please refer to [official docs](https://iceberg.apache.org/#).
+
+## Supported optimizations and features
+
+### Project pushdown
+
+This format plugin supports project and filter pushdown optimizations.
+
+For the case of project pushdown, only specified in the query columns will be read, even if it is a nested column. In
+conjunction with column-oriented formats like Parquet or ORC, it allows improving reading performance significantly.
+
+### Filter pushdown
+
+For the case of filter pushdown, all expressions supported by Iceberg API will be pushed down, so only data that matches
+the filter expression will be read.
+
+### Schema provisioning
+
+This format plugin supports the schema provisioning feature. Though Iceberg provides table schema, in some cases, it
+might be useful to select data with customized schema, so it can be done using the table function:
+
+```sql
+SELECT int_field,
+       string_field
+FROM table(dfs.tmp.testAllTypes(schema => 'inline=(int_field varchar not null default `error`)'))
+```
+
+In this example, we convert int field to string and return `'error'` literals for null values.
+
+### Querying table metadata
+
+Apache Drill provides the ability to query any kind of table metadata Iceberg can return.
+
+At this point, Apache Iceberg has the following metadata kinds:
+
+* ENTRIES
+* FILES
+* HISTORY
+* SNAPSHOTS
+* MANIFESTS
+* PARTITIONS
+* ALL_DATA_FILES
+* ALL_MANIFESTS
+* ALL_ENTRIES
+
+To query specific metadata, just add the `#metadata_name` suffix to the table location, like in the following example:
+
+```sql
+SELECT *
+FROM dfs.tmp.`testAllTypes#snapshots`
+```
+
+### Querying specific table versions (snapshots)
+
+Apache Icebergs has the ability to track the table modifications and read specific version before or after modifications
Review comment:
```suggestion
Apache Iceberg has the ability to track the table modifications and read specific version before or after modifications
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
> Format plugin for Apache Iceberg
> --------------------------------
>
> Key: DRILL-8027
> URL: https://issues.apache.org/jira/browse/DRILL-8027
> Project: Apache Drill
> Issue Type: New Feature
> Affects Versions: 1.20.0
> Reporter: Vova Vysotskyi
> Assignee: Vova Vysotskyi
> Priority: Major
> Labels: plugin
> Fix For: 1.20.0
>
>
> Implement a format plugin for Apache Iceberg.
> Plugin should be able to:
> - support reading data from Iceberg tables in Parquet, Avro, and ORC formats
> - push down fields used in the project
> - push down supported filter expressions
> - split and parallelize reading tasks
> - provide a way for specifying Iceberg-specific configurations
> - read specific snapshot versions if configured
> - read table metadata (entries, files, history, snapshots, manifests, partitions, etc.)
> - support schema provisioning
--
This message was sent by Atlassian Jira
(v8.20.1#820001)