[jira] [Commented] (PARQUET-2006) Column resolution by ID

ASF GitHub Bot (Jira) Mon, 04 Apr 2022 13:06:04 -0700


    [ https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517086#comment-17517086 ]


ASF GitHub Bot commented on PARQUET-2006:
-----------------------------------------

rdblue commented on code in PR #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r842097255


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -878,11 +880,97 @@ public String getFile() {
     return blocks;
   }
 
-  public void setRequestedSchema(MessageType projection) {
+  private boolean uniqueId(GroupType schema, HashSet<Type.ID> ids) {
+    boolean unique = true;
+    List<Type> fields = schema.getFields();
+    for (Type field : fields) {
+      if (field instanceof PrimitiveType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+      }
+
+      if (field instanceof GroupType) {
+        Type.ID id = field.getId();
+        if (id != null) {
+          if (ids.contains(id)) {
+            return false;
+          }
+          ids.add(id);
+        }
+        if (unique) unique = uniqueId(field.asGroupType(), ids);
+      }
+    }
+    return unique;
+  }
+
+  public MessageType setRequestedSchema(MessageType projection, boolean useColumnId) {
     paths.clear();
-    for (ColumnDescriptor col : projection.getColumns()) {
+    MessageType schema = null;
+    if (useColumnId) {
+      HashSet<Type.ID> ids = new HashSet<>();
+      boolean fileSchemaIdUnique = uniqueId(fileMetaData.getSchema(), ids);
+      if (!fileSchemaIdUnique) {
+        throw new RuntimeException("can't use column id resolution because there are duplicate column ids.");
+      }
+      ids = new HashSet<>();
+      boolean projectionSchemaIdUnique = uniqueId(projection, ids);
+      if (!projectionSchemaIdUnique) {
+        throw new RuntimeException("can't use column id resolution because there are duplicate column ids.");
+      }
+      schema = resetColumnNameBasedOnId(projection);
+    } else {
+      schema = projection;
+    }
+    for (ColumnDescriptor col : schema.getColumns()) {
       paths.put(ColumnPath.get(col.getPath()), col);
     }
+    return schema;
+  }
+
+  private MessageType resetColumnNameBasedOnId(MessageType schema) {

Review Comment:
   This doesn't seem like the right class to contain these utility methods. I'd 
recommend a utility class to handle this.
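
   A minimal sketch of what such a utility class could look like, covering only
   the duplicate-ID check from the diff above. The names `ColumnIdUtil`, `Field`,
   and `hasUniqueIds` are hypothetical and are not part of parquet-mr; `Field` is
   a self-contained stand-in for Parquet's `Type`/`GroupType` so the example runs
   on its own. As in the diff, only the fields under the root are checked, not
   the root itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper class; not an existing parquet-mr API. */
final class ColumnIdUtil {
  private ColumnIdUtil() {}

  /** Stand-in for a schema node (assumption: replaces Type/GroupType here). */
  static final class Field {
    final String name;
    final Integer id;            // null when the field carries no ID
    final List<Field> children;  // empty for primitive fields

    Field(String name, Integer id, Field... children) {
      this.name = name;
      this.id = id;
      this.children = Arrays.asList(children);
    }
  }

  /** Returns true when no ID appears more than once anywhere under root. */
  static boolean hasUniqueIds(Field root) {
    return hasUniqueIds(root, new HashSet<>());
  }

  private static boolean hasUniqueIds(Field group, Set<Integer> seen) {
    for (Field field : group.children) {
      // Set.add returns false when the ID was already present.
      if (field.id != null && !seen.add(field.id)) {
        return false;
      }
      // Recurse into nested groups, sharing the same set of seen IDs.
      if (!field.children.isEmpty() && !hasUniqueIds(field, seen)) {
        return false;
      }
    }
    return true;
  }
}
```

   With a helper along these lines, `setRequestedSchema` would only need to call
   the check for the file schema and for the projection, and the reader class
   would not have to carry the traversal logic itself.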





> Column resolution by ID
> -----------------------
>
>                 Key: PARQUET-2006
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2006
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> Parquet relies on column names. In many use cases, e.g. schema resolution,
> this is a problem. Iceberg uses IDs and stores ID/name mappings.
> This Jira is to add support for column resolution by ID.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
