[jira] [Commented] (PARQUET-2006) Column resolution by ID

ASF GitHub Bot (Jira) Mon, 04 Apr 2022 12:59:05 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517081#comment-17517081
 ]


ASF GitHub Bot commented on PARQUET-2006:
-----------------------------------------

rdblue commented on code in PR #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r842091649


##########
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java:
##########
@@ -170,6 +174,24 @@ public Void visit(Not not) {
 
   private <T extends Comparable<T>> void validateColumn(Column<T> column) {
     ColumnPath path = column.getColumnPath();
+    if (path == null) {
+      HashSet<Type.ID> ids = new HashSet<>();
+      Type.ID id = column.getColumnId();
+      List<ColumnDescriptor> columnDescriptors = new 
ArrayList<>(columnsAccordingToSchema.values());
+      for (ColumnDescriptor columnDescriptor : columnDescriptors) {
+        Type.ID columnId = columnDescriptor.getId();
+        if (columnId != null) {
+          if (ids.contains(columnId)) {
+            throw new RuntimeException("duplicate id");
+          }
+          ids.add(columnId);
+          if (columnId.intValue() == id.intValue()) {
+            column.setColumnPath(ColumnPath.get(columnDescriptor.getPath()));

Review Comment:
   I appreciate not wanting to change the validation logic, but making the 
filter `Column` mutable just to be able to add the path seems like a bad idea 
in general. What happens if the filter is reused and the path changes? It looks 
like this would use the first path for a column that was encountered.
   
   I think this should validate directly using IDs instead.





> Column resolution by ID
> -----------------------
>
>                 Key: PARQUET-2006
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2006
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> Parquet relies on the name. In a lot of usages e.g. schema resolution, this 
> would be a problem. Iceberg uses ID and stored Id/name mappings. 
> This Jira is to add column ID resolution support. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (PARQUET-2006) Column resolution by ID

Reply via email to