[GitHub] [parquet-mr] rdblue commented on a diff in pull request #950: PARQUET-2006: Column resolution by ID

GitBox Mon, 04 Apr 2022 12:58:52 -0700


rdblue commented on code in PR #950:
URL: https://github.com/apache/parquet-mr/pull/950#discussion_r842091649



##########
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/SchemaCompatibilityValidator.java:
##########
@@ -170,6 +174,24 @@ public Void visit(Not not) {
 
   private <T extends Comparable<T>> void validateColumn(Column<T> column) {
     ColumnPath path = column.getColumnPath();
+    if (path == null) {
+      HashSet<Type.ID> ids = new HashSet<>();
+      Type.ID id = column.getColumnId();
+      List<ColumnDescriptor> columnDescriptors = new 
ArrayList<>(columnsAccordingToSchema.values());
+      for (ColumnDescriptor columnDescriptor : columnDescriptors) {
+        Type.ID columnId = columnDescriptor.getId();
+        if (columnId != null) {
+          if (ids.contains(columnId)) {
+            throw new RuntimeException("duplicate id");
+          }
+          ids.add(columnId);
+          if (columnId.intValue() == id.intValue()) {
+            column.setColumnPath(ColumnPath.get(columnDescriptor.getPath()));

Review Comment:
   I appreciate not wanting to change the validation logic, but making the 
filter `Column` mutable just to be able to add the path seems like a bad idea 
in general. What happens if the filter is reused and the path changes? It looks 
like this would use the first path for a column that was encountered.
   
   I think this should validate directly using IDs instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-mr] rdblue commented on a diff in pull request #950: PARQUET-2006: Column resolution by ID

Reply via email to