clintropolis commented on code in PR #13294:
URL: https://github.com/apache/druid/pull/13294#discussion_r1012286113


##########
extensions-core/parquet-extensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java:
##########
@@ -173,12 +183,38 @@ optional group my_list (LIST) {
         vals.add(convertPrimitiveField(g, 0, i, binaryAsString));
       } else {
         Group listItem = g.getGroup(0, i);
-        vals.add(listItem);
+        vals.add(convertListElement(listItem, binaryAsString));
       }
     }
     return vals;
   }
 
+  private static Object convertListElement(Group listItem, boolean 
binaryAsString)
+  {
+    if (
+        listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+        !listItem.getType().isPrimitive() &&
+        listItem.getType().asGroupType().getFieldCount() == 1 &&
+        listItem.getType().getFields().get(0).isPrimitive()
+    ) {
+      // nullable primitive list elements can have a repeating wrapper 
element, peel it off
+      return convertPrimitiveField(listItem, 0, binaryAsString);
+    } else if (
+        listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+        listItem.getFieldRepetitionCount(0) == 1 &&
+        listItem.getType().getFieldCount() == 1 &&
+        listItem.getType().getName().equalsIgnoreCase("list") &&
+        listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
+        listItem.getGroup(0, 
0).getType().isRepetition(Type.Repetition.OPTIONAL)

Review Comment:
   >should listItem.getType().getFieldCount() == 1 be checked just after 
listItem.getType().isRepetition(Type.Repetition.REPEATED)
   
   yeah, i think that makes more sense, will fix
   
   >could we use 
listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType())
 for list check and assume the fieldName for 0 index to be element 'type'? was 
checking 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
 and it seemed like the name of the field in the list block could be anything 
and not always 'element'. totally fine if you have thought about this and want 
to tackle a particular case
   
   it 100% does seem better to use `LogicalTypeAnnotation` stuff. Poking at 
this fix was the first time I've seen these things because it didn't actually 
exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but 
this functionality was added in 1.11.0 😅. I'll look into how difficult it is to 
translate this (and probably logical map conversion) to use this instead of 
`OriginalType` and checking stuff by hand.
   
   >should we do convertListElement(listItem.getGroup(0, 0) in the return?
   I think it isn't necessary the way things work right now; at the 
`ParquetGroupConverter` layer, this conversion is lazy in the sense that it 
only converts the current field being asked for and isn't recursive. We do want 
to remove the extra wrapper, but thats all we want to do here. The interaction 
between jsonpath/jq paths in the json provider digs through things to find 
stuff, and the flattenermaker performs any final cleanup for things leaving 
conversion before the make it into druid, so the recursive conversion is done 
there (well before this patch these things could actually leak, that is a 
change in this PR in preparation for supporting nested column stuffs).



##########
extensions-core/parquet-extensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java:
##########
@@ -173,12 +183,38 @@ optional group my_list (LIST) {
         vals.add(convertPrimitiveField(g, 0, i, binaryAsString));
       } else {
         Group listItem = g.getGroup(0, i);
-        vals.add(listItem);
+        vals.add(convertListElement(listItem, binaryAsString));
       }
     }
     return vals;
   }
 
+  private static Object convertListElement(Group listItem, boolean 
binaryAsString)
+  {
+    if (
+        listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+        !listItem.getType().isPrimitive() &&
+        listItem.getType().asGroupType().getFieldCount() == 1 &&
+        listItem.getType().getFields().get(0).isPrimitive()
+    ) {
+      // nullable primitive list elements can have a repeating wrapper 
element, peel it off
+      return convertPrimitiveField(listItem, 0, binaryAsString);
+    } else if (
+        listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+        listItem.getFieldRepetitionCount(0) == 1 &&
+        listItem.getType().getFieldCount() == 1 &&
+        listItem.getType().getName().equalsIgnoreCase("list") &&
+        listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
+        listItem.getGroup(0, 
0).getType().isRepetition(Type.Repetition.OPTIONAL)

Review Comment:
   >should listItem.getType().getFieldCount() == 1 be checked just after 
listItem.getType().isRepetition(Type.Repetition.REPEATED)
   
   yeah, i think that makes more sense, will fix
   
   >could we use 
listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType())
 for list check and assume the fieldName for 0 index to be element 'type'? was 
checking 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
 and it seemed like the name of the field in the list block could be anything 
and not always 'element'. totally fine if you have thought about this and want 
to tackle a particular case
   
   it 100% does seem better to use `LogicalTypeAnnotation` stuff. Poking at 
this fix was the first time I've seen these things because it didn't actually 
exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but 
this functionality was added in 1.11.0 😅. I'll look into how difficult it is to 
translate this (and probably logical map conversion) to use this instead of 
`OriginalType` and checking stuff by hand.
   
   >should we do convertListElement(listItem.getGroup(0, 0) in the return?
   
   I think it isn't necessary the way things work right now; at the 
`ParquetGroupConverter` layer, this conversion is lazy in the sense that it 
only converts the current field being asked for and isn't recursive. We do want 
to remove the extra wrapper, but thats all we want to do here. The interaction 
between jsonpath/jq paths in the json provider digs through things to find 
stuff, and the flattenermaker performs any final cleanup for things leaving 
conversion before the make it into druid, so the recursive conversion is done 
there (well before this patch these things could actually leak, that is a 
change in this PR in preparation for supporting nested column stuffs).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to