clintropolis commented on code in PR #13294:
URL: https://github.com/apache/druid/pull/13294#discussion_r1012286113
##########
extensions-core/parquet-extensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java:
##########
@@ -173,12 +183,38 @@ optional group my_list (LIST) {
vals.add(convertPrimitiveField(g, 0, i, binaryAsString));
} else {
Group listItem = g.getGroup(0, i);
- vals.add(listItem);
+ vals.add(convertListElement(listItem, binaryAsString));
}
}
return vals;
}
+ private static Object convertListElement(Group listItem, boolean
binaryAsString)
+ {
+ if (
+ listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+ !listItem.getType().isPrimitive() &&
+ listItem.getType().asGroupType().getFieldCount() == 1 &&
+ listItem.getType().getFields().get(0).isPrimitive()
+ ) {
+ // nullable primitive list elements can have a repeating wrapper
element, peel it off
+ return convertPrimitiveField(listItem, 0, binaryAsString);
+ } else if (
+ listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+ listItem.getFieldRepetitionCount(0) == 1 &&
+ listItem.getType().getFieldCount() == 1 &&
+ listItem.getType().getName().equalsIgnoreCase("list") &&
+ listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
+ listItem.getGroup(0,
0).getType().isRepetition(Type.Repetition.OPTIONAL)
Review Comment:
>should listItem.getType().getFieldCount() == 1 be checked just after
listItem.getType().isRepetition(Type.Repetition.REPEATED)
yeah, i think that makes more sense, will fix
>could we use
listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType())
for list check and assume the fieldName for 0 index to be element 'type'? was
checking
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
and it seemed like the name of the field in the list block could be anything
and not always 'element'. totally fine if you have thought about this and want
to tackle a particular case
it 100% does seem better to use `LogicalTypeAnnotation` stuff. Poking at
this fix was the first time I've seen these things because it didn't actually
exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but
this functionality was added in 1.11.0 😅. I'll look into how difficult it is to
translate this (and probably logical map conversion) to use this instead of
`OriginalType` and checking stuff by hand.
>should we do convertListElement(listItem.getGroup(0, 0) in the return?
I think it isn't necessary the way things work right now; at the
`ParquetGroupConverter` layer, this conversion is lazy in the sense that it
only converts the current field being asked for and isn't recursive. We do want
to remove the extra wrapper, but thats all we want to do here. The interaction
between jsonpath/jq paths in the json provider digs through things to find
stuff, and the flattenermaker performs any final cleanup for things leaving
conversion before the make it into druid, so the recursive conversion is done
there (well before this patch these things could actually leak, that is a
change in this PR in preparation for supporting nested column stuffs).
##########
extensions-core/parquet-extensions/src/main/java/org/apache/druid/data/input/parquet/simple/ParquetGroupConverter.java:
##########
@@ -173,12 +183,38 @@ optional group my_list (LIST) {
vals.add(convertPrimitiveField(g, 0, i, binaryAsString));
} else {
Group listItem = g.getGroup(0, i);
- vals.add(listItem);
+ vals.add(convertListElement(listItem, binaryAsString));
}
}
return vals;
}
+ private static Object convertListElement(Group listItem, boolean
binaryAsString)
+ {
+ if (
+ listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+ !listItem.getType().isPrimitive() &&
+ listItem.getType().asGroupType().getFieldCount() == 1 &&
+ listItem.getType().getFields().get(0).isPrimitive()
+ ) {
+ // nullable primitive list elements can have a repeating wrapper
element, peel it off
+ return convertPrimitiveField(listItem, 0, binaryAsString);
+ } else if (
+ listItem.getType().isRepetition(Type.Repetition.REPEATED) &&
+ listItem.getFieldRepetitionCount(0) == 1 &&
+ listItem.getType().getFieldCount() == 1 &&
+ listItem.getType().getName().equalsIgnoreCase("list") &&
+ listItem.getType().getFieldName(0).equalsIgnoreCase("element") &&
+ listItem.getGroup(0,
0).getType().isRepetition(Type.Repetition.OPTIONAL)
Review Comment:
>should listItem.getType().getFieldCount() == 1 be checked just after
listItem.getType().isRepetition(Type.Repetition.REPEATED)
yeah, i think that makes more sense, will fix
>could we use
listItem.getType().getLogicalTypeAnnotation().equals(LogicalTypeAnnotation.listType())
for list check and assume the fieldName for 0 index to be element 'type'? was
checking
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
and it seemed like the name of the field in the list block could be anything
and not always 'element'. totally fine if you have thought about this and want
to tackle a particular case
it 100% does seem better to use `LogicalTypeAnnotation` stuff. Poking at
this fix was the first time I've seen these things because it didn't actually
exist when I originally wrote this code in 2018 using parquet-mr 1.10.0, but
this functionality was added in 1.11.0 😅. I'll look into how difficult it is to
translate this (and probably logical map conversion) to use this instead of
`OriginalType` and checking stuff by hand.
>should we do convertListElement(listItem.getGroup(0, 0) in the return?
I think it isn't necessary the way things work right now; at the
`ParquetGroupConverter` layer, this conversion is lazy in the sense that it
only converts the current field being asked for and isn't recursive. We do want
to remove the extra wrapper, but thats all we want to do here. The interaction
between jsonpath/jq paths in the json provider digs through things to find
stuff, and the flattenermaker performs any final cleanup for things leaving
conversion before the make it into druid, so the recursive conversion is done
there (well before this patch these things could actually leak, that is a
change in this PR in preparation for supporting nested column stuffs).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]