David-N-Perkins commented on code in PR #11727:
URL: https://github.com/apache/hudi/pull/11727#discussion_r1779498859
##########
hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/ParquetSplitReaderUtil.java:
##########
@@ -546,15 +615,30 @@ private static WritableColumnVector
createWritableColumnVector(
// schema evolution: read the file with a new extended field name.
int fieldIndex =
getFieldIndexInPhysicalType(rowType.getFields().get(i).getName(), groupType);
if (fieldIndex < 0) {
- columnVectors[i] = (WritableColumnVector)
createVectorFromConstant(rowType.getTypeAt(i), null, batchSize);
+ if (groupType.getRepetition().equals(Type.Repetition.REPEATED) &&
!rowType.getTypeAt(i).is(LogicalTypeRoot.ARRAY)) {
+ columnVectors[i] = (WritableColumnVector)
createVectorFromConstant(
+ new ArrayType(rowType.getTypeAt(i).isNullable(),
rowType.getTypeAt(i)), null, batchSize);
+ } else {
+ columnVectors[i] = (WritableColumnVector)
createVectorFromConstant(rowType.getTypeAt(i), null, batchSize);
+ }
} else {
- columnVectors[i] =
- createWritableColumnVector(
- batchSize,
- rowType.getTypeAt(i),
- groupType.getType(fieldIndex),
- descriptors,
- depth + 1);
+ if (descriptors.get(fieldIndex).getMaxRepetitionLevel() > 0 &&
!rowType.getTypeAt(i).is(LogicalTypeRoot.ARRAY)) {
+ columnVectors[i] =
+ createWritableColumnVector(
+ batchSize,
+ new ArrayType(rowType.getTypeAt(i).isNullable(),
rowType.getTypeAt(i)),
Review Comment:
This is done here, at line 470, and in some other files in order to meet the
Parquet field algorithm that pushes multiplicity and structures down to
individual fields. In Parquet, an array of rows is stored as separate arrays
for each field.
This approach does have some limitations. It won't work for multiple nested
arrays and maps. The main problem is that the Flink classes and interface don't
follow that pattern.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]