David-N-Perkins commented on code in PR #11727:
URL: https://github.com/apache/hudi/pull/11727#discussion_r1779498859


##########
hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/ParquetSplitReaderUtil.java:
##########
@@ -546,15 +615,30 @@ private static WritableColumnVector createWritableColumnVector(
           // schema evolution: read the file with a new extended field name.
           int fieldIndex = getFieldIndexInPhysicalType(rowType.getFields().get(i).getName(), groupType);
           if (fieldIndex < 0) {
-            columnVectors[i] = (WritableColumnVector) createVectorFromConstant(rowType.getTypeAt(i), null, batchSize);
+            if (groupType.getRepetition().equals(Type.Repetition.REPEATED) && !rowType.getTypeAt(i).is(LogicalTypeRoot.ARRAY)) {
+              columnVectors[i] = (WritableColumnVector) createVectorFromConstant(
+                  new ArrayType(rowType.getTypeAt(i).isNullable(), rowType.getTypeAt(i)), null, batchSize);
+            } else {
+              columnVectors[i] = (WritableColumnVector) createVectorFromConstant(rowType.getTypeAt(i), null, batchSize);
+            }
           } else {
-            columnVectors[i] =
-                createWritableColumnVector(
-                    batchSize,
-                    rowType.getTypeAt(i),
-                    groupType.getType(fieldIndex),
-                    descriptors,
-                    depth + 1);
+            if (descriptors.get(fieldIndex).getMaxRepetitionLevel() > 0 && !rowType.getTypeAt(i).is(LogicalTypeRoot.ARRAY)) {
+              columnVectors[i] =
+                  createWritableColumnVector(
+                      batchSize,
+                      new ArrayType(rowType.getTypeAt(i).isNullable(), rowType.getTypeAt(i)),
Review Comment:
   This is done here, at line 470, and in several other files to match Parquet's storage model, which pushes repetition (multiplicity) and structure down to the individual leaf fields. In Parquet, an array of rows is stored as a separate array for each field.
   This approach does have limitations: it won't handle multiply nested arrays and maps, mainly because the Flink classes and interfaces don't follow that pattern.
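   To make the Parquet layout described above concrete, here is a minimal, self-contained sketch (plain Java, not Hudi/Flink code; the `Row`, `xColumn`, and `yColumn` names are hypothetical): an array of rows is "shredded" into one values array per leaf field, so a repeated group of (x, y) pairs becomes two parallel columns rather than one column of row objects.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of Parquet-style shredding: a repeated group of rows is
// stored column-wise, one array per leaf field. Real Parquet additionally
// tracks repetition/definition levels per leaf to reassemble nesting.
public class ColumnShredding {
  static final class Row {
    final int x;
    final String y;
    Row(int x, String y) { this.x = x; this.y = y; }
  }

  // Extract the per-field column for x: the repetition of the group is
  // effectively pushed down onto this individual field.
  static List<Integer> xColumn(List<Row> rows) {
    List<Integer> xs = new ArrayList<>();
    for (Row r : rows) {
      xs.add(r.x);
    }
    return xs;
  }

  // Same for y: each leaf field gets its own separate array.
  static List<String> yColumn(List<Row> rows) {
    List<String> ys = new ArrayList<>();
    for (Row r : rows) {
      ys.add(r.y);
    }
    return ys;
  }

  public static void main(String[] args) {
    List<Row> rows = List.of(new Row(1, "a"), new Row(2, "b"));
    System.out.println(xColumn(rows)); // [1, 2]
    System.out.println(yColumn(rows)); // [a, b]
  }
}
```

   This is why the patch wraps a non-ARRAY Flink logical type in an `ArrayType` when the physical Parquet field is repeated: the reader sees per-field arrays even when the Flink schema does not.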



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
