yihua commented on code in PR #12127:
URL: https://github.com/apache/hudi/pull/12127#discussion_r1807469156
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -570,6 +584,9 @@ private HoodieIndexDefinition getFunctionalIndexDefinition(String indexName) {
   }

   private Set<String> getIndexPartitionsToInit(MetadataPartitionType partitionType) {
+    if (dataMetaClient.getIndexMetadata().isEmpty()) {
Review Comment:
```suggestion
if (dataMetaClient.getFunctionalAndSecondaryIndexMetadata().isEmpty()) {
```
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metadata/SparkHoodieBackedTableMetadataWriter.java:
##########
@@ -174,24 +185,40 @@ protected HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, F
     String columnToIndex = indexDefinition.getSourceFields().get(0);
     SQLContext sqlContext = sparkEngineContext.getSqlContext();
-    // Group FileSlices by partition
-    Map<String, List<FileSlice>> partitionToFileSlicesMap = partitionFileSlicePairs.stream()
-        .collect(Collectors.groupingBy(Pair::getKey, Collectors.mapping(Pair::getValue, Collectors.toList())));
+    // Read records and append functional index metadata to every row
+    HoodieData<Row> rowData = sparkEngineContext.parallelize(partitionFilePathPairs, parallelism)
+        .flatMap((SerializableFunction<Pair<String, Pair<String, Long>>, Iterator<Row>>) entry -> {
+          String partition = entry.getKey();
+          Pair<String, Long> filePathSizePair = entry.getValue();
+          String filePath = filePathSizePair.getKey();
+          long fileSize = filePathSizePair.getValue();
+          List<Row> rowsForFilePath = readRecordsAsRows(new StoragePath[] {new StoragePath(filePath)}, sqlContext, metaClient, readerSchema, dataWriteConfig,
+              FSUtils.isBaseFile(new StoragePath(filePath.substring(filePath.lastIndexOf("/") + 1))));
+          List<Row> rowsWithIndexMetadata = SparkMetadataWriterUtils.getRowsWithFunctionalIndexMetadata(rowsForFilePath, partition, filePath, fileSize);
+          return rowsWithIndexMetadata.iterator();
Review Comment:
Let's create a follow-up to use `HoodieData` and extract common logic to
`HoodieBackedTableMetadataWriter` (reading stats, files, etc.) and only leave
Spark specific logic (e.g., Spark transformations) in this class.
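To make the suggested follow-up concrete, here is a minimal, self-contained sketch of the template-method split it describes: the engine-agnostic work (listing files, reading stats) lives in the abstract base writer, and the subclass supplies only the engine-specific transformation. The class and method names below are hypothetical stand-ins, not the actual Hudi API, and simple `List<String>` values stand in for `HoodieData` and file stats.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for HoodieBackedTableMetadataWriter: common logic only.
abstract class BackedTableMetadataWriterSketch {
  // Engine-agnostic logic: iterate file paths and derive per-file rows.
  List<String> buildIndexRecords(List<String> filePaths) {
    List<String> rows = filePaths.stream()
        .map(path -> "row@" + path)   // stands in for reading records/stats from the file
        .collect(Collectors.toList());
    return transform(rows);           // delegate engine-specific work to the subclass
  }

  // Only the engine-specific transformation is left abstract.
  protected abstract List<String> transform(List<String> rows);
}

// Hypothetical stand-in for SparkHoodieBackedTableMetadataWriter.
class SparkBackedTableMetadataWriterSketch extends BackedTableMetadataWriterSketch {
  @Override
  protected List<String> transform(List<String> rows) {
    // In the real writer this would be a Spark transformation over HoodieData<Row>.
    return rows.stream().map(String::toUpperCase).collect(Collectors.toList());
  }
}
```

With this shape, adding another engine means implementing only `transform`, while the file-reading logic is written once in the base class.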
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]