Re: [PR] [HUDI-8552] Use fg reader for compaction [hudi]

via GitHub Wed, 27 Nov 2024 10:57:11 -0800


yihua commented on code in PR #12343:
URL: https://github.com/apache/hudi/pull/12343#discussion_r1861129515



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkTaskContextSupplier.java:
##########
@@ -90,4 +100,18 @@ public Option<String> getProperty(EngineProperty prop) {
     throw new HoodieException("Unknown engine property :" + prop);
   }
 
+  // This reader context is used to read records before write, like compaction, clustering.
+  @Override
+  public Option<HoodieReaderContext> getReaderContext(HoodieTableMetaClient metaClient, boolean useReaderContext) {
+    if (useReaderContext) {
+      SparkParquetReader reader = SparkAdapterSupport$.MODULE$.sparkAdapter().createParquetFileReader(
+          false, SQLConf.get(), new HashMap<>(), (Configuration) metaClient.getStorageConf().unwrap());
+      return Option.of(new SparkFileFormatInternalRowReaderContext(
+          reader,
+          metaClient.getTableConfig().getRecordKeyFields().get()[0],
+          new ArrayBuffer<>(),
+          new ArrayBuffer<>()));
+    }
+    return Option.empty();
+  }

Review Comment:
   As I mentioned, you'll need to instantiate the parquet reader on the 
driver first and then broadcast it to the executors, following the same 
pattern as `HoodieFileGroupReaderBasedParquetFileFormat` 
(https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedParquetFileFormat.scala#L152)
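
The driver-then-broadcast pattern requested in this comment can be illustrated with a minimal plain-Java sketch. It is only a stand-in, not the Hudi or Spark implementation: `FakeReader`, `broadcast`, and `receive` are hypothetical names, and Java serialization simulates the hand-off that in Spark would go through `SparkContext.broadcast` on the driver and `Broadcast.value` on the executors. The point it shows is that the reader is built once, must be serializable, and each task works from the shipped copy instead of constructing its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class BroadcastSketch {

  // Hypothetical stand-in for the parquet reader; not a Hudi or Spark class.
  static final class FakeReader implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String format;
    private final boolean vectorized;

    FakeReader(String format, boolean vectorized) {
      this.format = format;
      this.vectorized = vectorized;
    }

    String describe() {
      return format + ":vectorized=" + vectorized;
    }
  }

  // "Driver" side: the reader is built once and serialized for shipping.
  // Assumption: this mimics what a Spark broadcast does conceptually.
  static byte[] broadcast(FakeReader reader) {
    try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(reader);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // "Executor" side: each task deserializes the shipped reader
  // instead of constructing a new one from local configuration.
  static FakeReader receive(byte[] payload) {
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(payload))) {
      return (FakeReader) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Instantiate once on the "driver".
    FakeReader driverReader = new FakeReader("parquet", false);
    byte[] payload = broadcast(driverReader);

    // Simulate three tasks, each reading the broadcast value.
    for (int task = 0; task < 3; task++) {
      FakeReader local = receive(payload);
      System.out.println("task " + task + " uses " + local.describe());
    }
  }
}
```

In the PR's context this would mean `getReaderContext` receives an already-broadcast reader rather than calling `createParquetFileReader` inside the task; how that is wired through `SparkTaskContextSupplier` is left to the author.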



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
