jonvex commented on code in PR #13632:
URL: https://github.com/apache/hudi/pull/13632#discussion_r2237990007
##########
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##########
@@ -1248,7 +1249,10 @@ private static Object
rewritePrimaryTypeWithDiffSchemaType(Object oldValue, Sche
return String.valueOf(oldValue);
}
if (oldSchema.getType() == Schema.Type.BYTES) {
- return String.valueOf(oldValue);
+ if (oldValue instanceof ByteBuffer) {
Review Comment:
In my schema evolution PR I changed this to
`StringUtils.fromUTF8Bytes(((ByteBuffer) oldValue).array());`. Will this do the
same thing?
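
The two calls are not equivalent in general: `String.valueOf(oldValue)` on a `ByteBuffer` delegates to the buffer's `toString()`, which describes the buffer's state rather than decoding its contents, whereas decoding the backing array as UTF-8 yields the actual text. A minimal sketch of the difference (plain JDK, no Hudi classes; `StringUtils.fromUTF8Bytes` is assumed to be a thin wrapper over UTF-8 decoding):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferToString {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));

        // String.valueOf delegates to ByteBuffer.toString(), which prints
        // the buffer's state (position/limit/capacity), not its contents.
        String viaValueOf = String.valueOf(buf);

        // Decoding the backing array as UTF-8 yields the actual text.
        String viaDecode = new String(buf.array(), StandardCharsets.UTF_8);

        System.out.println(viaValueOf); // e.g. java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]
        System.out.println(viaDecode);  // hello
    }
}
```

One caveat with `((ByteBuffer) oldValue).array()`: it only works for heap-backed, non-read-only buffers (direct or read-only buffers throw), and it ignores `position()` and `arrayOffset()`, so a sliced buffer would decode extra bytes.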
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/SparkReaderContextFactory.java:
##########
@@ -88,18 +92,26 @@ class SparkReaderContextFactory implements
ReaderContextFactory<InternalRow> {
configs.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE().key(),
sqlConf.getConfString(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE().key()));
configs.set(SQLConf.PARQUET_WRITE_LEGACY_FORMAT().key(),
sqlConf.getConfString(SQLConf.PARQUET_WRITE_LEGACY_FORMAT().key()));
configurationBroadcast = jsc.broadcast(new
SerializableConfiguration(configs));
- // Broadcast: ParquetReader.
- // Spark parquet reader has to be instantiated on the driver and broadcast
to the executors
- SparkParquetReader parquetFileReader =
sparkAdapter.createParquetFileReader(false, sqlConf, options, configs);
- parquetReaderBroadcast = jsc.broadcast(parquetFileReader);
+ // Broadcast: BaseFilereader.
Review Comment:
Are there any downsides to just doing this for all tables? If there isn't
much cost, it would be easier to maintain to always use the path that supports
multiple base formats.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -61,20 +64,21 @@ trait HoodieFormatTrait {
/**
* This class utilizes {@link HoodieFileGroupReader} and its related classes
to support reading
- * from Parquet formatted base files and their log files.
+ * from Parquet or ORC formatted base files and their log files.
*/
Review Comment:
I'm generally ok with this, but there is a method that reports the format as
Parquet, and if you change it to something else, some optimizations won't be
applied. So just be advised.
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -232,7 +231,13 @@ trait SparkAdapter extends Serializable {
def createParquetFileReader(vectorized: Boolean,
sqlConf: SQLConf,
options: Map[String, String],
- hadoopConf: Configuration): SparkParquetReader
+ hadoopConf: Configuration):
SparkColumnarFileReader
+
+ def createOrcFileReader(vectorized: Boolean,
+ sqlConf: SQLConf,
+ options: Map[String, String],
+ hadoopConf: Configuration,
+ dataSchema: StructType): SparkColumnarFileReader
Review Comment:
Why do we need to pass the data schema here? Can we avoid this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]