jonvex commented on code in PR #13632:
URL: https://github.com/apache/hudi/pull/13632#discussion_r2237990007
##########
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##########
@@ -1248,7 +1249,10 @@ private static Object
rewritePrimaryTypeWithDiffSchemaType(Object oldValue, Sche
return String.valueOf(oldValue);
}
if (oldSchema.getType() == Schema.Type.BYTES) {
- return String.valueOf(oldValue);
+ if (oldValue instanceof ByteBuffer) {
Review Comment:
In my schema evolution PR I changed this to
`StringUtils.fromUTF8Bytes(((ByteBuffer) oldValue).array());`. Will this do the
same thing?
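
The two calls are not equivalent in general: `String.valueOf(oldValue)` on a `ByteBuffer` delegates to the buffer's `toString()`, which describes the buffer's state rather than decoding its contents, whereas decoding the backing array as UTF-8 yields the actual text. A minimal sketch of the difference (plain JDK, no Hudi classes; `StringUtils.fromUTF8Bytes` is assumed to be a thin wrapper over UTF-8 decoding):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferToString {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));

        // String.valueOf delegates to ByteBuffer.toString(), which prints
        // the buffer's state (position/limit/capacity), not its contents.
        String viaValueOf = String.valueOf(buf);

        // Decoding the backing array as UTF-8 yields the actual text.
        String viaDecode = new String(buf.array(), StandardCharsets.UTF_8);

        System.out.println(viaValueOf); // e.g. java.nio.HeapByteBuffer[pos=0 lim=5 cap=5]
        System.out.println(viaDecode);  // hello
    }
}
```

One caveat with `((ByteBuffer) oldValue).array()`: it only works for heap-backed, non-read-only buffers (direct or read-only buffers throw), and it ignores `position()` and `arrayOffset()`, so a sliced buffer would decode extra bytes.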
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/SparkReaderContextFactory.java:
##########
@@ -88,18 +92,26 @@ class SparkReaderContextFactory implements
ReaderContextFactory<InternalRow> {
configs.set(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE().key(),
sqlConf.getConfString(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE().key()));
configs.set(SQLConf.PARQUET_WRITE_LEGACY_FORMAT().key(),
sqlConf.getConfString(SQLConf.PARQUET_WRITE_LEGACY_FORMAT().key()));
configurationBroadcast = jsc.broadcast(new
SerializableConfiguration(configs));
- // Broadcast: ParquetReader.
- // Spark parquet reader has to be instantiated on the driver and broadcast
to the executors
- SparkParquetReader parquetFileReader =
sparkAdapter.createParquetFileReader(false, sqlConf, options, configs);
- parquetReaderBroadcast = jsc.broadcast(parquetFileReader);
+ // Broadcast: BaseFilereader.
Review Comment:
Are there any downsides to just doing this for all tables? If there isn't
much cost, it would be easier to maintain to always use the path that supports
multiple base formats.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/HoodieFileGroupReaderBasedFileFormat.scala:
##########
@@ -61,20 +64,21 @@ trait HoodieFormatTrait {
/**
* This class utilizes {@link HoodieFileGroupReader} and its related classes
to support reading
- * from Parquet formatted base files and their log files.
+ * from Parquet or ORC formatted base files and their log files.
*/
Review Comment:
I'm generally ok with this, but there is a method that reports the format as
Parquet, and if you change it to something else, some optimizations won't be
applied. So just be advised.
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala:
##########
@@ -232,7 +231,13 @@ trait SparkAdapter extends Serializable {
def createParquetFileReader(vectorized: Boolean,
sqlConf: SQLConf,
options: Map[String, String],
- hadoopConf: Configuration): SparkParquetReader
+ hadoopConf: Configuration):
SparkColumnarFileReader
+
+ def createOrcFileReader(vectorized: Boolean,
+ sqlConf: SQLConf,
+ options: Map[String, String],
+ hadoopConf: Configuration,
+ dataSchema: StructType): SparkColumnarFileReader
Review Comment:
Why do we need to pass the data schema here? Can we avoid this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]