alexeykudinkin commented on a change in pull request #4789:
URL: https://github.com/apache/hudi/pull/4789#discussion_r813390665
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
##########
@@ -323,32 +322,60 @@ private object HoodieMergeOnReadRDD {
   def scanLog(split: HoodieMergeOnReadFileSplit, logSchema: Schema, config: Configuration): HoodieMergedLogRecordScanner = {
     val fs = FSUtils.getFs(split.tablePath, config)
-    val partitionPath: String = if (split.logPaths.isEmpty || split.logPaths.get.asJava.isEmpty) {
-      null
+    val logFiles = split.logFiles.get
+
+    if (HoodieTableMetadata.isMetadataTable(split.tablePath)) {
Review comment:
It is, but it's actually using its own standalone scanner (meaning it can't be read with the default one).
I agree that we should trim down such bifurcations as much as possible and instead make sure that existing Hudi components are extended in a way that can handle the metadata table (MT), without the need for a specialized one.
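For illustration, here's a minimal sketch of the bifurcation being discussed (a hypothetical helper, not the PR's actual code; only `HoodieTableMetadata.isMetadataTable` is taken from the diff above):

```scala
import org.apache.hudi.metadata.HoodieTableMetadata

// Sketch: the MOR read path currently branches on whether the split
// belongs to the metadata table (MT), with each branch using a
// different scanner implementation.
def scanLogSketch(tablePath: String): Unit = {
  if (HoodieTableMetadata.isMetadataTable(tablePath)) {
    // MT payloads go through a standalone metadata scanner today; the
    // suggestion above is to extend the default scanner so this branch
    // can eventually be removed.
    ???
  } else {
    // Regular MOR path: HoodieMergedLogRecordScanner (built via its
    // builder in the actual code).
    ???
  }
}
```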
##########
File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##########
@@ -41,8 +107,8 @@ object AvroConversionUtils {
     else {
       val schema = new Schema.Parser().parse(schemaStr)
       val dataType = convertAvroSchemaToStructType(schema)
-      val convertor = AvroConversionHelper.createConverterToRow(schema, dataType)
-      records.map { x => convertor(x).asInstanceOf[Row] }
+      val converter = createConverterToRow(schema, dataType)
Review comment:
@yihua you brought up legitimate concerns; there are a few considerations here:
- First of all, we already depend on `InternalRow` quite a bit (we even have our own `HoodieInternalRow` extension)
- `InternalRow` is a core component of Spark that is unlikely to change substantially (as that would mean that quite a bit of Spark would have to be rewritten to accommodate any substantial changes to it, which, again, I don't think is likely)

At the same time, avoiding the `InternalRow` -> `Row` conversion has considerable performance advantages, and it's exactly how Spark operates internally: all of its internal `Plan`s, expressions, and operators operate on `InternalRow` and defer such deserialization (`InternalRow` to `Row`) only to the cases where you dereference it to an `RDD[Row]`.

All in all, for us to be able to compete w/ Delta on performance, we will have to pull the same tricks they are using and make sure our overhead is as lean as possible.
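To make the cost concrete, here is a minimal sketch (assuming Spark 3.x Catalyst APIs; not Hudi code) of the `InternalRow` -> `Row` deserialization step that the comment suggests deferring:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Every call to the deserializer materializes an external Row from the
// compact internal representation; staying on InternalRow end-to-end
// (as Spark's own plans, expressions, and operators do) skips this
// per-record cost until an RDD[Row] is actually requested.
def toExternalRows(rows: Iterator[InternalRow], schema: StructType): Iterator[Row] = {
  val deserializer = RowEncoder(schema).resolveAndBind().createDeserializer()
  rows.map(deserializer)
}
```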
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]