alexeykudinkin commented on a change in pull request #4789:
URL: https://github.com/apache/hudi/pull/4789#discussion_r813390665
##########
File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala
##########
@@ -323,32 +322,60 @@ private object HoodieMergeOnReadRDD {
   def scanLog(split: HoodieMergeOnReadFileSplit, logSchema: Schema, config: Configuration): HoodieMergedLogRecordScanner = {
     val fs = FSUtils.getFs(split.tablePath, config)
-    val partitionPath: String = if (split.logPaths.isEmpty || split.logPaths.get.asJava.isEmpty) {
-      null
+    val logFiles = split.logFiles.get
+
+    if (HoodieTableMetadata.isMetadataTable(split.tablePath)) {
Review comment:
It is, but it's actually using its own standalone scanner (meaning it can't be read with the default one).
I agree that we should trim down such bifurcations as much as possible and instead make sure that existing Hudi components are extended in a way that can handle the metadata table (MT), without the need for a specialized one.
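For illustration, here's a minimal sketch of the bifurcation being discussed (a hypothetical helper, not the PR's actual code; only `HoodieTableMetadata.isMetadataTable` is taken from the diff above):

```scala
import org.apache.hudi.metadata.HoodieTableMetadata

// Sketch: the MOR read path currently branches on whether the split
// belongs to the metadata table (MT), with each branch using a
// different scanner implementation.
def scanLogSketch(tablePath: String): Unit = {
  if (HoodieTableMetadata.isMetadataTable(tablePath)) {
    // MT payloads go through a standalone metadata scanner today; the
    // suggestion above is to extend the default scanner so this branch
    // can eventually be removed.
    ???
  } else {
    // Regular MOR path: HoodieMergedLogRecordScanner (built via its
    // builder in the actual code).
    ???
  }
}
```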
##########
File path: hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##########
@@ -41,8 +107,8 @@ object AvroConversionUtils {
     else {
       val schema = new Schema.Parser().parse(schemaStr)
       val dataType = convertAvroSchemaToStructType(schema)
-      val convertor = AvroConversionHelper.createConverterToRow(schema, dataType)
-      records.map { x => convertor(x).asInstanceOf[Row] }
+      val converter = createConverterToRow(schema, dataType)
Review comment:
@yihua you brought up legitimate concerns; there are a few considerations here:
- First of all, we already depend on `InternalRow` quite a bit (we even have our own `HoodieInternalRow` extension)
- `InternalRow` is a core component of Spark that is unlikely to change substantially (as that would mean that quite a bit of Spark would have to be rewritten to accommodate any substantial changes to it, which, again, I don't think is likely)

At the same time, avoiding the `InternalRow` -> `Row` conversion has considerable performance advantages, and it's exactly how Spark operates internally: all of its internal `Plan`s, expressions, and operators operate on `InternalRow` and defer such deserialization (`InternalRow` to `Row`) only to the cases where you dereference it to an `RDD[Row]`.

All in all, for us to be able to compete w/ Delta on performance, we will have to pull the same tricks they are using and make sure our overhead is as lean as possible.
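To make the cost concrete, here is a minimal sketch (assuming Spark 3.x Catalyst APIs; not Hudi code) of the `InternalRow` -> `Row` deserialization step that the comment suggests deferring:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Every call to the deserializer materializes an external Row from the
// compact internal representation; staying on InternalRow end-to-end
// (as Spark's own plans, expressions, and operators do) skips this
// per-record cost until an RDD[Row] is actually requested.
def toExternalRows(rows: Iterator[InternalRow], schema: StructType): Iterator[Row] = {
  val deserializer = RowEncoder(schema).resolveAndBind().createDeserializer()
  rows.map(deserializer)
}
```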
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]