xiarixiaoyao commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1537984230
> @xiarixiaoyao Thanks for your analysis.
>
> I've tried adding the code block you linked in this PR. The one thing I am seeing from the tests is a new failure, since this "returning_batch" config does not seem to be getting set internally by Spark:
>
> ```
> java.lang.IllegalArgumentException: OPTION_RETURNING_BATCH should always be set for ParquetFileFormat. To workaround this issue, set spark.sql.parquet.enableVectorizedReader=false.
> ```
>
> Do you have any idea why applying this fix from Spark is causing issues? From my understanding the property should be set within Spark: https://github.com/apache/hudi/pull/8082/files
>
> ```scala
> lazy val inputRDD: RDD[InternalRow] = {
>   val options = relation.options +
>     (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)
>   val readFile: (PartitionedFile) => Iterator[InternalRow] =
>     relation.fileFormat.buildReaderWithPartitionValues(
>       sparkSession = relation.sparkSession,
>       dataSchema = relation.dataSchema,
>       partitionSchema = relation.partitionSchema,
>       requiredSchema = requiredSchema,
>       filters = pushedDownFilters,
>       options = options,
>       hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
> ```
>
> It should be set inside `DataSourceScanExec.scala`.

@rahil-c A Hudi MOR table cannot trigger Spark's `FileSourceStrategy` plan, so that code path never runs. Let's pass `FileFormat.OPTION_RETURNING_BATCH` ourselves.
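For readers following along: the suggested fix amounts to merging the returning-batch flag into the reader's options map before Hudi calls `buildReaderWithPartitionValues` itself, since Spark's `DataSourceScanExec` never does it for the MOR code path. Below is a minimal, Spark-free sketch of just that merge. The object name and the literal key string are stand-ins for illustration; real code should reference `org.apache.spark.sql.execution.datasources.FileFormat.OPTION_RETURNING_BATCH` rather than hardcoding the key.

```scala
// Sketch only: shows the option merge Hudi would perform on its own,
// because MOR relations bypass Spark's FileSourceStrategy and so the
// flag is never injected by DataSourceScanExec.
object ReturningBatchSketch {
  // Stand-in for FileFormat.OPTION_RETURNING_BATCH; do not hardcode in real code.
  val OptionReturningBatch = "returning_batch"

  // Merge the flag into the options passed to buildReaderWithPartitionValues.
  def withReturningBatch(options: Map[String, String],
                         supportsColumnar: Boolean): Map[String, String] =
    options + (OptionReturningBatch -> supportsColumnar.toString)

  def main(args: Array[String]): Unit = {
    val opts = withReturningBatch(Map("path" -> "/tmp/table"), supportsColumnar = false)
    println(opts(OptionReturningBatch))
  }
}
```

In the actual reader, `supportsColumnar` would come from whether the vectorized Parquet reader can return batches for the required schema; the merged map is then what gets handed to `fileFormat.buildReaderWithPartitionValues(..., options = opts, ...)`.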
