xiarixiaoyao commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1537984230
> @xiarixiaoyao Thanks for your analysis.
>
> I've tried adding the code block you linked in this PR. The one thing I am seeing from the tests is a new failure, since this "returning_batch" config does not seem to be getting set internally by Spark:
>
> ```
> java.lang.IllegalArgumentException: OPTION_RETURNING_BATCH should always be set for ParquetFileFormat. To workaround this issue, set spark.sql.parquet.enableVectorizedReader=false.
> ```
>
> Do you have any idea why applying this fix from Spark is causing issues? From my understanding the property should be set within Spark: https://github.com/apache/hudi/pull/8082/files
>
> ```scala
> lazy val inputRDD: RDD[InternalRow] = {
>   val options = relation.options +
>     (FileFormat.OPTION_RETURNING_BATCH -> supportsColumnar.toString)
>   val readFile: (PartitionedFile) => Iterator[InternalRow] =
>     relation.fileFormat.buildReaderWithPartitionValues(
>       sparkSession = relation.sparkSession,
>       dataSchema = relation.dataSchema,
>       partitionSchema = relation.partitionSchema,
>       requiredSchema = requiredSchema,
>       filters = pushedDownFilters,
>       options = options,
>       hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))
> ```
>
> It should be set inside `DataSourceScanExec.scala`.

@rahil-c A Hudi MOR table cannot trigger Spark's `FileSourceStrategy` plan, so that code path never runs. Let's pass `FileFormat.OPTION_RETURNING_BATCH` ourselves.
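For readers following along: the suggested fix amounts to merging the returning-batch flag into the reader's options map before Hudi calls `buildReaderWithPartitionValues` itself, since Spark's `DataSourceScanExec` never does it for the MOR code path. Below is a minimal, Spark-free sketch of just that merge. The object name and the literal key string are stand-ins for illustration; real code should reference `org.apache.spark.sql.execution.datasources.FileFormat.OPTION_RETURNING_BATCH` rather than hardcoding the key.

```scala
// Sketch only: shows the option merge Hudi would perform on its own,
// because MOR relations bypass Spark's FileSourceStrategy and so the
// flag is never injected by DataSourceScanExec.
object ReturningBatchSketch {
  // Stand-in for FileFormat.OPTION_RETURNING_BATCH; do not hardcode in real code.
  val OptionReturningBatch = "returning_batch"

  // Merge the flag into the options passed to buildReaderWithPartitionValues.
  def withReturningBatch(options: Map[String, String],
                         supportsColumnar: Boolean): Map[String, String] =
    options + (OptionReturningBatch -> supportsColumnar.toString)

  def main(args: Array[String]): Unit = {
    val opts = withReturningBatch(Map("path" -> "/tmp/table"), supportsColumnar = false)
    println(opts(OptionReturningBatch))
  }
}
```

In the actual reader, `supportsColumnar` would come from whether the vectorized Parquet reader can return batches for the required schema; the merged map is then what gets handed to `fileFormat.buildReaderWithPartitionValues(..., options = opts, ...)`.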
