danny0405 commented on PR #8082: URL: https://github.com/apache/hudi/pull/8082#issuecomment-1535715859
> > If we'd like to have this fix in the 0.13.1 release without introducing performance problems for existing Spark versions, could we consider the following to triage the scope of impact?
> >
> > (1) Could we disable the optimization rule of nested schema pruning for Spark 3.3.2 only, and see if the tests pass (without changing the vectorized reader config)? This is done by not adding `org.apache.spark.sql.execution.datasources.Spark33NestedSchemaPruning` for Spark 3.3.2 in [HoodieAnalysis](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala#L132).
> > (2) If the above does not work, could we disable the vectorized reader for Spark 3.3.2 only, and still use Spark 3.3.1 as the compile dependency in this case?
> > (3) Could we also list all the failed tests and see what they have in common, for further investigation?
>
> @yihua
> (1) When I tried disabling the optimization rule and running the tests, the issue was still present in several of the failed tests.
> (2) I think this path can work; I added a check so that this config is set to true only for Spark versions other than 3.3.2. I'm not sure what you mean about using Spark 3.3.1 as the compile dependency, though.
> (3) The failed tests are listed above.
>
> cc @danny0405 @xiarixiaoyao

My concern is that whole-stage code generation does not support the vectorized reader at all here; I'm speculating that older Spark versions do not throw only because they do not trigger the whole-stage codegen optimization. In any case, I think we should make the vectorized reader flag configurable, but maybe not in this patch.
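The version-gated approach in (2) could be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the object and method names (`VectorizedReaderGate`, `shouldEnableVectorizedReader`) are made up, and the real change would live inside Hudi's Spark datasource code where the reader config is set.

```scala
// Hypothetical sketch of gating the vectorized reader flag on the exact
// Spark version string, as discussed above. Spark 3.3.2 is the version
// reported to hit the whole-stage codegen incompatibility, so the flag
// stays off for that version only.
object VectorizedReaderGate {
  def shouldEnableVectorizedReader(sparkVersion: String): Boolean =
    sparkVersion != "3.3.2"

  def main(args: Array[String]): Unit = {
    println(shouldEnableVectorizedReader("3.3.1")) // true
    println(shouldEnableVectorizedReader("3.3.2")) // false
  }
}
```

An exact string match is deliberately narrow: it limits the behavior change to the one version where the failure was observed, at the cost of needing an update if later patch releases (e.g. a future 3.3.x) show the same problem.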
