Junyewu commented on issue #7417:
URL: https://github.com/apache/hudi/issues/7417#issuecomment-1346095524

   The following PR, introduced in hudi-0.11.0-rc1, again causes slow-load issues:
   [[HUDI-2779](https://issues.apache.org/jira/browse/HUDI-2779)] Cache BaseDir if HudiTableNotFound Exception thrown (https://github.com/apache/hudi/pull/4014)
   
   
   
   When I removed that code in hudi-0.12.1, the slow-load problem was alleviated.
   ```
         if (baseDir != null) {
        // Check whether baseDir is in nonHoodiePathCache
   //        if (nonHoodiePathCache.contains(baseDir.toString())) {
   //          if (LOG.isDebugEnabled()) {
   //            LOG.debug("Accepting non-hoodie path from cache: " + path);
   //          }
   //          return true;
   //        }
   ```
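   For context, here is a minimal, self-contained sketch of the caching behavior that HUDI-2779 added (the class and method names below are hypothetical stand-ins, not the actual Hudi code): once a base directory is found not to contain a Hudi table, it is remembered, so later files under the same base dir skip the storage probe. The `contains()` check commented out above is exactly that fast path.

   ```java
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical sketch of the nonHoodiePathCache fast path -- not the
   // actual HoodieROTablePathFilter implementation.
   public class NonHoodiePathCacheSketch {
     // Base directories already known NOT to contain a Hudi table.
     private final Set<String> nonHoodiePathCache = ConcurrentHashMap.newKeySet();
     // Counts storage probes, to make the short-circuit visible.
     int storageProbes = 0;

     /** Returns true when the path should be accepted as a plain (non-Hudi) file. */
     public boolean accept(String path, String baseDir) {
       if (nonHoodiePathCache.contains(baseDir)) {
         return true; // cached: accept without touching storage again
       }
       storageProbes++;
       boolean isHoodieTable = probeForHoodieMetadata(baseDir);
       if (!isHoodieTable) {
         nonHoodiePathCache.add(baseDir); // the caching added by HUDI-2779
         return true;
       }
       return false;
     }

     // Stand-in for scanning the filesystem for a .hoodie/ metadata folder.
     private boolean probeForHoodieMetadata(String baseDir) {
       return false; // pretend nothing under this dir is a Hudi table
     }
   }
   ```

   With the `contains()` check disabled as in the diff above, every file falls through to the storage probe, so the observed speed-up on this path is counterintuitive and worth investigating.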
   
   Example:
   ```
   ==submit application==
   # sudo -u hive spark-sql --master yarn --conf spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter --jars s3://bucket1/hudi-spark3.1-bundle_2.12-0.12.1.jar
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   22/12/12 16:02:04 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
   22/12/12 16:02:05 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
   22/12/12 16:02:05 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
   22/12/12 16:02:14 WARN Utils: spark.executor.instances less than spark.dynamicAllocation.minExecutors is invalid, ignoring its setting, please update your configs.
   22/12/12 16:02:14 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
   Spark master: yarn, Application Id: application_1660903282590_9234
   
   ==run load==
   spark-sql> create or replace temporary view src_order_cate_query using parquet options(path 's3://bucket1/search_offline/src_order_cate_query/');    -- this path has 570 partitions, ~54,000 parquet files
   22/12/12 16:04:02 WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
   Time taken: 97.353 seconds
   ```

