codejoyan edited a comment on issue #3581:
URL: https://github.com/apache/hudi/issues/3581#issuecomment-920313961


   Hi @xushiyan / @vinothchandar
   
   Regarding the question 'Not sure if "Listing leaf files ..." shows up with 
file listings enabled': I thought the same. But I can see that the metadata 
table is created while writing. Doesn't that mean the read should use the 
metadata table instead of doing a file listing? The .hoodie folder has a 
metadata/ directory:
   
          295  2021-09-15T17:50:46Z  gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/hoodie.properties
                                     gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/.temp/
                                     gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/archived/
                                     gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/metadata/
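
   A minimal sketch of one thing that could be checked on the read side. It 
assumes the reader does not pick up the metadata table by itself and needs 
`hoodie.metadata.enable` passed as a read option; that is an assumption, not 
something confirmed in this thread. The other two options are the ones the 
data source prints in the log snippet of this comment. `hudi_read_options` and 
`read_snapshot` are hypothetical helper names.

```python
def hudi_read_options(use_metadata_table=True):
    """Build read-side options for a snapshot query on the Hudi table."""
    return {
        # These two values appear in the "Constructing hoodie" log line.
        "hoodie.datasource.query.type": "snapshot",
        "hoodie.file.index.enable": "true",
        # Assumption: the reader must be told explicitly to use the
        # metadata table; whether this removes the listing is untested.
        "hoodie.metadata.enable": str(use_metadata_table).lower(),
    }


def read_snapshot(spark, base_path):
    """Load the table with the options above (needs a live SparkSession)."""
    return spark.read.format("hudi").options(**hudi_read_options()).load(base_path)


print(hudi_read_options())
```

   Comparing the driver log of a read with and without the extra option would 
show whether the per-partition "Building file system view" entries go away.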
   
   Also, the log shows the time taken to load each partition. The table has 
daily partitions and 3 months of data, so listing about 90 partitions takes 
> 2 minutes. The log snippet is below. I also see this warning message:
   21/09/15 18:52:22 WARN 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache: Evicting cached 
table partition metadata from memory due to size constraints 
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may 
impact query planning performance.
   
   **Log Snippet**
   21/09/15 18:51:00 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:00 INFO org.apache.hudi.DefaultSource: Is bootstrapped table 
=> false
   21/09/15 18:51:00 WARN org.apache.hudi.DefaultSource: Loading Base File Only 
View.
   21/09/15 18:51:00 INFO org.apache.hudi.DefaultSource: Constructing hoodie 
(as parquet) data source with options :Map(hoodie.file.index.enable -> true, 
hoodie.datasource.query.type -> snapshot, path -> 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1)
   21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading HoodieTableMetaClient from 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:08 INFO org.apache.hudi.common.fs.FSUtils: Hadoop 
Configuration: fs.defaultFS: [hdfs://wl1-hudi-delta-poc-m], 
Config:[Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
yarn-site.xml, resource-types.xml, mapred-default.xml, mapred-site.xml, 
hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml], FileSystem: 
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@3dbf8b79]
   21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableConfig: 
Loading table properties from 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/hoodie.properties
   21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) 
from gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient: 
Loading Active commit timeline for 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:08 INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants 
[[20210912173234__clean__COMPLETED], [20210912173527__clean__COMPLETED], 
[20210912173901__clean__COMPLETED], [20210912174152__clean__COMPLETED], 
[20210912174502__clean__COMPLETED], [20210912174759__clean__COMPLETED], 
[20210912174924__clean__COMPLETED], [20210912175218__clean__COMPLETED], 
[20210912175537__clean__COMPLETED], [20210912175855__clean__COMPLETED], 
[20210912180207__clean__COMPLETED], [20210912180333__clean__COMPLETED], 
[20210912180725__clean__COMPLETED], [20210912180924__clean__COMPLETED], 
[20210915123142__clean__COMPLETED], [20210915123711__clean__COMPLETED], 
[20210915124117__clean__COMPLETED], [20210915124530__clean__COMPLETED], 
[20210915125255__clean__COMPLETED], [20210915130150__clean__COMPLETED], 
[20210915144750__commit__COMPLETED], [20210915145319__commit__COMPLETED], 
[20210915145759__commit__COMPLETED], [20210915150055__commit__COMPLETED], 
[20210915150548__commit__COMPLETED], [20210915151050__commit__COMPLETED], 
[20210915151538__commit__COMPLETED], [20210915152045__commit__COMPLETED], 
[20210915152606__commit__COMPLETED], [20210915153100__commit__COMPLETED], 
[20210915153610__commit__COMPLETED], [20210915154035__commit__COMPLETED], 
[20210915154500__commit__COMPLETED], [20210915155004__commit__COMPLETED], 
[20210915155512__commit__COMPLETED], [20210915160015__commit__COMPLETED], 
[20210915160528__commit__COMPLETED], [20210915161020__commit__COMPLETED], 
[20210915161534__commit__COMPLETED], [20210915161949__commit__COMPLETED], 
[20210915162501__commit__COMPLETED], [20210915162953__commit__COMPLETED], 
[20210915163456__commit__COMPLETED], [20210915164016__commit__COMPLETED], 
[20210915164523__commit__COMPLETED], [20210915165033__commit__COMPLETED], 
[20210915165545__commit__COMPLETED], [20210915170102__commit__COMPLETED], 
[20210915182112__commit__COMPLETED], [20210915182251__replacecommit__COMPLETED]]
   21/09/15 18:51:08 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating InMemory 
based view for basePath 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:08 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 156 ms to 
read  1 instants, 3000 replaced file groups
   21/09/15 18:51:08 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   21/09/15 18:51:08 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file 
system view for partition (WMT-US/2020-11-01)
   21/09/15 18:51:08 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in 
partition (WMT-US/2020-11-01) =1501, Time taken =204
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups 
for partition :WMT-US/2020-11-01, #FileGroups=1500
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView: 
NumFiles=1501, NumFileGroups=1500, FileGroupsCreationTime=91, StoreTimeTaken=3
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load 
partition (WMT-US/2020-11-01) =340
   21/09/15 18:51:09 INFO org.apache.hudi.hadoop.HoodieROTablePathFilter: Based 
on hoodie metadata from base path: 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1, caching 1500 files 
under 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/WMT-US/2020-11-01
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 117 ms to 
read  1 instants, 3000 replaced file groups
   21/09/15 18:51:09 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.FileSystemViewManager: Creating InMemory 
based view for basePath 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 91 ms to 
read  1 instants, 3000 replaced file groups
   21/09/15 18:51:09 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0 
files in pending clustering operations
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file 
system view for partition (WMT-US/2020-11-02)
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in 
partition (WMT-US/2020-11-02) =1501, Time taken =177
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups 
for partition :WMT-US/2020-11-02, #FileGroups=1500
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView: 
NumFiles=1501, NumFileGroups=1500, FileGroupsCreationTime=63, StoreTimeTaken=1
   21/09/15 18:51:09 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load 
partition (WMT-US/2020-11-02) =280
   21/09/15 18:51:09 INFO org.apache.hudi.hadoop.HoodieROTablePathFilter: Based 
on hoodie metadata from base path: 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1, caching 1500 files 
under 
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/WMT-US/2020-11-02
   21/09/15 18:51:10 INFO 
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 93 ms to 
read  1 instants, 3000 replaced file groups
   ....
   21/09/15 18:52:22 WARN 
org.apache.spark.sql.execution.datasources.SharedInMemoryCache: Evicting cached 
table partition metadata from memory due to size constraints 
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may 
impact query planning performance.
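
   As a rough way to see how much of the wait comes from building the 
per-partition file system view, the "Time to load partition" entries in the 
log snippet above can be summed and scaled to 90 partitions. This is a sketch 
under assumptions: the values are taken to be milliseconds (the log does not 
state the unit on those lines), the two sampled partitions are taken as 
representative, and the function names are made up for illustration.

```python
import re

# Two entries copied from the log snippet, including the mail line wraps.
SAMPLE_LOG = """
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-01) =340
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-02) =280
"""

LOAD_RE = re.compile(r"Time to load partition \(([^)]+)\) =(\d+)")


def partition_load_times(log_text):
    """Return {partition: load time} parsed from driver log text."""
    # Collapse hard line wraps so each log entry matches as one string.
    flat = " ".join(log_text.split())
    return {part: int(value) for part, value in LOAD_RE.findall(flat)}


def extrapolate(times, num_partitions):
    """Scale the mean per-partition load time to num_partitions."""
    return sum(times.values()) / len(times) * num_partitions


times = partition_load_times(SAMPLE_LOG)
print(times)
print(extrapolate(times, 90))  # assumed unit: milliseconds
```

   Running the same parser over the full driver log, instead of the two 
sampled entries, would show whether view building accounts for most of the 
listing time or only part of it.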
   
   Looking forward to some pointers. Let me know if there is any analysis I 
can do on my end that would help.
   
   
   


