codejoyan edited a comment on issue #3581:
URL: https://github.com/apache/hudi/issues/3581#issuecomment-920313961
Hi @xushiyan / @vinothchandar
To the question of whether "Listing leaf files ..." shows up with file
listings enabled: I thought the same. But I can see that the metadata table
is created while writing. Doesn't that mean the read should use the metadata
table instead of doing a file listing? Here is the listing of the .hoodie folder:
295  2021-09-15T17:50:46Z  gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/hoodie.properties
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/.temp/
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/archived/
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/metadata/
Also, the log shows how long the load takes. I have daily partitions and the
table holds 3 months of data, so listing about 90 partitions takes more than 2
minutes. The log snippet is below. I also see this warning message:
21/09/15 18:52:22 WARN
org.apache.spark.sql.execution.datasources.SharedInMemoryCache: Evicting cached
table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may
impact query planning performance.
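
For reference, the limit in that warning works out to 250 MiB. Below is a minimal sketch of the arithmetic, plus a hypothetical helper (`cache_conf_flag` is my own name, not a Spark API) that formats a `spark-submit --conf` flag for a larger value. The config key is copied from the warning. The 512 MiB figure is an arbitrary example, and I have not verified that raising the limit improves listing time for this table.

```python
# Config key copied verbatim from the warning message.
CACHE_SIZE_KEY = "spark.sql.hive.filesourcePartitionFileCacheSize"
MIB = 1024 * 1024


def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


def cache_conf_flag(size_mib):
    """Format a spark-submit --conf flag for a cache size given in MiB.

    Hypothetical helper for illustration; it only builds a string.
    """
    return f"--conf {CACHE_SIZE_KEY}={size_mib * MIB}"


# Limit reported in the warning, expressed in MiB.
print(bytes_to_mib(262144000))

# Example flag for a larger cache; 512 MiB is an arbitrary assumption.
print(cache_conf_flag(512))
```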
**Log Snippet**
21/09/15 18:51:00 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET)
from gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:00 INFO org.apache.hudi.DefaultSource: Is bootstrapped table
=> false
21/09/15 18:51:00 WARN org.apache.hudi.DefaultSource: Loading Base File Only
View.
21/09/15 18:51:00 INFO org.apache.hudi.DefaultSource: Constructing hoodie
(as parquet) data source with options :Map(hoodie.file.index.enable -> true,
hoodie.datasource.query.type -> snapshot, path ->
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1)
21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:08 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://wl1-hudi-delta-poc-m],
Config:[Configuration: core-default.xml, core-site.xml, yarn-default.xml,
yarn-site.xml, resource-types.xml, mapred-default.xml, mapred-site.xml,
hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml], FileSystem:
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@3dbf8b79]
21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/.hoodie/hoodie.properties
21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET)
from gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading Active commit timeline for
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:08 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210912173234__clean__COMPLETED], [20210912173527__clean__COMPLETED],
[20210912173901__clean__COMPLETED], [20210912174152__clean__COMPLETED],
[20210912174502__clean__COMPLETED], [20210912174759__clean__COMPLETED],
[20210912174924__clean__COMPLETED], [20210912175218__clean__COMPLETED],
[20210912175537__clean__COMPLETED], [20210912175855__clean__COMPLETED],
[20210912180207__clean__COMPLETED], [20210912180333__clean__COMPLETED],
[20210912180725__clean__COMPLETED], [20210912180924__clean__COMPLETED],
[20210915123142__clean__COMPLETED], [20210915123711__clean__COMPLETED],
[20210915124117__clean__COMPLETED], [20210915124530__clean__COMPLETED],
[20210915125255__clean__COMPLETED], [20210915130150__clean__COMPLETED],
[20210915144750__commit__COMPLETED], [20210915145319__commit__COMPLETED],
[20210915145759__commit__COMPLETED], [20210915150055__commit__COMPLETED],
[20210915150548__commit__COMPLETED], [20210915151050__commit__COMPLETED],
[20210915151538__commit__COMPLETED], [20210915152045__commit__COMPLETED],
[20210915152606__commit__COMPLETED], [20210915153100__commit__COMPLETED],
[20210915153610__commit__COMPLETED], [20210915154035__commit__COMPLETED],
[20210915154500__commit__COMPLETED], [20210915155004__commit__COMPLETED],
[20210915155512__commit__COMPLETED], [20210915160015__commit__COMPLETED],
[20210915160528__commit__COMPLETED], [20210915161020__commit__COMPLETED],
[20210915161534__commit__COMPLETED], [20210915161949__commit__COMPLETED],
[20210915162501__commit__COMPLETED], [20210915162953__commit__COMPLETED],
[20210915163456__commit__COMPLETED], [20210915164016__commit__COMPLETED],
[20210915164523__commit__COMPLETED], [20210915165033__commit__COMPLETED],
[20210915165545__commit__COMPLETED], [20210915170102__commit__COMPLETED],
[20210915182112__commit__COMPLETED], [20210915182251__replacecommit__COMPLETED]]
21/09/15 18:51:08 INFO
org.apache.hudi.common.table.view.FileSystemViewManager: Creating InMemory
based view for basePath
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:08 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 156 ms to
read 1 instants, 3000 replaced file groups
21/09/15 18:51:08 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/09/15 18:51:08 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file
system view for partition (WMT-US/2020-11-01)
21/09/15 18:51:08 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (WMT-US/2020-11-01) =1501, Time taken =204
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups
for partition :WMT-US/2020-11-01, #FileGroups=1500
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView:
NumFiles=1501, NumFileGroups=1500, FileGroupsCreationTime=91, StoreTimeTaken=3
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-01) =340
21/09/15 18:51:09 INFO org.apache.hudi.hadoop.HoodieROTablePathFilter: Based
on hoodie metadata from base path:
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1, caching 1500 files
under
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/WMT-US/2020-11-01
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 117 ms to
read 1 instants, 3000 replaced file groups
21/09/15 18:51:09 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.FileSystemViewManager: Creating InMemory
based view for basePath
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 91 ms to
read 1 instants, 3000 replaced file groups
21/09/15 18:51:09 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file
system view for partition (WMT-US/2020-11-02)
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (WMT-US/2020-11-02) =1501, Time taken =177
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups
for partition :WMT-US/2020-11-02, #FileGroups=1500
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView:
NumFiles=1501, NumFileGroups=1500, FileGroupsCreationTime=63, StoreTimeTaken=1
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-02) =280
21/09/15 18:51:09 INFO org.apache.hudi.hadoop.HoodieROTablePathFilter: Based
on hoodie metadata from base path:
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1, caching 1500 files
under
gs://udp-hudi-storage3/store_visit_scan_hudi_spark_3_tgt_v1/WMT-US/2020-11-02
21/09/15 18:51:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 93 ms to
read 1 instants, 3000 replaced file groups
....
21/09/15 18:52:22 WARN
org.apache.spark.sql.execution.datasources.SharedInMemoryCache: Evicting cached
table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may
impact query planning performance.
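
To quantify the per-partition cost from a log like the one above, the "Time to load partition" entries can be extracted and summed. This is a rough sketch, assuming the log text keeps the same message format and hard line wraps as the snippet. The function name and the embedded sample are illustrative only, and the sample covers just the two partitions visible above.

```python
import re

# Matches the "Time to load partition (<name>) =<millis>" message,
# tolerating the line wrap between "load" and "partition" in the pasted log.
LOAD_RE = re.compile(r"Time to load\s+partition\s+\(([^)]+)\)\s*=\s*(\d+)")


def partition_load_times(log_text):
    """Return {partition: millis} for every 'Time to load partition' entry."""
    return {name: int(ms) for name, ms in LOAD_RE.findall(log_text)}


# Two entries copied from the log snippet above.
sample = """\
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-01) =340
21/09/15 18:51:09 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (WMT-US/2020-11-02) =280
"""

times = partition_load_times(sample)
total_ms = sum(times.values())
print(times)
print(f"{len(times)} partitions, {total_ms} ms total, "
      f"{total_ms / len(times):.0f} ms average")
```

Running the same function over the full driver log would show whether the load time is spread evenly across partitions or concentrated in a few of them.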
Looking forward to some pointers. Let me know if there is any analysis I can
do on my end that would help.