I have created https://issues.apache.org/jira/browse/HUDI-1232 to track a couple of issues.
One of my concerns is the following: I have a COPY_ON_WRITE table named TRR. Even though a write touches only a couple of partitions, I see the log lines below repeated for every individual partition, and the write takes up to 4 to 5 minutes. I have pasted only a few of the entries. I am worried that once the table holds 3 years' worth of data, every write will be slow even when it touches only a couple of partitions.

20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@fed0a8b
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/01, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=1
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/01
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@285c67a9
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/02, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=0, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/02
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]
20/08/27 02:08:22 INFO HoodieTableConfig: Loading dataset properties from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/.hoodie/hoodie.properties
20/08/27 02:08:22 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@2edd9c8
20/08/27 02:08:22 INFO HoodieTableFileSystemView: Adding file-groups for partition :20200714/03, #FileGroups=1
20/08/27 02:08:22 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=4, FileGroupsCreationTime=1, StoreTimeTaken=0
20/08/27 02:08:22 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr, caching 1 files under hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr/20200714/03
20/08/27 02:08:22 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr
20/08/27 02:08:22 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://oprhqanameservice], Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml], FileSystem: [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-778362260_1, ugi=svchdc36q@VISA.COM (auth:KERBEROS)]]]

It seems that the more partitions we have, the longer the path-filter listing takes. Could someone provide more insight on how to make this faster and keep it scalable as the number of partitions increases?

Thanks,
Selva
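In case it helps frame the question: what I am effectively after is a way to scope the listing to only the partitions a job actually touches, instead of every partition under the base path. A minimal sketch of that idea, assuming the table's yyyyMMdd/HH partition layout shown above; `build_partition_paths` is a hypothetical helper I wrote for illustration, not a Hudi API:

```python
# Hypothetical helper (not part of Hudi): build explicit partition paths so
# that a read can be pointed at just those directories, rather than letting
# the path filter walk every partition under the base path.
def build_partition_paths(base_path, partitions):
    """'partitions' are 'yyyyMMdd/HH' strings matching the table layout."""
    return ["{}/{}".format(base_path.rstrip("/"), p) for p in partitions]

base = "hdfs://oprhqanameservice/projects/cdp/data/cdp_reporting/trr"
paths = build_partition_paths(base, ["20200714/01", "20200714/02"])
for p in paths:
    print(p)
# Each resulting path could then be handed to a Spark/Hudi read (e.g. one
# spark.read.format("hudi").load(path) per partition directory), assuming
# that keeps the listing confined to those partitions.
```

This is only a sketch of the shape of the workaround I am imagining; I would still like to understand whether Hudi itself can avoid the full per-partition listing.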
