Volodymyr Burenin created HUDI-1697:
---------------------------------------
Summary: A parallel scan needed for FS.
Key: HUDI-1697
URL: https://issues.apache.org/jira/browse/HUDI-1697
Project: Apache Hudi
Issue Type: Improvement
Components: DeltaStreamer
Reporter: Volodymyr Burenin
I am running Hudi with GCS as a backend. It takes way too long to update the
file system view for several hundred partitions. I think it can be done in
parallel, so the process could be sped up significantly.
Here is a short excerpt from the logs showing the slow processing. The
full log is much longer, and the whole operation takes several minutes to complete.
```
21/03/16 20:02:56 INFO AbstractTableFileSystemView: #files found in partition (2020/05/12) =66, Time taken =45
21/03/16 20:02:56 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/05/12, #FileGroups=22
21/03/16 20:02:56 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=66, NumFileGroups=22, FileGroupsCreationTime=3, StoreTimeTaken=1
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/05/12) =76
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Took 1 ms to read 0 instants, 0 replaced file groups
21/03/16 20:02:56 INFO ClusteringUtils: Found 0 files in pending clustering operations
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Building file system view for partition (2020/03/25)
21/03/16 20:02:56 INFO AbstractTableFileSystemView: #files found in partition (2020/03/25) =36, Time taken =36
21/03/16 20:02:56 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/03/25, #FileGroups=12
21/03/16 20:02:56 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=36, NumFileGroups=12, FileGroupsCreationTime=1, StoreTimeTaken=1
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/03/25) =62
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
21/03/16 20:02:56 INFO ClusteringUtils: Found 0 files in pending clustering operations
21/03/16 20:02:56 INFO AbstractTableFileSystemView: Building file system view for partition (2020/10/15)
21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2020/10/15) =201, Time taken =100
21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/10/15, #FileGroups=128
21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=201, NumFileGroups=128, FileGroupsCreationTime=6, StoreTimeTaken=1
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2020/10/15) =148
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Building file system view for partition (2021/01/11)
21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2021/01/11) =311, Time taken =71
21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2021/01/11, #FileGroups=302
21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=311, NumFileGroups=302, FileGroupsCreationTime=9, StoreTimeTaken=1
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2021/01/11) =110
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Building file system view for partition (2019/07/08)
21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2019/07/08) =2, Time taken =40
21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2019/07/08, #FileGroups=1
21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=2, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=1
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2019/07/08) =63
21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
```
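To illustrate the idea, here is a minimal sketch of loading partitions concurrently with a fixed-size thread pool. It is not Hudi code: the class `ParallelPartitionScan`, the method `loadAll`, and the `fakeList` stand-in for a remote listing call are all hypothetical names made up for this example, and the sketch assumes that the per-partition work is independent and safe to run from several threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelPartitionScan {

    // Hypothetical helper (not a Hudi API): applies loadPartition to every
    // partition path on a fixed-size pool and returns results in input order.
    public static <R> Map<String, R> loadAll(List<String> partitions,
                                             Function<String, R> loadPartition,
                                             int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Submit one task per partition so the slow listings overlap.
            List<Future<R>> futures = new ArrayList<>();
            for (String partition : partitions) {
                Callable<R> task = () -> loadPartition.apply(partition);
                futures.add(pool.submit(task));
            }
            // Collect results; get() blocks until the matching task finishes.
            Map<String, R> results = new LinkedHashMap<>();
            for (int i = 0; i < partitions.size(); i++) {
                results.put(partitions.get(i), futures.get(i).get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Assumption: stands in for a remote file listing with noticeable latency.
    // It sleeps briefly and returns a made-up value (the path length).
    static Integer fakeList(String partition) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return partition.length();
    }

    public static void main(String[] args) throws Exception {
        List<String> partitions = Arrays.asList(
                "2020/05/12", "2020/03/25", "2020/10/15", "2021/01/11", "2019/07/08");
        Map<String, Integer> results =
                loadAll(partitions, ParallelPartitionScan::fakeList, 4);
        results.forEach((p, v) -> System.out.println(p + " -> " + v));
    }
}
```

With a pool like this, the total wall-clock time is bounded by the slowest listings rather than by their sum, which is where the gain would come from if the per-partition latency is dominated by storage round trips. Whether the real view-building code can be called this way depends on its internal state being safe for concurrent updates, which this sketch does not address.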