[
https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar reopened HUDI-1479:
----------------------------------
> Replace FSUtils.getAllPartitionPaths() with
> HoodieTableMetadata#getAllPartitionPaths()
> --------------------------------------------------------------------------------------
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Code Cleanup
> Reporter: Vinoth Chandar
> Assignee: Udit Mehrotra
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
>
>
> *Change #1*
> {code:java}
> public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr,
>     boolean useFileListingFromMetadata, boolean verifyListings,
>     boolean assumeDatePartitioning) throws IOException {
>   if (assumeDatePartitioning) {
>     return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
>   } else {
>     HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr,
>         "/tmp/", useFileListingFromMetadata, verifyListings, false, false);
>     return tableMetadata.getAllPartitionPaths();
>   }
> }
> {code}
> The above is the current implementation, where `HoodieTableMetadata.create()`
> always creates a `HoodieBackedTableMetadata`. Instead, we should create a
> `FileSystemBackedTableMetadata` whenever useFileListingFromMetadata==false.
> This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
>
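To make Change #1 concrete, here is a minimal sketch of a factory that picks the listing implementation from the flag. Every type name below (`PartitionLister`, `MetadataTableLister`, `FileSystemLister`, `PartitionListerFactory`) is a hypothetical stand-in invented for illustration; these are not the real Hudi classes or signatures, and the listing bodies are placeholders.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the table metadata abstraction (not the Hudi API).
interface PartitionLister {
    List<String> getAllPartitionPaths();
}

// Stand-in for a listing served from the metadata table.
final class MetadataTableLister implements PartitionLister {
    @Override
    public List<String> getAllPartitionPaths() {
        // Placeholder: a real implementation would read the metadata table.
        return Collections.emptyList();
    }
}

// Stand-in for a listing done directly against the file system.
final class FileSystemLister implements PartitionLister {
    @Override
    public List<String> getAllPartitionPaths() {
        // Placeholder: a real implementation would list the base path.
        return Collections.emptyList();
    }
}

final class PartitionListerFactory {
    private PartitionListerFactory() {
    }

    // Only build the metadata-table-backed lister when the caller asked for it;
    // otherwise fall back to the plain file-system-backed lister.
    static PartitionLister create(boolean useFileListingFromMetadata) {
        return useFileListingFromMetadata
            ? new MetadataTableLister()
            : new FileSystemLister();
    }
}
```

With this shape, callers such as getAllPartitionPaths() would not need to know which implementation they received.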
> *Change #2*
> On master, we have the `HoodieEngineContext` abstraction, which allows for
> parallel execution. We should consider moving it to `hudi-common` (it's
> doable) and then reworking `FileSystemBackedTableMetadata` so that it can
> do parallelized listings using the passed-in engine context, either
> HoodieSparkEngineContext or HoodieJavaEngineContext.
> HoodieBackedTableMetadata#getPartitionsToFilesMapping has some parallelized
> code. We should take one pass and see if that can be reworked a bit as well.
> Food for thought:
> https://github.com/apache/hudi/pull/2398#discussion_r550711216
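As a rough illustration of Change #2, the sketch below shows how a listing routine could delegate parallelism to whatever engine is passed in. `EngineContextSketch`, `LocalEngineContextSketch`, and `ParallelListingSketch` are hypothetical names for this example only and do not mirror the actual `HoodieEngineContext` interface; the per-partition listing function is supplied by the caller so the example stays free of file system dependencies.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine abstraction: applies a function to every input element.
interface EngineContextSketch {
    <I, O> List<O> map(List<I> data, Function<I, O> fn);
}

// Local implementation backed by Java parallel streams. Collecting an ordered
// parallel stream to a list keeps the results in input order.
final class LocalEngineContextSketch implements EngineContextSketch {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> fn) {
        return data.parallelStream().map(fn).collect(Collectors.toList());
    }
}

final class ParallelListingSketch {
    private ParallelListingSketch() {
    }

    // Lists the files of each partition through the engine, then pairs each
    // partition with its result by position.
    static Map<String, List<String>> listFiles(
            EngineContextSketch engine,
            List<String> partitions,
            Function<String, List<String>> listPartition) {
        List<List<String>> results = engine.map(partitions, listPartition);
        Map<String, List<String>> filesByPartition = new LinkedHashMap<>();
        for (int i = 0; i < partitions.size(); i++) {
            filesByPartition.put(partitions.get(i), results.get(i));
        }
        return filesByPartition;
    }
}
```

A Spark-backed implementation of the same interface could distribute the `map` call across executors without changing `listFiles`.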
>
> *Change #3*
> There are places where we call fs.listStatus() directly. We should make them
> go through the HoodieTable.getMetadata()... route as well. Essentially, all
> listing should be concentrated in `FileSystemBackedTableMetadata`.
> !image-2021-01-05-10-00-35-187.png!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)