[ https://issues.apache.org/jira/browse/HUDI-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1479:
---------------------------------
Description:
*Change #1*
{code:java}
public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr,
    boolean useFileListingFromMetadata, boolean verifyListings,
    boolean assumeDatePartitioning) throws IOException {
  if (assumeDatePartitioning) {
    return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
  } else {
    HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/",
        useFileListingFromMetadata, verifyListings, false, false);
    return tableMetadata.getAllPartitionPaths();
  }
}
{code}
The above is the current implementation, in which `HoodieTableMetadata.create()`
always creates a `HoodieBackedTableMetadata`. Instead, we should create a
`FileSystemBackedTableMetadata` when useFileListingFromMetadata==false.
This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
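A minimal sketch of the selection logic proposed in Change #1. The types below (`TableMetadataSketch`, `MetadataTableBackedSketch`, `FileSystemBackedSketch`, `TableMetadataFactorySketch`) are hypothetical stand-ins and are not the real Hudi classes or signatures; the partition lists are passed in only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for HoodieTableMetadata; not the real Hudi interface.
interface TableMetadataSketch {
    List<String> getAllPartitionPaths();
}

// Stand-in for a metadata-table-backed implementation (HoodieBackedTableMetadata in the issue).
final class MetadataTableBackedSketch implements TableMetadataSketch {
    private final List<String> indexedPartitions;

    MetadataTableBackedSketch(List<String> indexedPartitions) {
        this.indexedPartitions = indexedPartitions;
    }

    @Override
    public List<String> getAllPartitionPaths() {
        return indexedPartitions;
    }
}

// Stand-in for a direct file-system listing implementation (FileSystemBackedTableMetadata in the issue).
final class FileSystemBackedSketch implements TableMetadataSketch {
    private final List<String> listedPartitions;

    FileSystemBackedSketch(List<String> listedPartitions) {
        this.listedPartitions = listedPartitions;
    }

    @Override
    public List<String> getAllPartitionPaths() {
        return listedPartitions;
    }
}

final class TableMetadataFactorySketch {
    // Pick the implementation from the flag instead of always building the metadata-backed one.
    static TableMetadataSketch create(boolean useFileListingFromMetadata,
                                      List<String> indexedPartitions,
                                      List<String> listedPartitions) {
        return useFileListingFromMetadata
            ? new MetadataTableBackedSketch(indexedPartitions)
            : new FileSystemBackedSketch(listedPartitions);
    }

    public static void main(String[] args) {
        TableMetadataSketch metadata = create(false,
            Collections.<String>emptyList(),
            Arrays.asList("2021/01/04", "2021/01/05"));
        System.out.println(metadata.getClass().getSimpleName() + " -> " + metadata.getAllPartitionPaths());
    }
}
```

With a factory shaped like this, callers such as getAllPartitionPaths() would not need to know which implementation serves the listing.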
*Change #2*
On master, we have the `HoodieEngineContext` abstraction, which allows for
parallel execution. We should consider moving it to `hudi-common` (it's doable)
and then rework `FileSystemBackedTableMetadata` so that it can do
parallelized listings using the passed-in engine context, either
`HoodieSparkEngineContext` or `HoodieJavaEngineContext`.
`HoodieBackedTableMetadata#getPartitionsToFilesMapping` has some parallelized
code. We should take a pass over it and see whether it can be reworked as well.
Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216
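A rough sketch of what parallelized listing through a passed-in engine context could look like. `EngineContextSketch` is a hypothetical stand-in for `HoodieEngineContext` (its `map` signature is an assumption made for illustration), `LocalEngineContextSketch` uses Java parallel streams as the backend, and `java.nio.file` replaces the Hadoop `FileSystem` so the example runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for HoodieEngineContext; the method shape is an assumption.
interface EngineContextSketch {
    <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Local backend built on parallel streams; a Spark-backed context would distribute the work instead.
final class LocalEngineContextSketch implements EngineContextSketch {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
        // The parallelism hint is ignored here; the common fork-join pool decides.
        // Collecting a parallel stream of a List keeps the input order.
        return data.parallelStream().map(func).collect(Collectors.toList());
    }
}

final class ParallelListingSketch {
    // List the regular files directly under one partition directory.
    static List<Path> listFiles(Path partition) {
        try (Stream<Path> entries = Files.list(partition)) {
            return entries.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fan the per-partition listings out through whichever engine context is passed in.
    static Map<Path, List<Path>> listPartitions(EngineContextSketch context, List<Path> partitions) {
        List<List<Path>> listings =
            context.map(partitions, ParallelListingSketch::listFiles, Math.max(1, partitions.size()));
        Map<Path, List<Path>> result = new LinkedHashMap<>();
        for (int i = 0; i < partitions.size(); i++) {
            result.put(partitions.get(i), listings.get(i));
        }
        return result;
    }
}
```

Because the listing code depends only on the context interface, the same method could serve both engines without engine-specific branches.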
*Change #3*
There are places where we call fs.listStatus() directly. We should make them
go through the HoodieTable.getMetadata()... route as well. Essentially, all
listing should be concentrated in `FileSystemBackedTableMetadata`.
!image-2021-01-05-10-00-35-187.png!
> Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()
> --------------------------------------------------------------------------------------
>
> Key: HUDI-1479
> URL: https://issues.apache.org/jira/browse/HUDI-1479
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Code Cleanup
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Blocker
> Fix For: 0.7.0
>
> Attachments: image-2021-01-05-10-00-35-187.png
--
This message was sent by Atlassian Jira
(v8.3.4#803005)