[
https://issues.apache.org/jira/browse/HUDI-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan updated HUDI-3068:
--------------------------------------
Description:
If a user runs hive sync only occasionally, and archival has kicked in and
trimmed some commits in the meantime, then any partitions that were added in
those commits and never updated afterwards will be missed by hive sync.
{code:java}
LOG.info("Last commit time synced is " + lastCommitTimeSynced.get() + ", Getting commits since then");
return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
    .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
{code}
This happens because, for recurrent syncs, we always fetch only the commits on
the active timeline after the last synced instant, read their commit metadata,
and collect the partitions written as part of those commits. Commits that have
already been archived are never considered.
We can add a new config to the hive sync tool to override this behavior:
--sync-all-partitions
When this config is set to true, we should ignore the last synced instant and
take the route below, which is what is done when syncing for the first time.
{code:java}
if (!lastCommitTimeSynced.isPresent()) {
  LOG.info("Last commit time synced is not known, listing all partitions in " + basePath + ",FS :" + fs);
  HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
  return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
}
{code}
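To make the gap concrete, here is a minimal, Hudi-free sketch of the selection logic described above. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the active timeline is modeled as a sorted map from instant time to the partitions written, instant times are assumed to compare lexicographically, and the `allPartitions` list stands in for what FSUtils.getAllPartitionPaths would return. It is not the actual hive sync implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical model of how hive sync could pick the partitions to sync. */
final class SyncAllPartitionsSketch {

  /**
   * @param activeTimeline       instant time -> partitions written, only for commits
   *                             that are still on the active (non-archived) timeline
   * @param allPartitions        every partition on storage (stand-in for a full listing)
   * @param lastCommitTimeSynced last instant synced to Hive, empty on the first sync
   * @param syncAllPartitions    the proposed override flag
   */
  static List<String> partitionsToSync(TreeMap<String, List<String>> activeTimeline,
                                       List<String> allPartitions,
                                       Optional<String> lastCommitTimeSynced,
                                       boolean syncAllPartitions) {
    // Full listing: first sync, or the override asks to ignore the last synced instant.
    if (syncAllPartitions || !lastCommitTimeSynced.isPresent()) {
      return new ArrayList<>(new TreeSet<>(allPartitions));
    }
    // Incremental route: only commits strictly after the last synced instant
    // that are still present on the active timeline.
    TreeSet<String> written = new TreeSet<>();
    activeTimeline.tailMap(lastCommitTimeSynced.get(), false)
        .values()
        .forEach(written::addAll);
    return new ArrayList<>(written);
  }

  public static void main(String[] args) {
    // Hive was last synced at instant 001. Instant 002 added partition 2021/12/02
    // but has since been archived, so it is absent from the active timeline.
    TreeMap<String, List<String>> active = new TreeMap<>();
    active.put("003", List.of("2021/12/03"));
    active.put("004", List.of("2021/12/03", "2021/12/04"));
    List<String> onStorage =
        List.of("2021/12/01", "2021/12/02", "2021/12/03", "2021/12/04");

    System.out.println(partitionsToSync(active, onStorage, Optional.of("001"), false));
    System.out.println(partitionsToSync(active, onStorage, Optional.of("001"), true));
  }
}
```

In this toy setup the incremental call never returns 2021/12/02, because the commit that created it is no longer on the active timeline, while the call with the override set returns all four partitions at the cost of a full listing.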
Ref issue:
https://github.com/apache/hudi/issues/3890
was:
If a user runs hive sync only occasionally, and archival has kicked in and
trimmed some commits in the meantime, then any partitions that were added in
those commits and never updated afterwards will be missed by hive sync.
```
return TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().getCommitsTimeline()
    .findInstantsAfter(lastCommitTimeSynced.get(), Integer.MAX_VALUE));
```
This happens because, for recurrent syncs, we always fetch only the commits on
the active timeline after the last synced instant, read their commit metadata,
and collect the partitions written as part of those commits.
We can add a new config to the hive sync tool to override this behavior:
--sync-all-partitions
When this config is set to true, we should ignore the last synced instant and
take the route below, which is what is done when syncing for the first time.
```
if (!lastCommitTimeSynced.isPresent()) {
  LOG.info("Last commit time synced is not known, listing all partitions in " + basePath + ",FS :" + fs);
  HoodieLocalEngineContext engineContext = new HoodieLocalEngineContext(metaClient.getHadoopConf());
  return FSUtils.getAllPartitionPaths(engineContext, basePath, useFileListingFromMetadata, assumeDatePartitioning);
}
```
> Add support to sync all partitions in hive sync tool
> ----------------------------------------------------
>
> Key: HUDI-3068
> URL: https://issues.apache.org/jira/browse/HUDI-3068
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Hive Integration
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: sev:critical