Hi Purushotham,
I am unable to reproduce the same partitions getting hive-synced when I run
this locally. Can you add the following log message to HoodieHiveClient.java,
run the code, and send us the logs?
diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
index 4578bb2f..ba4b1147 100644
--- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
+++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
@@ -237,6 +237,8 @@ public class HoodieHiveClient {
       if (!paths.containsKey(storageValue)) {
         events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
       } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
+        LOG.info("Partition Location changes. StorageVal=" + storageValue
+            + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
         events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
       }
     }
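For context, the check around that log line can be sketched in isolation as
below. This is only an illustration under assumptions: the class, enum, and
method names (PartitionEventSketch, EventType, classify) are made up for the
example and are not the HoodieHiveClient API, and the paths are sample values.
It shows why a partition ends up in the update list whenever the path stored
in Hive is not string-equal to the path computed from storage.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the add/update decision shown in the diff.
class PartitionEventSketch {
    enum EventType { ADD, UPDATE }

    static final class Event {
        final EventType type;
        final String partitionKey;

        Event(EventType type, String partitionKey) {
            this.type = type;
            this.partitionKey = partitionKey;
        }
    }

    // hivePaths:    partition key -> location currently registered in Hive
    // storagePaths: partition key -> full partition path computed from storage
    static List<Event> classify(Map<String, String> hivePaths,
                                Map<String, String> storagePaths) {
        List<Event> events = new ArrayList<>();
        for (Map.Entry<String, String> entry : storagePaths.entrySet()) {
            String existing = hivePaths.get(entry.getKey());
            if (existing == null) {
                // Partition is unknown to Hive: it has to be added.
                events.add(new Event(EventType.ADD, entry.getKey()));
            } else if (!existing.equals(entry.getValue())) {
                // Plain string comparison: any difference in how the same
                // location is written also lands in this branch.
                events.add(new Event(EventType.UPDATE, entry.getKey()));
            }
            // Identical paths produce no event at all.
        }
        return events;
    }

    public static void main(String[] args) {
        Map<String, String> hive = new LinkedHashMap<>();
        hive.put("2020/01/16", "/base/2020/01/16");
        hive.put("2020/01/17", "/old/2020/01/17");

        Map<String, String> storage = new LinkedHashMap<>();
        storage.put("2020/01/16", "/base/2020/01/16");
        storage.put("2020/01/17", "/base/2020/01/17");
        storage.put("2020/01/18", "/base/2020/01/18");

        for (Event e : classify(hive, storage)) {
            System.out.println(e.type + " " + e.partitionKey);
        }
    }
}
```

If the logs show the same partitions in the update branch on every run, the
two printed paths should tell us how they differ.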
Thanks,
Balaji.V
On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar
<[email protected]> wrote:
Hi,
I noticed that
*org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is
time-consuming when running HUDI on a set of records that contains data for
a large number of partitions. All it does is set the location for each
updated partition path, whereas
*org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()* takes care
of adding new partitions to the table.
1. For a given table whose base path doesn't change (usually it doesn't
   in production), why is *updatePartitionsToTable()* needed? Can you
   please shed some light on any case where it is required?
2. If it is required, can we do something to optimise the time consumed
   by this operation? Currently, the *Alter Statements* are executed one
   by one, once per (partition, path) pair, for every updated partition.
Regards,
Purushotham Pushpavanth