Re: updatePartitionsToTable() is time consuming and redundant.

Balaji Varadarajan Sun, 19 Jan 2020 19:29:52 -0800

Hi Purushotham,
I am unable to reproduce the same partitions getting hive-synced repeatedly in
my local setup. Can you add the following log message to HoodieHiveClient.java,
run the code, and send us the logs?
diff --git a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
index 4578bb2f..ba4b1147 100644
--- a/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
+++ b/hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
@@ -237,6 +237,8 @@ public class HoodieHiveClient {
         if (!paths.containsKey(storageValue)) {
           events.add(PartitionEvent.newPartitionAddEvent(storagePartition));
         } else if (!paths.get(storageValue).equals(fullStoragePartitionPath)) {
+          LOG.info("Partition Location changes. StorageVal=" + storageValue
+              + ", Existing Hive Path=" + paths.get(storageValue) + ", New Location=" + fullStoragePartitionPath);
           events.add(PartitionEvent.newPartitionUpdateEvent(storagePartition));
         }
       }
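
For reference, the branch being logged can be imitated outside Hudi. The following is only a minimal, self-contained sketch of the comparison in the diff above, not the actual HoodieHiveClient code: the class PartitionEventSketch, the Event type, the classify() method, and the sample paths are all hypothetical stand-ins. It shows that an update event is produced only when the path registered in Hive is not string-equal to the path computed from storage; the scheme mismatch in the sample data is just one made-up way that could happen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionEventSketch {

  enum EventType { ADD, UPDATE }

  // Hypothetical stand-in for Hudi's PartitionEvent; holds only what this sketch needs.
  static final class Event {
    final EventType type;
    final String partitionValue;

    Event(EventType type, String partitionValue) {
      this.type = type;
      this.partitionValue = partitionValue;
    }

    @Override
    public String toString() {
      return type + ":" + partitionValue;
    }
  }

  /**
   * hivePaths:    partition value -> location currently registered in Hive.
   * storagePaths: partition value -> full partition path found on storage.
   */
  static List<Event> classify(Map<String, String> hivePaths, Map<String, String> storagePaths) {
    List<Event> events = new ArrayList<>();
    for (Map.Entry<String, String> entry : storagePaths.entrySet()) {
      String storageValue = entry.getKey();
      String fullStoragePartitionPath = entry.getValue();
      if (!hivePaths.containsKey(storageValue)) {
        // Partition unknown to Hive: it has to be added.
        events.add(new Event(EventType.ADD, storageValue));
      } else if (!hivePaths.get(storageValue).equals(fullStoragePartitionPath)) {
        // Partition known, but the registered location differs as a plain string.
        events.add(new Event(EventType.UPDATE, storageValue));
      }
      // Otherwise the partition is already in sync and no event is emitted.
    }
    return events;
  }

  public static void main(String[] args) {
    Map<String, String> hive = new LinkedHashMap<>();
    hive.put("2020-01-17", "s3://bucket/tbl/2020-01-17");
    hive.put("2020-01-18", "s3a://bucket/tbl/2020-01-18");

    Map<String, String> storage = new LinkedHashMap<>();
    storage.put("2020-01-17", "s3://bucket/tbl/2020-01-17"); // identical: no event
    storage.put("2020-01-18", "s3://bucket/tbl/2020-01-18"); // scheme differs: UPDATE
    storage.put("2020-01-19", "s3://bucket/tbl/2020-01-19"); // not in Hive: ADD

    System.out.println(classify(hive, storage));
  }
}
```

If the new log line fires for partitions whose base path has not changed, comparing the two printed paths character by character should show what differs.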

Thanks,
Balaji.V
    On Friday, January 17, 2020, 03:44:08 AM PST, Purushotham Pushpavanthar 
<[email protected]> wrote:  
 
 Hi,

I noticed that
*org.apache.hudi.hive.HoodieHiveClient#updatePartitionsToTable()* is time
consuming when running Hudi on a set of records that contains data for a
large number of partitions. All it does is set the location for each
updated partition path. Adding new partitions to the table is already
handled by
*org.apache.hudi.hive.HoodieHiveClient#addPartitionsToTable()*.

  1. For a given table whose base path doesn't change (usually it doesn't
  in production), why is *updatePartitionsToTable()* needed? Can you
  please shed some light on any case where it is needed?
  2. If it is required, can we do something to optimise the time consumed
  by this operation? Currently, the *ALTER statements* are executed one by
  one, one per (partition, path) pair, for every updated partition.
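
To make the cost in point 2 concrete, here is a minimal sketch that builds one Hive `ALTER TABLE ... PARTITION ... SET LOCATION` statement per (partition, path) pair. It is an illustration only, not the code in HoodieHiveClient: the class AlterStatementSketch, the method buildSetLocationStatements(), the table name `trips`, and the partition field `datestr` are hypothetical. The number of statements, and so the number of round trips to Hive, grows linearly with the number of updated partitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AlterStatementSketch {

  /**
   * Builds one SET LOCATION statement per changed partition.
   * changed: partition value -> new partition location.
   */
  static List<String> buildSetLocationStatements(
      String table, String partitionField, Map<String, String> changed) {
    List<String> statements = new ArrayList<>();
    for (Map.Entry<String, String> entry : changed.entrySet()) {
      statements.add("ALTER TABLE " + table
          + " PARTITION (" + partitionField + "='" + entry.getKey() + "')"
          + " SET LOCATION '" + entry.getValue() + "'");
    }
    return statements;
  }

  public static void main(String[] args) {
    Map<String, String> changed = new LinkedHashMap<>();
    changed.put("2020-01-17", "s3://bucket/trips/2020-01-17");
    changed.put("2020-01-18", "s3://bucket/trips/2020-01-18");

    // Each statement would be sent to Hive separately, one after another.
    for (String sql : buildSetLocationStatements("trips", "datestr", changed)) {
      System.out.println(sql);
    }
  }
}
```

With N updated partitions this yields N statements, so skipping partitions whose location is unchanged is what keeps the list short.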



Regards,
Purushotham Pushpavanthar
