[GitHub] [hudi] matthiasdg opened a new issue, #6277: [SUPPORT] HiveSyncTool: missing partitions

GitBox Tue, 02 Aug 2022 02:42:47 -0700


matthiasdg opened a new issue, #6277:
URL: https://github.com/apache/hudi/issues/6277


   **Describe the problem you faced**
   
   We have some IoT data tables with a few thousand partitions each, typically 
laid out as `deviceId/year/month/day`.
   We do not sync to Hive on every commit, but at regular intervals.
   For one of these tables I added a few months of historic data for an 
additional set of devices, in contrast to the usual daily updates for the 
existing set. The subsequent Hive sync with HiveSyncTool must have gone wrong, 
because not all of these partitions are present in Hive. Unfortunately I do not 
have the logs, so I cannot tell whether the sync failed or passed silently 
without detecting some partitions (I suspect the latter). If I now run 
HiveSyncTool again, I just get e.g. `Last commit time synced is 
20220802000054258, Getting commits since then`, and that is exactly what it 
does: it picks up partitions added since that commit, but the ones that were 
missed earlier are never added.
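   
   To illustrate the behavior described above, here is a simplified Python 
model of a commit-based incremental sync. It is only a sketch of the observed 
behavior, not Hudi's actual implementation; the function name, the shortened 
commit times and the partition paths are all hypothetical.

```python
def incremental_sync(commits, hive_partitions, last_synced):
    """Register partitions from commits newer than last_synced.

    commits: dict mapping commit time (str) -> set of partition paths written.
    hive_partitions: set of partition paths already in Hive (mutated in place).
    Returns the new "last commit time synced" marker.
    """
    newer = sorted(t for t in commits if last_synced is None or t > last_synced)
    for commit_time in newer:
        hive_partitions.update(commits[commit_time])
    return newer[-1] if newer else last_synced


# Hypothetical commit timeline; the second commit is the historic backfill.
commits = {
    "20220801": {"devA/2022/08/01"},
    "20220802": {"devB/2022/05/01", "devB/2022/06/01"},
    "20220803": {"devA/2022/08/03"},
}

# State after a sync that missed devB/2022/06/01 but still advanced the marker.
hive = {"devA/2022/08/01", "devB/2022/05/01"}
marker = incremental_sync(commits, hive, last_synced="20220802")

# Only commits after the marker are inspected, so the gap is never revisited.
all_written = set().union(*commits.values())
still_missing = all_written - hive

# Dropping the table corresponds to starting with no marker and no partitions.
fresh = set()
incremental_sync(commits, fresh, last_synced=None)
```

   In this model the partition missed during the backfill stays missing after 
any number of incremental runs, while the from-scratch run registers everything.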
   
   My current workaround is to drop the Hive table and rerun HiveSyncTool from 
scratch. This adds all the partitions.
   
   Steps to reproduce the behavior:
   
   1. Have a dataset with a large number of partitions 
`deviceId/year/month/day` (`MultiPartKeysValueExtractor`) and sync it to Hive 
for the first time. All is fine, though it may take a long time.
   2. Add data for the existing devices (partitions for new months/days are 
added); syncing to Hive still works.
   3. Add a large amount of data for devices that were not in the set before 
and sync again -> in my case there are partitions for every new device, but 
many of the underlying date partitions are missing.
   4. Drop the Hive table and resync from scratch -> all partitions are there.
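   
   The gap between steps 3 and 4 can be made visible without dropping the 
table by comparing the partition paths on storage with the ones registered in 
Hive. The following is a minimal sketch: both lists are assumed to be supplied 
by the caller (however they are obtained), and the helper names and sample 
paths are hypothetical. Only the `deviceId/year/month/day` layout comes from 
the description above.

```python
PARTITION_FIELDS = ("deviceId", "year", "month", "day")


def parse_partition(path):
    """Split a relative partition path into a field -> value mapping."""
    values = path.strip("/").split("/")
    if len(values) != len(PARTITION_FIELDS):
        raise ValueError(f"unexpected partition path: {path!r}")
    return dict(zip(PARTITION_FIELDS, values))


def missing_partitions(storage_paths, hive_paths):
    """Return the partition paths present on storage but not in Hive, sorted."""
    on_storage = {p.strip("/") for p in storage_paths}
    in_hive = {p.strip("/") for p in hive_paths}
    return sorted(on_storage - in_hive)


# Hypothetical sample data: devB was backfilled, one month never reached Hive.
storage = ["devA/2022/08/01", "devB/2022/05/01", "devB/2022/06/01/"]
hive = ["devA/2022/08/01", "devB/2022/05/01"]

missing = missing_partitions(storage, hive)
missing_specs = [parse_partition(p) for p in missing]
```

   A non-empty `missing` list would confirm that the sync skipped partitions 
rather than the data being absent from the table.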
   
   **Expected behavior**
   I would expect either to get an error when partitions are not synced, so 
that the last commit time synced is not updated, or to have all partitions 
detected right away.
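   
   The first of these two options could look roughly like the sketch below: 
the marker only advances when every expected partition is registered. This is 
an illustration of the requested behavior, not existing HiveSyncTool code; 
`PartitionSyncError` and `advance_marker` are hypothetical names.

```python
class PartitionSyncError(RuntimeError):
    """Raised when a sync left some partitions unregistered."""


def advance_marker(expected, registered, last_synced, candidate):
    """Return the new last-synced commit time, or raise if partitions are missing.

    expected: partition paths written up to and including `candidate`.
    registered: partition paths present in Hive after the sync attempt.
    last_synced: the previous marker, reported in the error for context.
    """
    missing = sorted(set(expected) - set(registered))
    if missing:
        raise PartitionSyncError(
            f"{len(missing)} partition(s) not synced, keeping marker at "
            f"{last_synced}: {missing}"
        )
    return candidate


expected = ["devB/2022/05/01", "devB/2022/06/01"]

# Complete sync: the marker moves forward.
ok_marker = advance_marker(expected, expected, "20220801", "20220802")

# Incomplete sync: the caller gets an error and keeps the old marker.
try:
    advance_marker(expected, ["devB/2022/05/01"], "20220801", "20220802")
    failed = False
except PartitionSyncError:
    failed = True
```

   With such a check a later run would start again from the old marker and 
retry the partitions that were missed.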
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.1.2
   
   * Hive version : client side: 2.3.7 through hudi, standalone metastore 3.0
   
   * Hadoop version : 3.2.0
   
   * Storage (HDFS/S3/GCS..) : Azure Data Lake Gen 2
   
   * Running on Docker? (yes/no) : k8s
   


