lamber-ken edited a comment on issue #828: Synchronizing to hive partition is incorrect
URL: https://github.com/apache/incubator-hudi/issues/828#issuecomment-564689247

@imperio-wxm, you need to set the value of `DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY()` to `false`.

### Why the first sync can't get the data

On the first sync, `lastCommitTimeSynced` is not yet present on the target table, so `HoodieHiveClient` fetches all partition paths via `FSUtils.getAllPartitionPaths`. When `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is `true`, that utility only matches the three-level layout `basePath + /*/*/*` (i.e. `yyyy/mm/dd`), but your partitions sit directly under `basePath + /yyyy-MM-dd`, so nothing is found. Setting `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` to `false` makes `HoodieHiveClient` list every partition folder instead; for details, see `FSUtils#getAllPartitionPaths`.

### Right example

```scala
import org.apache.spark.sql.SaveMode

val basePath = "/flink/hudi/hoodie_test"
val datas = List("{ \"key\": \"uuid\", \"event_time\": 1574297893836, \"part_date\": \"2019-11-12\"}")
val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))

df.write.format("hudi").
  option("hoodie.insert.shuffle.parallelism", "10").
  option("hoodie.upsert.shuffle.parallelism", "10").
  option("hoodie.delete.shuffle.parallelism", "10").
  option("hoodie.bulkinsert.shuffle.parallelism", "10").
  option("hoodie.datasource.hive_sync.enable", true).
  option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://xxxx:12326").
  option("hoodie.datasource.hive_sync.username", "dcadmin").
  option("hoodie.datasource.hive_sync.password", "dcadmin").
  option("hoodie.datasource.hive_sync.database", "default").
  option("hoodie.datasource.hive_sync.table", "hoodie_test").
  option("hoodie.datasource.hive_sync.assume_date_partitioning", false).
  option("hoodie.datasource.hive_sync.partition_fields", "part_date").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.DayValueExtractor").
  option("hoodie.datasource.write.precombine.field", "event_time").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.partitionpath.field", "part_date").
  option("hoodie.table.name", "hoodie_test").
  mode(SaveMode.Append).
  save(basePath)
```

```java
package org.apache.hudi.hive;

import com.beust.jcommander.internal.Lists;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.util.List;

public class DayValueExtractor implements PartitionValueExtractor {

  private transient DateTimeFormatter dtfOut;

  public DayValueExtractor() {
    this.dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
  }

  private DateTimeFormatter getDtfOut() {
    // lazily re-create the formatter after deserialization (it is transient)
    if (dtfOut == null) {
      dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
    }
    return dtfOut;
  }

  @Override
  public List<String> extractPartitionValuesInPath(String partitionPath) {
    // partition path is expected to be in the format yyyy-MM-dd
    String[] splits = partitionPath.split("-");
    if (splits.length != 3) {
      throw new IllegalArgumentException(
          "Partition path " + partitionPath + " is not in the form yyyy-MM-dd");
    }
    // parse the three date components and re-print them with the output formatter
    int year = Integer.parseInt(splits[0]);
    int mm = Integer.parseInt(splits[1]);
    int dd = Integer.parseInt(splits[2]);
    DateTime dateTime = new DateTime(year, mm, dd, 0, 0);
    return Lists.newArrayList(getDtfOut().print(dateTime));
  }
}
```

### Result
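As an aside, the path-depth mismatch described above can be illustrated with a plain `java.nio` glob. This is just a standalone sketch of the idea, not Hudi's actual listing code, and the class name is made up:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Illustrative only: with assume_date_partitioning=true, sync effectively scans
// three directory levels (yyyy/mm/dd). A flat yyyy-MM-dd layout is one level deep,
// so the three-level pattern never matches it.
public class GlobDepthDemo {
  public static void main(String[] args) {
    PathMatcher dateGlob = FileSystems.getDefault().getPathMatcher("glob:*/*/*");
    // three-level yyyy/mm/dd layout matches
    System.out.println(dateGlob.matches(Paths.get("2019/11/12"))); // true
    // flat yyyy-MM-dd layout does not
    System.out.println(dateGlob.matches(Paths.get("2019-11-12"))); // false
  }
}
```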
