lamber-ken edited a comment on issue #828: Synchronizing to hive partition is incorrect
URL: https://github.com/apache/incubator-hudi/issues/828#issuecomment-564689247

@imperio-wxm, you need to set the value of `DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY()` to `false`.

### Why the first sync can't get the data

On the first sync, `lastCommitTimeSynced` is not yet present on the target table, so `HoodieHiveClient` fetches all partition paths via `FSUtils.getAllPartitionPaths`. When `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is `true`, that utility only matches the three-level layout `basePath + /*/*/*` (i.e. `yyyy/mm/dd`), but your partitions sit directly under `basePath + /yyyy-MM-dd`, so nothing is found. Setting `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` to `false` makes `HoodieHiveClient` list every partition folder instead; for details, see `FSUtils#getAllPartitionPaths`.

### Right example

```scala
import org.apache.spark.sql.SaveMode

val basePath = "/flink/hudi/hoodie_test"
val datas = List("{ \"key\": \"uuid\", \"event_time\": 1574297893836, \"part_date\": \"2019-11-12\"}")
val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))

df.write.format("hudi").
  option("hoodie.insert.shuffle.parallelism", "10").
  option("hoodie.upsert.shuffle.parallelism", "10").
  option("hoodie.delete.shuffle.parallelism", "10").
  option("hoodie.bulkinsert.shuffle.parallelism", "10").
  option("hoodie.datasource.hive_sync.enable", true).
  option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://xxxx:12326").
  option("hoodie.datasource.hive_sync.username", "dcadmin").
  option("hoodie.datasource.hive_sync.password", "dcadmin").
  option("hoodie.datasource.hive_sync.database", "default").
  option("hoodie.datasource.hive_sync.table", "hoodie_test").
  option("hoodie.datasource.hive_sync.assume_date_partitioning", false).
  option("hoodie.datasource.hive_sync.partition_fields", "part_date").
  option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.DayValueExtractor").
  option("hoodie.datasource.write.precombine.field", "event_time").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.partitionpath.field", "part_date").
  option("hoodie.table.name", "hoodie_test").
  mode(SaveMode.Append).
  save(basePath)
```

```java
package org.apache.hudi.hive;

import com.beust.jcommander.internal.Lists;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

import java.util.List;

public class DayValueExtractor implements PartitionValueExtractor {

  private transient DateTimeFormatter dtfOut;

  public DayValueExtractor() {
    this.dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
  }

  private DateTimeFormatter getDtfOut() {
    // lazily re-create the formatter after deserialization (it is transient)
    if (dtfOut == null) {
      dtfOut = DateTimeFormat.forPattern("yyyy-MM-dd");
    }
    return dtfOut;
  }

  @Override
  public List<String> extractPartitionValuesInPath(String partitionPath) {
    // partition path is expected to be in the format yyyy-MM-dd
    String[] splits = partitionPath.split("-");
    if (splits.length != 3) {
      throw new IllegalArgumentException(
          "Partition path " + partitionPath + " is not in the form yyyy-MM-dd");
    }
    // parse the three date components and re-print them with the output formatter
    int year = Integer.parseInt(splits[0]);
    int mm = Integer.parseInt(splits[1]);
    int dd = Integer.parseInt(splits[2]);
    DateTime dateTime = new DateTime(year, mm, dd, 0, 0);
    return Lists.newArrayList(getDtfOut().print(dateTime));
  }
}
```

### Result
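As an aside, the path-depth mismatch described above can be illustrated with a plain `java.nio` glob. This is just a standalone sketch of the idea, not Hudi's actual listing code, and the class name is made up:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Illustrative only: with assume_date_partitioning=true, sync effectively scans
// three directory levels (yyyy/mm/dd). A flat yyyy-MM-dd layout is one level deep,
// so the three-level pattern never matches it.
public class GlobDepthDemo {
  public static void main(String[] args) {
    PathMatcher dateGlob = FileSystems.getDefault().getPathMatcher("glob:*/*/*");
    // three-level yyyy/mm/dd layout matches
    System.out.println(dateGlob.matches(Paths.get("2019/11/12"))); // true
    // flat yyyy-MM-dd layout does not
    System.out.println(dateGlob.matches(Paths.get("2019-11-12"))); // false
  }
}
```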
