[GitHub] [incubator-hudi] imperio-wxm opened a new issue #828: Synchronizing to hive partition is incorrect

GitBox Thu, 08 Aug 2019 20:32:47 -0700

imperio-wxm opened a new issue #828: Synchronizing to hive partition is 
incorrect
URL: https://github.com/apache/incubator-hudi/issues/828
 
 
   spark 2.4.0.cloudera1
   hadoop 2.6.0-cdh5.11.1
   hive 1.1.0-cdh5.11.1
   hudi 0.4.7
   
   > I selected some data from a Hive table, wrote it to a new table with Hudi, and then synced it to Hive.
   
   # My Code
   
   ```java
   Dataset<Row> hiveQuery = spark.sql(
        "select timestamp,key,name,part_date from dw.xxxxx where part_date='2019-08-02' limit 10");
   
   hiveQuery.write()
        .format("com.uber.hoodie")
        .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY(), true)
        .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), "jdbc:hive2://xxxx:10000")
        .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), "dw")
        .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), true)
        .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), "hoodie_test")
        .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), "part_date")
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "key")
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "part_date")
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
        .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
        .mode(SaveMode.Append)
        .save("/wxm/hudi/data/hoodie_test");
   ```
   
   **The job succeeds; however, I found some problems with the Hive partitions in the new table.**
   
   ## 1. The partition path is incorrect.
   
   If the data were migrated with Hive's `insert ... select` syntax, the partition path would look like this:
   
   ```java
   // insert overwrite table new_table partition(part_date) select xxx from old_table
   /wxm/hudi/data/hoodie_test/part_date=2019-08-02
   ```
   The path produced by the code above is: `/wxm/hudi/data/hoodie_test/2019-08-02`
   
   A Hive partition path should have the form `key=value`, but the path written by Hudi is missing the `part_date=` prefix.
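   
   To make the difference concrete, here is a minimal sketch of the two layouts. It is not Hudi code: the class `PartitionPathSketch` and its methods are hypothetical names introduced only for illustration, and the values are taken from the example above.
   
   ```java
   import java.util.Optional;

   // Hypothetical helper for illustration only; not part of Hudi or Hive.
   public class PartitionPathSketch {

       // Hive-style layout: <base>/<key>=<value>
       public static String hiveStylePath(String basePath, String key, String value) {
           return basePath + "/" + key + "=" + value;
       }

       // Value-only layout, as observed in this report: <base>/<value>
       public static String valueOnlyPath(String basePath, String value) {
           return basePath + "/" + value;
       }

       // Returns the partition key when the directory name has the form key=value,
       // or an empty Optional when no key can be recovered from the name.
       public static Optional<String> partitionKey(String dirName) {
           int idx = dirName.indexOf('=');
           return idx > 0 ? Optional.of(dirName.substring(0, idx)) : Optional.empty();
       }

       public static void main(String[] args) {
           String base = "/wxm/hudi/data/hoodie_test";
           System.out.println(hiveStylePath(base, "part_date", "2019-08-02"));
           System.out.println(valueOnlyPath(base, "2019-08-02"));
           System.out.println(partitionKey("part_date=2019-08-02").isPresent()); // true
           System.out.println(partitionKey("2019-08-02").isPresent());           // false
       }
   }
   ```
   
   The directory name `2019-08-02` carries no key on its own, so anything that reads the layout has to be told separately that the value belongs to `part_date`.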
   
   ## 2. Hive table has no partition
   
   When I run `show partitions table`, no partitions are listed. I think that if Hive partition fields are configured, Hudi should add the partitions automatically. Because they are missing, queries return no data.
   
   ```java
   hive> show partitions xxx;
   OK
   Time taken: 0.317 seconds
   
   hive> select * from xxx limit 10;
   OK
   Time taken: 0.451 seconds
   ```
   
   ## Manual workaround to query the data
   
   I then manually added the partition with `alter table add partition(part_date='2019-08-02')` and copied the files generated by Hudi into the partition directory with `hadoop fs -cp /wxm/hudi/data/hoodie_test/2019-08-02/* /wxm/hudi/data/hoodie_test/part_date=2019-08-02/`
   
   After that, I can select the data.

