blrnw3 opened a new issue, #7422:
URL: https://github.com/apache/hudi/issues/7422

   **Describe the problem you faced**
   
   The Hive metastore location field does not get updated.
   I created a COW table by writing out a DataFrame in Spark. I then performed additional upsert writes to the same table under a _different path_, but the location attribute in the Hive metastore was not updated, even though other attributes were. As a result, reading the table does not include the new data.
   I assume this is a bug, but I am not sure whether writing the same table to a different path is simply not intended to be supported.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ```
   from pyspark.sql import Row  # needed for Row(...) below
   
   write_options = {
     'hoodie.table.name': 'table1',
     'hoodie.datasource.write.table.name': 'table1',
     'hoodie.datasource.write.operation': 'upsert',
     'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
     'hoodie.datasource.write.partitionpath.field': '',
     'hoodie.datasource.write.recordkey.field': 'k1',
     'hoodie.datasource.write.precombine.field': 'x',
     'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
     'hoodie.datasource.write.hive_style_partitioning': 'true',
     'hoodie.index.type': 'SIMPLE',
     'hoodie.datasource.hive_sync.enable': 'true',
     'hoodie.datasource.hive_sync.table': 'table1',
     'hoodie.datasource.hive_sync.database': 'db1',
     'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://10.1.2.3:10000',
     'hoodie.datasource.hive_sync.support_timestamp': 'true',
     'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
   }
   
   df1 = spark.createDataFrame([Row(a=10, k1="a1", x=1)])
   
   df1.write.format("hudi").options(**write_options).mode("append").save("s3://my-bucket/path1")
   
   # The table now exists in Hive and can be queried:
   spark.sql("select count(*) from db1.table1").collect()
   > [Row(count(1)=1)]
   
   # Write some more data (notice changed path)
   df2 = spark.createDataFrame([Row(a=8, k1="a2", x=1)])
   
   df2.write.format("hudi").options(**write_options).mode("append").save("s3://my-bucket/path2")
   
   # Read the table again; only the old data is returned
   spark.sql("select count(*) from db1.table1").collect()
   > [Row(count(1)=1)]
   # Check metastore
   spark.sql("describe table extended 
db1.table1").filter("col_name='Location'").collect()
   > [Row(col_name='Location', data_type='s3://my-bucket/path1', comment='')]
   ```
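   As a diagnostic (a sketch only; I have not isolated whether this is metastore-only), loading the second path directly bypasses the metastore entry, so if it returns the new row, the write itself succeeded and only the location update is missing:
   
   ```
   # Hypothetical check: read the Hudi table by base path instead of via the metastore.
   spark.read.format("hudi").load("s3://my-bucket/path2").count()
   ```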
   
   **Expected behavior**
   The final read should return a count of 2, and the location should be `s3://my-bucket/path2`.
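   
   In the style of the repro above, the expected output would be:
   
   ```
   spark.sql("select count(*) from db1.table1").collect()
   > [Row(count(1)=2)]
   spark.sql("describe table extended db1.table1").filter("col_name='Location'").collect()
   > [Row(col_name='Location', data_type='s3://my-bucket/path2', comment='')]
   ```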
   
   
   **Environment Description**
   AWS EMR 6.5.0 using Glue as the Hive metastore
   
   * Hudi version : 0.9.0-amzn-1
   
   * Spark version : 3.1.2
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no): no
   
   
   **Additional context**
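   A possible manual workaround (a sketch only; I have not confirmed it interacts safely with Hudi's own hive_sync, and the statement below is standard Spark SQL / HiveQL, not a Hudi API):
   
   ```
   # Hypothetical fix: repoint the metastore entry at the new base path.
   spark.sql("ALTER TABLE db1.table1 SET LOCATION 's3://my-bucket/path2'")
   ```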
   
   **Stacktrace**
   n/a
   
   

