[GitHub] [hudi] fisser001 opened a new issue #4887: [SUPPORT] Unexpected behaviour with partitioned hudi tables with impala as query engine

GitBox Wed, 23 Feb 2022 07:52:26 -0800


fisser001 opened a new issue #4887:
URL: https://github.com/apache/hudi/issues/4887



   **Describe the problem you faced**
   
   We see unexpected behaviour with partitioned Hudi tables when we query 
those tables with Impala: no data is returned, although Spark and Hive read 
the same table without problems.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. We write data with Hudi and Spark to HDFS with the following config:
   `// Imports assumed; they were not part of the original report.
   // `spark` is assumed to be an active SparkSession (e.g. in spark-shell).
   import org.apache.spark.sql.Row
   import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
   import org.apache.hudi.DataSourceWriteOptions._
   
   val inputSchema = StructType(
     List(
       StructField("id", StringType, false),
       StructField("attribute", StringType, false),
       StructField("p_year", IntegerType, false),
       StructField("p_month", IntegerType, false),
       StructField("sequence", IntegerType, false)
     )
   )
   
   val initialData = Seq[Row](
     Row("1", "abc", 2019, 1, 1),
     Row("2", "def", 2018, 2, 2),
     Row("3", "ghi", 2018, 3, 3)
   )
   
   val initialDataFrame =
     spark.createDataFrame(spark.sparkContext.parallelize(initialData), inputSchema)
   
   initialDataFrame.write.format("hudi")
     .option(TABLE_NAME.key(), "test")
     .option(RECORDKEY_FIELD.key(), "id")
     .option(PRECOMBINE_FIELD.key(), "sequence")
     .option("hoodie.table.name", "test")
     .option(OPERATION.key(), "insert_overwrite")
     // Two partition columns, hence the complex key generator
     .option(PARTITIONPATH_FIELD.key(), "p_year,p_month")
     .option(KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option(HIVE_STYLE_PARTITIONING.key(), "true")
     // Sync the table to the Hive Metastore as an external table
     .option(HIVE_SYNC_ENABLED.key(), true)
     .option(HIVE_SYNC_MODE.key(), "HMS")
     .option(HIVE_DATABASE.key(), "db_abc_raw")
     .option(HIVE_TABLE.key(), "test")
     .option(HIVE_CREATE_MANAGED_TABLE.key(), false)
     .mode("append")
     .save("hdfs:///datalake/abc/raw/abc2/abc3/abc4") `
   
   2. After the code has finished, the data is written to HDFS and a Hudi 
table is created in the Hive Metastore.
   3. It is now possible to read the data with Spark and also with Hive.
   4. However, when we try to read the data with Impala, no data is shown.
   5. So we execute the following query in order to recover the partitions. 
Result: "Partitions have been recovered.":
   `ALTER TABLE db_abc_raw.test RECOVER PARTITIONS;`
   6. We then execute the following query:
   `SHOW PARTITIONS db_abc_raw.test;`
   Result: (Please see attachment)
   7. We were able to query the Hudi table with Hive (Tez). No problems; the 
data is displayed.
   8. We were also able to read the data with Spark and Hudi 
(`.format("hudi")`). No problems here either; the data could be read.
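
For reference, with `HIVE_STYLE_PARTITIONING` enabled and `p_year,p_month` as partition fields, the three sample rows should end up in three partition directories. The following is only a minimal sketch in plain Scala (no Spark or Hudi involved; `PartitionPathSketch`, `Record` and `hiveStylePath` are hypothetical names used for illustration) of the partition paths we would expect to see listed:

```scala
// Hypothetical helper, not part of Hudi: derives the Hive-style partition
// paths for the (p_year, p_month) values of the sample rows from step 1.
object PartitionPathSketch {
  final case class Record(id: String, pYear: Int, pMonth: Int)

  // Hive-style partitioning renders each partition field as "name=value",
  // joined with "/" in the order the fields are configured.
  def hiveStylePath(r: Record): String =
    s"p_year=${r.pYear}/p_month=${r.pMonth}"

  // Same ids and partition values as the rows written in step 1.
  val records: Seq[Record] = Seq(
    Record("1", 2019, 1),
    Record("2", 2018, 2),
    Record("3", 2018, 3)
  )

  // One entry per distinct partition, sorted for stable output.
  val expectedPartitions: Seq[String] =
    records.map(hiveStylePath).distinct.sorted

  def main(args: Array[String]): Unit =
    expectedPartitions.foreach(println)
}
```

Comparing this list with the output of `SHOW PARTITIONS` and with the directories under the table base path may help to tell whether partitions are missing or only the data is not returned.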
   
   **Expected behavior**
   
   It should be possible to query the table with Impala and get the data 
back, as with Spark and Hive.
   
   **Environment Description**
   
   * Hudi version :
   0.10.0 + 0.10.1
   
   * Spark version :
   3.1.1
   
   * Hive version :
   3.1.3000
   
   * Hadoop version :
   3.1.1
   
   * Storage (HDFS/S3/GCS..) :
   HDFS
   
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   - Could be connected with https://github.com/apache/hudi/issues/4830 ?
   
   **Stacktrace**
   
   No stacktrace available.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.