sassai opened a new issue #1962:
URL: https://github.com/apache/hudi/issues/1962


   **Describe the problem you faced**
   
   I'm running a spark structured streaming application that reads data from 
kafka and saves it to a partitioned Hudi MERGE_ON_READ table. Hive sync is 
enabled and I'm able to query the table with the Hive CLI, e.g.:
   
   SELECT * FROM iot_device_ro LIMIT 5; 
   
   ```console
   
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   | iot_device_ro._hoodie_commit_time  | iot_device_ro._hoodie_commit_seqno  | 
     iot_device_ro._hoodie_record_key      |    
iot_device_ro._hoodie_partition_path     |          
iot_device_ro._hoodie_file_name           | iot_device_ro.deviceid  | 
iot_device_ro.sensorid  | iot_device_ro.measurement  | iot_device_ro.measure_ts 
 |          iot_device_ro.uuid           |  iot_device_ro.its   | 
iot_device_ro.year  | iot_device_ro.month  | iot_device_ro.day  | 
iot_device_ro.hour  | iot_device_ro.minute  |
   
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   | 20200813121124                     | 20200813121124_0_1                  | 
uuid:3d387a37-f288-456b-87b7-2b6865cf32e0  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4              | 1                       | 30.228266831690732         
| 2020-08-13T08:39:04.528Z  | 3d387a37-f288-456b-87b7-2b6865cf32e0  | 
2020-08-13 12:11:24  | 2020                | 8                    | 13          
       | 12                  | 11                    |
   | 20200813121124                     | 20200813121124_0_2                  | 
uuid:5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4              | 1                       | 31.453188991515226         
| 2020-08-13T08:39:19.588Z  | 5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | 
2020-08-13 12:11:24  | 2020                | 8                    | 13          
       | 12                  | 11                    |
   | 20200813121124                     | 20200813121124_0_3                  | 
uuid:6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4              | 1                       | 34.68735798194983          
| 2020-08-13T07:45:05.958Z  | 6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | 
2020-08-13 12:11:24  | 2020                | 8                    | 13          
       | 12                  | 11                    |
   | 20200813121124                     | 20200813121124_0_4                  | 
uuid:5c2dbea8-9668-4652-84c6-c82d06aa2805  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4              | 1                       | 33.680806905962264         
| 2020-08-12T13:33:20.159Z  | 5c2dbea8-9668-4652-84c6-c82d06aa2805  | 
2020-08-13 12:11:24  | 2020                | 8                    | 13          
       | 12                  | 11                    |
   | 20200813121124                     | 20200813121124_0_5                  | 
uuid:528e6c74-bb44-49da-aa76-059781cc7676  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4              | 1                       | 31.38529683936205          
| 2020-08-13T10:57:58.448Z  | 528e6c74-bb44-49da-aa76-059781cc7676  | 
2020-08-13 12:11:24  | 2020                | 8                    | 13          
       | 12                  | 11                    |
   
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   ```
   
   However if I want to apply a filter on the partition colum the result is 
empty:
   
   SELECT * FROM iot_device_ro WHERE day=13 LIMIT 10;
   
   ```console
   
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   | iot_device_ro._hoodie_commit_time  | iot_device_ro._hoodie_commit_seqno  | 
iot_device_ro._hoodie_record_key  | iot_device_ro._hoodie_partition_path  | 
iot_device_ro._hoodie_file_name  | iot_device_ro.deviceid  | 
iot_device_ro.sensorid  | iot_device_ro.measurement  | iot_device_ro.measure_ts 
 | iot_device_ro.uuid  | iot_device_ro.its  | iot_device_ro.year  | 
iot_device_ro.month  | iot_device_ro.day  | iot_device_ro.hour  | 
iot_device_ro.minute  |
   
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
   No rows selected (70.91 secon
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   Hudi Datasource Configuration in Spark:
   
   ```java
   Dataset<Row> output =
           streamingInput.withColumn("its", date_format(current_timestamp(), 
DATE_FORMAT));
       DataStreamWriter<Row> writer =
           output
               .writeStream()
               .format("hudi")
               .option("hoodie.insert.shuffle.parallelism", "2")
               .option("hoodie.upsert.shuffle.parallelism", "2")
               .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
               .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "uuid")
               .option(
                   DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(),
                   ComplexKeyGenerator.class.getCanonicalName())
               .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), 
partitions)
               .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "its")
               .option(HoodieWriteConfig.TABLE_NAME, tableName)
               .option("checkpointLocation", streamingCheckpointingPath)
               
.option(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH_OPT_KEY(), "false")
               .option(DataSourceWriteOptions.STREAMING_RETRY_CNT_OPT_KEY(), 
"10")
               .outputMode(OutputMode.Append())
               .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), tableName)
               .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), hiveDB)
               .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), hiveJdbcUrl)
               
.option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY(), "true")
               .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), 
"true")
               .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), 
partitions)
               .option(
                       
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(),
                       MultiPartKeysValueExtractor.class.getCanonicalName());
   ```
   
   Hudi Table Description in Hive:
   
   ```console
   +--------------------------+------------+----------+
   |         col_name         | data_type  | comment  |
   +--------------------------+------------+----------+
   | _hoodie_commit_time      | string     |          |
   | _hoodie_commit_seqno     | string     |          |
   | _hoodie_record_key       | string     |          |
   | _hoodie_partition_path   | string     |          |
   | _hoodie_file_name        | string     |          |
   | deviceid                 | string     |          |
   | sensorid                 | string     |          |
   | measurement              | double     |          |
   | measure_ts               | string     |          |
   | uuid                     | string     |          |
   | its                      | string     |          |
   | year                     | int        |          |
   | month                    | int        |          |
   | day                      | int        |          |
   | hour                     | int        |          |
   | minute                   | int        |          |
   |                          | NULL       | NULL     |
   | # Partition Information  | NULL       | NULL     |
   | # col_name               | data_type  | comment  |
   | year                     | int        |          |
   | month                    | int        |          |
   | day                      | int        |          |
   | hour                     | int        |          |
   | minute                   | int        |          |
   +--------------------------+------------+----------+
   ```
   
   Hive Partitions created by HudiHiveSync:
   
   ```console
   +---------------------------------------------+
   |                  partition                  |
   +---------------------------------------------+
   | year=2020/month=8/day=13/hour=12/minute=11  |
   | year=2020/month=8/day=13/hour=12/minute=17  |
   | year=2020/month=8/day=13/hour=12/minute=18  |
   | year=2020/month=8/day=13/hour=12/minute=19  |
   | year=2020/month=8/day=13/hour=12/minute=20  |
   | year=2020/month=8/day=13/hour=12/minute=21  |
   ``` 
   
   Additional table information:
   
   ```console
   
   Detailed Table Information
   --
   Database: | xxx|  
   OwnerType: | USER |  
   Owner: | xxx|  
   CreateTime: | Thu Aug 13 12:11:49 UTC 2020 |  
   LastAccessTime: | UNKNOWN |  
   Retention: | 0 |  
   Location: | 
abfs://[email protected]/data/hudi/streaming/tables/xxx/iot_device 
|  
   Table Type: | EXTERNAL_TABLE |  
   Table Parameters: |   |  
     | EXTERNAL | TRUE
     | bucketing_version | 2
     | discover.partitions | true
     | last_commit_time_sync | 20200813133453
     | numFiles | 77
     | numPartitions | 68
     | numRows | 0
     | rawDataSize | 0
     | spark.sql.create.version | 2.2 or prior
     | spark.sql.sources.schema.numPartCols | 5
     | spark.sql.sources.schema.numParts | 1
     | spark.sql.sources.schema.part.0 | 
{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"deviceid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sensorid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measurement\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measure_ts\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"uuid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"its\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"y
 
ear\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"hour\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"minute\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
     | spark.sql.sources.schema.partCol.0 | year
     | spark.sql.sources.schema.partCol.1 | month
     | spark.sql.sources.schema.partCol.2 | day
     | spark.sql.sources.schema.partCol.3 | hour
     | spark.sql.sources.schema.partCol.4 | minute
     | totalSize | 34194101
     | transient_lastDdlTime | 1597324063
     |   |  
   Storage Information
   SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe 
|  
   InputFormat: | org.apache.hudi.hadoop.HoodieParquetInputFormat |  
   OutputFormat: | 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |  
   Compressed: | No |  
   Num Buckets: | -1 |  
   Bucket Columns: | [] |  
   Sort Columns: | [] |  
   Storage Desc Params: |   |  
     | serialization.format | 1
   
   ```
   
   Strangely enough if I run the same query in Spark everything works as 
expected.
   
   **Expected behavior**
   
   Filter on partition columns can be applied on Hudi tables in Hive CLI.
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   
   * Spark version : 2.4.0
   
   * Hive version : 3.1
   
   * Hadoop version : 3
   
   * Storage (HDFS/S3/GCS..) : ADLS
   
   * Running on Docker? (yes/no) : no
   
   What am I missing? Any help is very much appreciated!
   
   Thank you in advance!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to