sassai opened a new issue #1962:
URL: https://github.com/apache/hudi/issues/1962
**Describe the problem you faced**
I'm running a Spark Structured Streaming application that reads data from
Kafka and writes it to a partitioned Hudi MERGE_ON_READ table. Hive sync is
enabled and I'm able to query the table with the Hive CLI, e.g.:
SELECT * FROM iot_device_ro LIMIT 5;
```console
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| iot_device_ro._hoodie_commit_time | iot_device_ro._hoodie_commit_seqno | iot_device_ro._hoodie_record_key | iot_device_ro._hoodie_partition_path | iot_device_ro._hoodie_file_name | iot_device_ro.deviceid | iot_device_ro.sensorid | iot_device_ro.measurement | iot_device_ro.measure_ts | iot_device_ro.uuid | iot_device_ro.its | iot_device_ro.year | iot_device_ro.month | iot_device_ro.day | iot_device_ro.hour | iot_device_ro.minute |
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| 20200813121124 | 20200813121124_0_1 | uuid:3d387a37-f288-456b-87b7-2b6865cf32e0 | year=2020/month=8/day=13/hour=12/minute=11 | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4 | 1 | 30.228266831690732 | 2020-08-13T08:39:04.528Z | 3d387a37-f288-456b-87b7-2b6865cf32e0 | 2020-08-13 12:11:24 | 2020 | 8 | 13 | 12 | 11 |
| 20200813121124 | 20200813121124_0_2 | uuid:5bed809e-758f-46dc-b1ab-837ad3eb5a6a | year=2020/month=8/day=13/hour=12/minute=11 | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4 | 1 | 31.453188991515226 | 2020-08-13T08:39:19.588Z | 5bed809e-758f-46dc-b1ab-837ad3eb5a6a | 2020-08-13 12:11:24 | 2020 | 8 | 13 | 12 | 11 |
| 20200813121124 | 20200813121124_0_3 | uuid:6d37be34-6e4b-49b0-b3fe-e6552c2aee22 | year=2020/month=8/day=13/hour=12/minute=11 | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4 | 1 | 34.68735798194983 | 2020-08-13T07:45:05.958Z | 6d37be34-6e4b-49b0-b3fe-e6552c2aee22 | 2020-08-13 12:11:24 | 2020 | 8 | 13 | 12 | 11 |
| 20200813121124 | 20200813121124_0_4 | uuid:5c2dbea8-9668-4652-84c6-c82d06aa2805 | year=2020/month=8/day=13/hour=12/minute=11 | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4 | 1 | 33.680806905962264 | 2020-08-12T13:33:20.159Z | 5c2dbea8-9668-4652-84c6-c82d06aa2805 | 2020-08-13 12:11:24 | 2020 | 8 | 13 | 12 | 11 |
| 20200813121124 | 20200813121124_0_5 | uuid:528e6c74-bb44-49da-aa76-059781cc7676 | year=2020/month=8/day=13/hour=12/minute=11 | 53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | iotdevice4 | 1 | 31.38529683936205 | 2020-08-13T10:57:58.448Z | 528e6c74-bb44-49da-aa76-059781cc7676 | 2020-08-13 12:11:24 | 2020 | 8 | 13 | 12 | 11 |
+------------------------------------+-------------------------------------+--------------------------------------------+---------------------------------------------+----------------------------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------------------------+----------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
```
However, if I apply a filter on a partition column, the result is empty:
SELECT * FROM iot_device_ro WHERE day=13 LIMIT 10;
```console
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
| iot_device_ro._hoodie_commit_time | iot_device_ro._hoodie_commit_seqno | iot_device_ro._hoodie_record_key | iot_device_ro._hoodie_partition_path | iot_device_ro._hoodie_file_name | iot_device_ro.deviceid | iot_device_ro.sensorid | iot_device_ro.measurement | iot_device_ro.measure_ts | iot_device_ro.uuid | iot_device_ro.its | iot_device_ro.year | iot_device_ro.month | iot_device_ro.day | iot_device_ro.hour | iot_device_ro.minute |
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
+------------------------------------+-------------------------------------+-----------------------------------+---------------------------------------+----------------------------------+-------------------------+-------------------------+----------------------------+---------------------------+---------------------+--------------------+---------------------+----------------------+--------------------+---------------------+-----------------------+
No rows selected (70.91 seconds)
```
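To narrow down where the rows get lost, the plan and the partition metadata can be inspected with standard Hive statements (nothing Hudi-specific here):

```sql
-- Confirm which partitions the metastore knows about for this table.
SHOW PARTITIONS iot_device_ro;

-- Inspect the plan of the failing query; if partition pruning works,
-- only the matching partitions should appear as inputs.
EXPLAIN EXTENDED SELECT * FROM iot_device_ro WHERE day = 13 LIMIT 10;
```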
**To Reproduce**
Steps to reproduce the behavior:
Hudi Datasource Configuration in Spark:
```java
Dataset<Row> output =
    streamingInput.withColumn("its", date_format(current_timestamp(), DATE_FORMAT));

DataStreamWriter<Row> writer =
    output
        .writeStream()
        .format("hudi")
        .option("hoodie.insert.shuffle.parallelism", "2")
        .option("hoodie.upsert.shuffle.parallelism", "2")
        .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), tableType)
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "uuid")
        .option(
            DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(),
            ComplexKeyGenerator.class.getCanonicalName())
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), partitions)
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "its")
        .option(HoodieWriteConfig.TABLE_NAME, tableName)
        .option("checkpointLocation", streamingCheckpointingPath)
        .option(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH_OPT_KEY(), "false")
        .option(DataSourceWriteOptions.STREAMING_RETRY_CNT_OPT_KEY(), "10")
        .outputMode(OutputMode.Append())
        .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), tableName)
        .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), hiveDB)
        .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), hiveJdbcUrl)
        .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY(), "true")
        .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
        .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), partitions)
        .option(
            DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(),
            MultiPartKeysValueExtractor.class.getCanonicalName());
```
Hudi Table Description in Hive:
```console
+--------------------------+------------+----------+
| col_name | data_type | comment |
+--------------------------+------------+----------+
| _hoodie_commit_time | string | |
| _hoodie_commit_seqno | string | |
| _hoodie_record_key | string | |
| _hoodie_partition_path | string | |
| _hoodie_file_name | string | |
| deviceid | string | |
| sensorid | string | |
| measurement | double | |
| measure_ts | string | |
| uuid | string | |
| its | string | |
| year | int | |
| month | int | |
| day | int | |
| hour | int | |
| minute | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| year | int | |
| month | int | |
| day | int | |
| hour | int | |
| minute | int | |
+--------------------------+------------+----------+
```
Hive Partitions created by HudiHiveSync:
```console
+---------------------------------------------+
| partition |
+---------------------------------------------+
| year=2020/month=8/day=13/hour=12/minute=11 |
| year=2020/month=8/day=13/hour=12/minute=17 |
| year=2020/month=8/day=13/hour=12/minute=18 |
| year=2020/month=8/day=13/hour=12/minute=19 |
| year=2020/month=8/day=13/hour=12/minute=20 |
| year=2020/month=8/day=13/hour=12/minute=21 |
```
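As a further check, one of the partitions listed above can be queried explicitly by all five partition columns. If this also comes back empty while the unfiltered query returns rows, the problem is in partition filter evaluation rather than in missing partition metadata:

```sql
-- Probe a single partition that Hive sync reports as registered
-- (values taken from the partition list above).
SELECT COUNT(*) FROM iot_device_ro
WHERE year = 2020 AND month = 8 AND day = 13 AND hour = 12 AND minute = 11;
```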
Additional table information:
```console
Detailed Table Information
--
Database: | xxx|
OwnerType: | USER |
Owner: | xxx|
CreateTime: | Thu Aug 13 12:11:49 UTC 2020 |
LastAccessTime: | UNKNOWN |
Retention: | 0 |
Location: | abfs://[email protected]/data/hudi/streaming/tables/xxx/iot_device |
Table Type: | EXTERNAL_TABLE |
Table Parameters: | |
| EXTERNAL | TRUE
| bucketing_version | 2
| discover.partitions | true
| last_commit_time_sync | 20200813133453
| numFiles | 77
| numPartitions | 68
| numRows | 0
| rawDataSize | 0
| spark.sql.create.version | 2.2 or prior
| spark.sql.sources.schema.numPartCols | 5
| spark.sql.sources.schema.numParts | 1
| spark.sql.sources.schema.part.0 | {\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"deviceid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"sensorid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measurement\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"measure_ts\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"uuid\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"its\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"day\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"hour\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"minute\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
| spark.sql.sources.schema.partCol.0 | year
| spark.sql.sources.schema.partCol.1 | month
| spark.sql.sources.schema.partCol.2 | day
| spark.sql.sources.schema.partCol.3 | hour
| spark.sql.sources.schema.partCol.4 | minute
| totalSize | 34194101
| transient_lastDdlTime | 1597324063
| |
Storage Information
SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe |
InputFormat: | org.apache.hudi.hadoop.HoodieParquetInputFormat |
OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
Compressed: | No |
Num Buckets: | -1 |
Bucket Columns: | [] |
Sort Columns: | [] |
Storage Desc Params: | |
| serialization.format | 1
```
Strangely enough, if I run the same query in Spark, everything works as
expected.
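For reference, the statement that does return rows when run through Spark (e.g. via spark-sql against the same synced table) is the very one that comes back empty in the Hive CLI:

```sql
-- Works in Spark, empty in the Hive CLI.
SELECT * FROM iot_device_ro WHERE day = 13 LIMIT 10;
```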
**Expected behavior**
Filters on partition columns can be applied to Hudi tables in the Hive CLI.
**Environment Description**
* Hudi version : 0.5.3
* Spark version : 2.4.0
* Hive version : 3.1
* Hadoop version : 3
* Storage (HDFS/S3/GCS..) : ADLS
* Running on Docker? (yes/no) : no
What am I missing? Any help is very much appreciated!
Thank you in advance!