cb149 opened a new issue #4830: URL: https://github.com/apache/hudi/issues/4830
**Describe the problem you faced**

I just switched two of my Hudi tables in Impala from PARQUET to HUDIPARQUET by dropping each old table and then creating it again using HUDIPARQUET, as described in the [documentation](https://hudi.apache.org/docs/querying_data/#impala-34-or-later).

One table is partitioned with `year=.../month=...` and is never clustered. This table shows no problems at all.

The other table is partitioned with `year=.../month=.../day=...`. Data is ingested hourly, the previous day is clustered every night, and CLEANER_COMMITS_RETAINED is set to 48.

As expected, I no longer see duplicates in either table in Impala. However, I am getting weird behavior for one day.

If I run

```scala
spark.read.format("hudi").load("<myTable>").count
```

I get `76766150602`.

If I run

```sql
SELECT count(*) FROM myTable;
```

in Impala, I get `76614373360` (this is run after ALTER TABLE RECOVER PARTITIONS and REFRESH table, so that should not be the issue).

Looking at the data, it is missing all the rows from the partition `year=2022/month=2/day=13`, for which a count in Spark returns `151777242`, which matches the difference between the above counts.

If I run

```sql
select day,count(*) from myTable WHERE year=2022 and month=2 group by day ORDER BY day ASC;
```

it shows every day from 1 to 16 except for day=13.

The first time I ran

```sql
select count(*) from myTable WHERE year=2022 and month=1 and day=13;
```

I got 0. Now I am getting `150407032`, but the count should be `151777242`. However, day 13 is still missing from the result when grouping by day.

Looking at the files, `day=14` is clustered and no file slices have been cleaned yet; `day=13` and all prior partitions have already been clustered, and their non-clustered files have been cleaned.

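
For reference, the arithmetic behind the counts above can be checked with a short sketch like the one below. It only restates the numbers quoted in this report. The variable names are illustrative, and `impalaDay13` is assumed to be the later result of the filtered Impala query exactly as quoted above.

```scala
// Counts copied from the report above; the names are illustrative only.
val sparkTotal: Long  = 76766150602L // Spark count over the whole table
val impalaTotal: Long = 76614373360L // Impala count(*) over the whole table
val sparkDay13: Long  = 151777242L   // Spark count for year=2022/month=2/day=13
val impalaDay13: Long = 150407032L   // later result of the filtered Impala query (assumed comparable)

// Rows missing from the unfiltered Impala count.
val missingTotal: Long = sparkTotal - impalaTotal
assert(missingTotal == sparkDay13) // the gap equals the Spark count for day=13

// Rows still missing from the filtered Impala query.
val stillMissing: Long = sparkDay13 - impalaDay13

println(s"missing from the full count: $missingTotal")     // 151777242
println(s"still missing in the filtered query: $stillMissing") // 1370210
```

If the filtered result is comparable, the unfiltered Impala count omits the whole partition, while the filtered query still falls short by 1370210 rows.
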
Is it a coincidence, or could this be caused somehow by the fact that `day=13` is the last partition that has been both clustered and cleaned, with the following days having been clustered but not cleaned (except for the latest day, which hasn't been clustered yet)?

**To Reproduce**

Steps to reproduce the behavior: unknown

**Expected behavior**

HUDIPARQUET should not result in rows missing when querying the table in Impala.

**Environment Description**

* Hudi version : 0.10.0
* Spark version : 2.4.7
* Hive version :
* Hadoop version : 3.1.1
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
