[GitHub] [hudi] cb149 opened a new issue #4830: [SUPPORT] HUDIPARQUET missing the last cleaned partition

GitBox Wed, 16 Feb 2022 12:44:01 -0800


cb149 opened a new issue #4830:
URL: https://github.com/apache/hudi/issues/4830



   **Describe the problem you faced**
   
   I just switched two of my Hudi tables in Impala from PARQUET to HUDIPARQUET, 
by dropping the old table first and then creating it again using HUDIPARQUET as 
described in the 
[documentation](https://hudi.apache.org/docs/querying_data/#impala-34-or-later).
   
   One table is partitioned with `year=.../month=...` and is never clustered. 
This table shows no problems at all.
   
   The other table is partitioned with `year=.../month=.../day=...`, data is 
ingested hourly, every night the previous day is clustered and 
CLEANER_COMMITS_RETAINED is set to 48.
   
   As expected, I no longer see duplicates in either table in Impala. However, 
I am seeing strange behavior for one day in the second table.
   
   If I run
   ```scala
   spark.read.format("hudi").load("<myTable>").count
   ```
   I get `76766150602`
   
   If I run
   ```sql
   SELECT count(*) FROM myTable;
   ```
   in Impala, I get `76614373360` (this was run after ALTER TABLE RECOVER 
PARTITIONS and REFRESH on the table, so that should not be the issue).
   
   Looking at the data, Impala is missing all the rows from the partition 
`year=2022/month=2/day=13`, for which a count in Spark returns `151777242`. 
This matches the difference between the two counts above.
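   
   The arithmetic can be checked directly. This is only a sketch that 
restates the three counts reported above; the object and value names are 
made up for illustration.
   
   ```scala
   object CountCheck {
     // Counts copied from the report above.
     val sparkTotal: Long = 76766150602L   // Spark count over the whole table
     val impalaTotal: Long = 76614373360L  // Impala SELECT count(*)
     val day13InSpark: Long = 151777242L   // Spark count for year=2022/month=2/day=13

     // Rows that Spark sees but Impala does not.
     val missing: Long = sparkTotal - impalaTotal

     def main(args: Array[String]): Unit = {
       println(s"missing in Impala: $missing")
       println(s"equals day=13 count in Spark: ${missing == day13InSpark}")
     }
   }
   ```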
   
   If I run 
   ```sql
   select day,count(*) from myTable WHERE year=2022 and month=2 group by day 
ORDER BY day ASC;
   ```
   it shows every day from 1 to 16 except for day=13.
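   
   For comparison, the same per-day breakdown can be taken on the Spark side. 
A minimal sketch, assuming a spark-shell session (so `spark` is already 
defined), the same `<myTable>` placeholder path as above, and that `year`, 
`month` and `day` are readable as columns:
   
   ```scala
   // Assumes spark-shell: `spark` is the predefined SparkSession.
   // "<myTable>" is the same placeholder base path used above.
   val perDay = spark.read.format("hudi").load("<myTable>").
     filter("year = 2022 AND month = 2").
     groupBy("day").
     count().
     orderBy("day")

   // Print up to 31 rows (one per day) without truncating values.
   perDay.show(31, false)
   ```
   
   Any day that appears here but not in the Impala result is a partition that 
Impala is not reading.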
   
   The first time I ran 
   ```sql
   select count(*) from myTable WHERE year=2022 and month=1 and day=13;
   ```
   I got 0; now I am getting `150407032`, but the count should be `151777242`. 
However, `day=13` is still missing from the result of the query grouped by day.
   
   Looking at the files, `day=14` is clustered and none of its file slices have 
been cleaned yet; `day=13` and all prior partitions have already been clustered 
and their non-clustered files have been cleaned.
   
   Is it a coincidence, or could this be caused by the fact that 
`day=13` is the last partition that has been both clustered and cleaned, with 
the following days having been clustered but not cleaned (except for the latest 
day, which hasn't been clustered yet)?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   unknown
   
   **Expected behavior**
   
   HUDIPARQUET should not result in rows missing when querying the table in 
Impala.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 2.4.7
   
   * Hive version :
   
   * Hadoop version : 3.1.1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   


