eg-kazakov opened a new issue, #10258:
URL: https://github.com/apache/hudi/issues/10258

   **Describe the problem you faced**
   
   When Spark tries to load a Hudi dataframe from the datasource with the following Hudi timeline setup:
   ```
   ./.hoodie:
   total 15392
   drwxr-xr-x  9 ekazakov  staff      288 Dec  6 20:35 .
   drwxr-xr-x  5 ekazakov  staff      160 Dec  6 23:00 ..
   -rw-r--r--  1 ekazakov  staff   257944 Dec  6 13:50 20231206064834739.replacecommit
   -rw-r--r--  1 ekazakov  staff      123 Dec  6 13:50 20231206064834739.replacecommit.inflight
   -rw-r--r--  1 ekazakov  staff     2522 Dec  6 13:48 20231206064834739.replacecommit.requested
   -rw-r--r--  1 ekazakov  staff  3350228 Dec  6 14:06 20231206065013087.clean
   -rw-r--r--  1 ekazakov  staff  2128543 Dec  6 14:01 20231206065013087.clean.inflight
   -rw-r--r--  1 ekazakov  staff  2128543 Dec  6 14:01 20231206065013087.clean.requested
   -rw-r--r--  1 ekazakov  staff      885 Nov 23 10:44 hoodie.properties
   -rw-r--r--  1 ekazakov  staff      885 Nov 23 10:44 hoodie.properties
   
   ```
   I am getting:
   org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_commit_seqno`, `_hoodie_commit_time`, `_hoodie_file_name`, `_hoodie_partition_path`, `_hoodie_record_key`
   
   and it fails to load the existing data.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. I've attached my Hudi table structure here: [google drive](https://drive.google.com/drive/folders/13L21rtKr4Ov4qFg-_0C3ZC6gifb2JeZM?usp=sharing)
   2. Here is the code to reproduce it:
   ```scala
   import org.apache.hudi.DataSourceReadOptions
   import org.apache.hudi.DataSourceReadOptions._

   val df = sparkSession.sqlContext.read
     .format("hudi")
     .option(QUERY_TYPE.key, QUERY_TYPE_INCREMENTAL_OPT_VAL)
     .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/movement_type=*/{year=2023/month=12/day=05,year=2023/month=12/day=04}/provider=*/qk=*/*.parquet")
     .option(BEGIN_INSTANTTIME.key, "20231205063117")
     .load("hudi-output-duplicated-columns/")
   ```
   
   
   **Expected behavior**
   
   I expect the data to load from the Hudi table without failures.
   
   **Environment Description**
   
   * Hudi version : 0.13.0
   * Spark version : 3.3.2
   
   
   
   **Additional context**
   While debugging, I noticed that the columns got duplicated:
   <img width="981" alt="Screenshot 2023-12-06 at 22 08 03" src="https://github.com/apache/hudi/assets/16684232/fc385fe3-2ed9-4ae1-b83e-fe55d284133f">
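   To illustrate what the failure looks like, here is a minimal sketch in plain Scala (no Spark dependency; `DuplicateColumnCheck` and `findDuplicates` are hypothetical names, not Spark or Hudi APIs) of the case-insensitive duplicate-name check Spark applies to a data schema before reading. When the five Hudi meta columns appear twice, as in the screenshot, a check like this is what surfaces them in the `AnalysisException` message:

   ```scala
   // Hypothetical sketch of Spark's duplicate-column detection on a schema's
   // field names (Spark compares case-insensitively by default).
   object DuplicateColumnCheck {
     def findDuplicates(columns: Seq[String], caseSensitive: Boolean = false): Seq[String] = {
       val normalized = if (caseSensitive) columns else columns.map(_.toLowerCase)
       normalized
         .groupBy(identity)
         .collect { case (name, occurrences) if occurrences.size > 1 => name }
         .toSeq
         .sorted
     }

     def main(args: Array[String]): Unit = {
       val metaCols = Seq("_hoodie_commit_time", "_hoodie_commit_seqno",
         "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name")
       // Schema where the meta columns got appended twice, as seen in the debugger:
       val schema = metaCols ++ metaCols ++ Seq("year", "month", "day")
       println(findDuplicates(schema).mkString(", "))
     }
   }
   ```

   Running this prints the five meta columns, matching the column list reported in the exception above.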
   
   
   

