BenjMaq opened a new issue #4154:
URL: https://github.com/apache/hudi/issues/4154
**Describe the problem you faced**
_Disclaimer: this concerns creating and inserting into external Hive tables stored on S3._
- The `INSERT OVERWRITE` operation does not work when using Spark SQL. When
running `INSERT OVERWRITE` on an existing partition, the Parquet files are
correctly created (I can see them in S3), but the partition (metadata?) does
not get updated. When selecting from the same table, the old files are still
queried instead of the new ones.
- Likewise, when running `INSERT OVERWRITE` on an empty table, the Parquet
files are correctly created (I see them in S3), but selecting from the table
returns an empty result set.
**To Reproduce**
1. Create an external table
```
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
id bigint,
name string,
dt string
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
type = 'cow'
)
PARTITIONED by (dt);
```
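For what it's worth, the `_hoodie_record_key` values in the tables below are generated UUIDs, which suggests no record key was configured for the table. A variant of the DDL above that pins the record key and precombine field might be worth trying; the `primaryKey` and `preCombineField` option names here are my assumption based on Hudi 0.9's Spark SQL DDL, not something taken from this repro:
```
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
id bigint,
name string,
dt string
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
type = 'cow',
-- assumed option names: pin the record key and the ordering/precombine field
primaryKey = 'id',
preCombineField = 'dt'
)
PARTITIONED by (dt);
```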
2. Insert some data
```
insert into <schema>.<table_name>
values
(1, 'a1', '2021-11-29'),
(2, 'a2', '2021-11-29')
;
```
3. Check the table (I use Presto/Hive)
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | name | dt |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 20211129094605 | 20211129094605_0_1 | 65ff84b9-9733-48f6-bb79-d22b239f81c1 | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 1 | a1 | 2021-11-29 |
| 20211129094605 | 20211129094605_0_2 | 73712448-b8a5-4ec1-a363-f667a520261b | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 2 | a2 | 2021-11-29 |
4. Run `insert overwrite` statement
```
insert overwrite table <schema>.<table_name>
values
(3, 'a3', '2021-11-29'),
(4, 'a4', '2021-11-29')
;
```
5. Check the table again => it is the same as before. The `insert overwrite`
statement had no effect.
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | name | dt |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 20211129094605 | 20211129094605_0_1 | 65ff84b9-9733-48f6-bb79-d22b239f81c1 | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 1 | a1 | 2021-11-29 |
| 20211129094605 | 20211129094605_0_2 | 73712448-b8a5-4ec1-a363-f667a520261b | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 2 | a2 | 2021-11-29 |
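One diagnostic that might help narrow this down (a sketch; the read-side explanation is my assumption, not something confirmed here): query the same table from Spark SQL itself after the overwrite. Hudi records an `INSERT OVERWRITE` as a `replacecommit` instant on its timeline, so if Spark returns the new rows while Presto/Hive still returns the old ones, the write likely succeeded and the querying engine is not honoring the replaced file groups:
```
-- Run from the same Spark SQL session used for the writes.
REFRESH TABLE <schema>.<table_name>;
SELECT id, name, dt FROM <schema>.<table_name> ORDER BY id;
-- New rows here + old rows in Presto/Hive => read-side issue,
-- not a failed INSERT OVERWRITE.
```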
**Expected behavior**
I would expect the values of `id` and `name` in the example above to become
`3, 4` and `a3, a4`, respectively.
**Environment Description**
* Hudi version : 0.9.0
* Spark version : 2.4.4
* Hive version : 2.3.5
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No