BenjMaq opened a new issue #4154:
URL: https://github.com/apache/hudi/issues/4154
**Describe the problem you faced**
_Disclaimer: this concerns creating and inserting into external Hive tables stored on S3._
- The `INSERT OVERWRITE` operation does not work when using Spark SQL. When
running `INSERT OVERWRITE` on an existing partition, the Parquet files are
correctly created (I can see them in S3), but the partition (metadata?) does
not get updated. When selecting from the same table, the old files are still
queried instead of the new ones.
- Likewise, when running `INSERT OVERWRITE` on an empty table, the Parquet
files are correctly created (I see them in S3), but selecting from the table
returns an empty result set.
**To Reproduce**
1. Create an external table
```
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
id bigint,
name string,
dt string
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
type = 'cow'
)
PARTITIONED by (dt);
```
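For what it's worth, the `_hoodie_record_key` values in the tables below are generated UUIDs, which suggests no record key was configured for the table. A variant of the DDL above that pins the record key and precombine field might be worth trying; the `primaryKey` and `preCombineField` option names here are my assumption based on Hudi 0.9's Spark SQL DDL, not something taken from this repro:
```
CREATE TABLE IF NOT EXISTS <schema>.<table_name> (
id bigint,
name string,
dt string
) USING hudi
LOCATION 's3a://<bucket>/<object>'
OPTIONS (
type = 'cow',
-- assumed option names: pin the record key and the ordering/precombine field
primaryKey = 'id',
preCombineField = 'dt'
)
PARTITIONED by (dt);
```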
2. Insert some data
```
insert into <schema>.<table_name>
values
(1, 'a1', '2021-11-29'),
(2, 'a2', '2021-11-29')
;
```
3. Check the table (I use Presto/Hive)
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | name | dt |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 20211129094605 | 20211129094605_0_1 | 65ff84b9-9733-48f6-bb79-d22b239f81c1 | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 1 | a1 | 2021-11-29 |
| 20211129094605 | 20211129094605_0_2 | 73712448-b8a5-4ec1-a363-f667a520261b | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 2 | a2 | 2021-11-29 |
4. Run `insert overwrite` statement
```
insert overwrite table <schema>.<table_name>
values
(3, 'a3', '2021-11-29'),
(4, 'a4', '2021-11-29')
;
```
5. Check the table again => it is the same as before. The `insert overwrite`
statement had no effect.
| _hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | id | name | dt |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 20211129094605 | 20211129094605_0_1 | 65ff84b9-9733-48f6-bb79-d22b239f81c1 | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 1 | a1 | 2021-11-29 |
| 20211129094605 | 20211129094605_0_2 | 73712448-b8a5-4ec1-a363-f667a520261b | dt=2021-11-29 | e80b1f6e-d066-416f-8442-95c4172c94cf-0_0-5-8_20211129094605.parquet | 2 | a2 | 2021-11-29 |
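One diagnostic that might help narrow this down (a sketch; the read-side explanation is my assumption, not something confirmed here): query the same table from Spark SQL itself after the overwrite. Hudi records an `INSERT OVERWRITE` as a `replacecommit` instant on its timeline, so if Spark returns the new rows while Presto/Hive still returns the old ones, the write likely succeeded and the querying engine is not honoring the replaced file groups:
```
-- Run from the same Spark SQL session used for the writes.
REFRESH TABLE <schema>.<table_name>;
SELECT id, name, dt FROM <schema>.<table_name> ORDER BY id;
-- New rows here + old rows in Presto/Hive => read-side issue,
-- not a failed INSERT OVERWRITE.
```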
**Expected behavior**
I would expect the values of `id` and `name` in the example above to become
`3, 4` and `a3, a4`, respectively.
**Environment Description**
* Hudi version : 0.9.0
* Spark version : 2.4.4
* Hive version : 2.3.5
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No