[GitHub] [hudi] mkk1490 opened a new issue #3400: [SUPPORT] CoW table data size increasing x times the original data size for x number of runs

GitBox Wed, 04 Aug 2021 01:13:15 -0700


mkk1490 opened a new issue #3400:
URL: https://github.com/apache/hudi/issues/3400

**Describe the problem you faced**
Existing data architecture in the datalake:
1. A full snapshot of the data is carried over to a new partition on every incremental run:
   * IDL = 15.6 GB
   * IDL + 1 = 15.9 GB
   * IDL + 2 = 16.2 GB
   * Total = 47.7 GB

I'm working on a PoC with Hudi to avoid re-snapshotting the data on every run, using a CoW table for upserts. The record count matches between the CoW table and the latest snapshot of my datalake table, but the size difference is huge. The three snapshots in the datalake total 47.7 GB, while the Hudi table grew as follows:
* Hudi IDL = 17.6 GB
* IDL + 1 = 36.5 GB
* IDL + 2 = 54.4 GB

The Hudi table is now 54.4 GB.

From a storage perspective, this defeats the purpose of migrating to Hudi tables.
The compute time also increased by nearly 50% for the 3rd run.
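
As a quick sanity check on the figures above, here is a minimal Python sketch of the arithmetic. It assumes the Hudi figures are the cumulative table size after each run and that 47.7 GB is the sum of the three datalake snapshots; both readings are inferred from the numbers rather than stated explicitly.

```python
# Sizes in GB as reported in the issue.
datalake_snapshots = [15.6, 15.9, 16.2]   # one full snapshot per run
hudi_cumulative = [17.6, 36.5, 54.4]      # assumed: table size after each run

# Total storage used by the snapshot-per-run layout.
datalake_total = round(sum(datalake_snapshots), 1)

# Storage added by each Hudi run: first load, then differences between runs.
hudi_growth = [hudi_cumulative[0]] + [
    round(after - before, 1)
    for before, after in zip(hudi_cumulative, hudi_cumulative[1:])
]

# How much larger the Hudi table is than the latest datalake snapshot.
ratio_to_latest = round(hudi_cumulative[-1] / datalake_snapshots[-1], 2)

print(datalake_total, hudi_growth, ratio_to_latest)
```

Under these assumptions, each upsert run adds roughly one full copy of the data to the Hudi table, which is the growth pattern described in the title.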

**To Reproduce**

Steps to reproduce the behavior:

1. Load sample data into an external table on S3 and check its size
2. Upsert a few records into the table and check the size again
3. Check whether the size on S3 has doubled
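
For illustration, the upsert in step 2 might look like the following PySpark sketch. The table name, column names, and S3 path are hypothetical placeholders, not the values from the actual job; the option keys are standard Hudi datasource write options, and `incremental_df` stands for a Spark DataFrame holding the changed records.

```python
def hudi_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi datasource options for an upsert into a CoW table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


def upsert(df, base_path, options):
    """Write a Spark DataFrame to the Hudi table at base_path in append mode."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical names, for illustration only.
options = hudi_upsert_options("poc_table", "id", "updated_at", "load_date")
# upsert(incremental_df, "s3://<bucket>/<prefix>/poc_table", options)
```

Running the upsert once per incremental batch and checking the size of the table path on S3 after each run reproduces steps 2 and 3.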

**Expected behavior**
The table size should not increase X times over X runs.

**Environment Description**

* Hudi version : 0.7.0 installed in EMR 5.33

* Spark version : 2.4.7

* Hive version : 2.3.7

* Hadoop version : Amazon 2.10.1

* Storage (HDFS/S3/GCS..) : s3

* Running on Docker? (yes/no) : No

**Additional context**

Datalake data size:

![image](https://user-images.githubusercontent.com/16716227/128146110-4d2fbeaa-2442-4a69-84aa-e080bb5824c7.png)

Hudi data size for the same partitions:

![image](https://user-images.githubusercontent.com/16716227/128146227-1ab72588-0a99-4c2f-94d8-29b3bbe9ce05.png)

The data count matches for both datalake and Hudi table.
