This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ebca66b5b1 [DOCS] Add faq, talk and edit community sync page (#7210)
ebca66b5b1 is described below
commit ebca66b5b12ffc625d62d5b6030616e468160129
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Wed Nov 16 08:46:58 2022 -0800
[DOCS] Add faq, talk and edit community sync page (#7210)
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/community/syncs.md | 2 ++
website/docs/faq.md | 16 ++++++++++++++++
website/src/pages/talks.md | 4 +++-
...-10-17-Get_started_with_apache_hudi_using_glue.jpeg | Bin 0 -> 113989 bytes
...a_cost_optimized_glue_pipeline_with_apache_hudi.png | Bin 0 -> 56237 bytes
5 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/website/community/syncs.md b/website/community/syncs.md
index 7cb89ea0f0..c4b4db98de 100644
--- a/website/community/syncs.md
+++ b/website/community/syncs.md
@@ -26,6 +26,8 @@ Every month on the Last Wed, 09:00 AM Pacific Time (US and
Canada)([translate to
Uploaded to [Apache Hudi youtube
channel](https://www.youtube.com/channel/UCs7AhE0BWaEPZSChrBR-Muw) after every
call.
+[Link to slide decks](https://drive.google.com/drive/folders/1hsq-kerUsHDlJ3WDeysMQGnVTmttzHgB?usp=sharing)
+
**Typical agenda**
* \[15 mins\] Progress updates & Plans (PMC member)
diff --git a/website/docs/faq.md b/website/docs/faq.md
index 18b5b0f676..51e3fdb6af 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -91,6 +91,22 @@ As of September 2019, Hudi can support Spark 2.1+, Hive 2.x,
Hadoop 2.7+ (not Ha
At a high level, Hudi is based on MVCC design that writes data to versioned
parquet/base files and log files that contain changes to the base file. All the
files are stored under a partitioning scheme for the dataset, which closely
resembles how Apache Hive tables are laid out on DFS. Please refer
[here](https://hudi.apache.org/docs/concepts/) for more details.
+### How does Hudi handle partition evolution requirements?
+Hudi recommends keeping coarse-grained, top-level partition paths, e.g., date(ts), and within each such partition doing clustering in a flexible way, such as z-ordering or sorting data based on columns of interest. This provides excellent performance by minimizing the number of files in each partition, while still packing data that will be queried together physically closer (what partitioning aims to achieve).
+
+Let's take the example of a table storing log events with two fields: `ts` (the time at which the event was produced) and `cust_id` (the user for which the event was produced). A common option is to partition by both date(ts) and cust_id.
+Some users may want to start granular with hour(ts) and then later evolve to a new partitioning scheme, say date(ts). But this means the number of partitions in the table could be very high: 365 days x 1K customers = at least 365K potentially small parquet files, which can significantly slow down queries and face throttling issues on the actual S3/DFS reads.
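The partition-count arithmetic above can be checked with a quick back-of-the-envelope calculation (the day and customer counts here are the illustrative numbers from the text, not measured values):

```python
# Back-of-the-envelope partition counts for the log_events example.
days = 365          # one year of date(ts) values
customers = 1_000   # distinct cust_id values

# Partitioning by date(ts) x cust_id:
date_cust_partitions = days * customers
print(date_cust_partitions)  # 365000 partitions, each holding at least one small file

# Starting granular with hour(ts) x cust_id is ~24x worse:
hour_cust_partitions = days * 24 * customers
print(hour_cust_partitions)  # 8760000 partitions
```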
+
+For the aforementioned reasons, we don't recommend mixing different partitioning schemes within the same table, since it adds operational complexity and unpredictable performance.
+Old data stays in old partitions and only new data lands in the newer, evolved partitions. If you want to tidy up the table, you have to rewrite all partitions/data anyway! This is why we suggest starting with coarse-grained partitions
+and leaning on clustering techniques to optimize for query performance.
+
+We find that most datasets have at least one high-fidelity field that can be used as a coarse partition. Clustering strategies in Hudi provide a lot of power: you can alter which partitions to cluster, which fields to cluster by, etc.
+Unlike Hive partitioning, Hudi does not remove the partition field from the data files, i.e., if you write new partition paths, it does not mean old partitions need to be rewritten.
+Partitioning by itself is a relic of the Hive era; Hudi is working on replacing partitioning with database-like indexing schemes/functions,
+for even more flexibility and to get away from the Hive-style partition evolution route.
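As a concrete illustration of "coarse partitions plus clustering", the write options might look like the sketch below. This is a hypothetical configuration, not part of this commit: the option keys are standard Hudi clustering configs, but the field names and values are made up for the log_events example and should be verified against the Hudi version in use.

```python
# Hypothetical Hudi write options: coarse date partition, inline clustering
# that sorts records within each partition by a column of interest.
hudi_options = {
    "hoodie.table.name": "log_events",
    "hoodie.datasource.write.recordkey.field": "event_id",   # assumed key field
    "hoodie.datasource.write.partitionpath.field": "date",   # coarse-grained: date(ts)
    "hoodie.datasource.write.precombine.field": "ts",
    # Inline clustering: periodically rewrite small files within partitions.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Pack records queried together physically closer, e.g. by cust_id.
    "hoodie.clustering.plan.strategy.sort.columns": "cust_id",
}

# With a SparkSession `spark`, a DataFrame `df`, and a `base_path` in scope,
# the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The point of the sketch is that the partition path stays coarse (one directory per date) while clustering, not partitioning, handles the finer-grained data layout.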
+
+
## Using Hudi
### What are some ways to write a Hudi dataset?
diff --git a/website/src/pages/talks.md b/website/src/pages/talks.md
index 0d2b98e4eb..57756f0ece 100644
--- a/website/src/pages/talks.md
+++ b/website/src/pages/talks.md
@@ -96,4 +96,6 @@ Data Summit Connect, May, 2021
41. ["Presto Tech Talk: Optimizing table layout for Presto using Apache
Hudi"](https://www.youtube.com/watch?v=J1JuHVFdggs) - By Ethan Guo and Vinoth
Chandar. Presto Meetup. Jun 23, 2022
-42. ["PrestoDB and Apache Hudi for the
Lakehouse"](https://www.youtube.com/watch?v=3zQJR-IGH0Y&list=PLJVeO1NMmyqXHoLuUJtulMDU0yBgSL0GH&index=11)
- By Sagar Sumit and Bhavani Sudha Saktheeswaran. PrestoCon Day. Jul 21, 2022
+42. ["PrestoDB and Apache Hudi for the
Lakehouse"](https://www.youtube.com/watch?v=3zQJR-IGH0Y&list=PLJVeO1NMmyqXHoLuUJtulMDU0yBgSL0GH&index=11)
- By Sagar Sumit and Bhavani Sudha Saktheeswaran. PrestoCon Day. Jul 21, 2022
+
+43. ["Petabyte-scale lakehouses with dbt and Apache
Hudi"](https://youtu.be/aTn5dkm6rqQ) - By Vinoth Govindarajan and Vinoth
Chandar. Oct 17, 2022
diff --git
a/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
b/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
new file mode 100644
index 0000000000..d102bf2b33
Binary files /dev/null and
b/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
differ
diff --git
a/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
b/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
new file mode 100644
index 0000000000..498fb0f4e7
Binary files /dev/null and
b/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
differ