This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new ebca66b5b1 [DOCS] Add faq, talk and edit community sync page (#7210)
ebca66b5b1 is described below
commit ebca66b5b12ffc625d62d5b6030616e468160129
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Wed Nov 16 08:46:58 2022 -0800
[DOCS] Add faq, talk and edit community sync page (#7210)
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/community/syncs.md | 2 ++
website/docs/faq.md | 16 ++++++++++++++++
website/src/pages/talks.md | 4 +++-
...-10-17-Get_started_with_apache_hudi_using_glue.jpeg | Bin 0 -> 113989 bytes
...a_cost_optimized_glue_pipeline_with_apache_hudi.png | Bin 0 -> 56237 bytes
5 files changed, 21 insertions(+), 1 deletion(-)
diff --git a/website/community/syncs.md b/website/community/syncs.md
index 7cb89ea0f0..c4b4db98de 100644
--- a/website/community/syncs.md
+++ b/website/community/syncs.md
@@ -26,6 +26,8 @@ Every month on the Last Wed, 09:00 AM Pacific Time (US and
Canada)([translate to
Uploaded to [Apache Hudi youtube
channel](https://www.youtube.com/channel/UCs7AhE0BWaEPZSChrBR-Muw) after every
call.
+[Link to slide decks](https://drive.google.com/drive/folders/1hsq-kerUsHDlJ3WDeysMQGnVTmttzHgB?usp=sharing)
+
**Typical agenda**
* \[15 mins\] Progress updates & Plans (PMC member)
diff --git a/website/docs/faq.md b/website/docs/faq.md
index 18b5b0f676..51e3fdb6af 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -91,6 +91,22 @@ As of September 2019, Hudi can support Spark 2.1+, Hive 2.x,
Hadoop 2.7+ (not Ha
At a high level, Hudi is based on MVCC design that writes data to versioned
parquet/base files and log files that contain changes to the base file. All the
files are stored under a partitioning scheme for the dataset, which closely
resembles how Apache Hive tables are laid out on DFS. Please refer
[here](https://hudi.apache.org/docs/concepts/) for more details.
+### How does Hudi handle partition evolution requirements?
+Hudi recommends keeping coarse-grained, top-level partition paths, e.g., date(ts), and within each such partition doing clustering in a flexible way, such as z-ordering or sorting data based on columns of interest. This provides excellent performance by minimizing the number of files in each partition, while still packing data that will be queried together physically closer (what partitioning aims to achieve).
+
+Let's take the example of a table storing log events with two fields: `ts` (the time at which the event was produced) and `cust_id` (the user for which the event was produced). A common option is to partition by both date(ts) and cust_id.
+Some users may want to start granular with hour(ts) and then later evolve to a new partitioning scheme, say date(ts). But this means the number of partitions in the table could be very high: 365 days x 1K customers = at least 365K potentially small parquet files, which can significantly slow down queries and face throttling issues on the actual S3/DFS reads.
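The partition-count arithmetic above can be checked with a quick back-of-the-envelope calculation (the day and customer counts here are the illustrative numbers from the text, not measured values):

```python
# Back-of-the-envelope partition counts for the log_events example.
days = 365          # one year of date(ts) values
customers = 1_000   # distinct cust_id values

# Partitioning by date(ts) x cust_id:
date_cust_partitions = days * customers
print(date_cust_partitions)  # 365000 partitions, each holding at least one small file

# Starting granular with hour(ts) x cust_id is ~24x worse:
hour_cust_partitions = days * 24 * customers
print(hour_cust_partitions)  # 8760000 partitions
```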
+
+For the aforementioned reasons, we don't recommend mixing different partitioning schemes within the same table, since it adds operational complexity and unpredictable performance.
+Old data stays in old partitions and only new data lands in the newer, evolved partitions. If you want to tidy up the table, you have to rewrite all partitions/data anyway! This is why we suggest starting with coarse-grained partitions
+and leaning on clustering techniques to optimize for query performance.
+
+We find that most datasets have at least one high-fidelity field that can be used as a coarse partition. Clustering strategies in Hudi provide a lot of power: you can alter which partitions to cluster, which fields to cluster by, etc.
+Unlike Hive partitioning, Hudi does not remove the partition field from the data files, i.e., if you write new partition paths, it does not mean old partitions need to be rewritten.
+Partitioning by itself is a relic of the Hive era; Hudi is working on replacing partitioning with database-like indexing schemes/functions,
+for even more flexibility and to get away from the Hive-style partition evolution route.
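As a concrete illustration of "coarse partitions plus clustering", the write options might look like the sketch below. This is a hypothetical configuration, not part of this commit: the option keys are standard Hudi clustering configs, but the field names and values are made up for the log_events example and should be verified against the Hudi version in use.

```python
# Hypothetical Hudi write options: coarse date partition, inline clustering
# that sorts records within each partition by a column of interest.
hudi_options = {
    "hoodie.table.name": "log_events",
    "hoodie.datasource.write.recordkey.field": "event_id",   # assumed key field
    "hoodie.datasource.write.partitionpath.field": "date",   # coarse-grained: date(ts)
    "hoodie.datasource.write.precombine.field": "ts",
    # Inline clustering: periodically rewrite small files within partitions.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Pack records queried together physically closer, e.g. by cust_id.
    "hoodie.clustering.plan.strategy.sort.columns": "cust_id",
}

# With a SparkSession `spark`, a DataFrame `df`, and a `base_path` in scope,
# the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The point of the sketch is that the partition path stays coarse (one directory per date) while clustering, not partitioning, handles the finer-grained data layout.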
+
+
## Using Hudi
### What are some ways to write a Hudi dataset?
diff --git a/website/src/pages/talks.md b/website/src/pages/talks.md
index 0d2b98e4eb..57756f0ece 100644
--- a/website/src/pages/talks.md
+++ b/website/src/pages/talks.md
@@ -96,4 +96,6 @@ Data Summit Connect, May, 2021
41. ["Presto Tech Talk: Optimizing table layout for Presto using Apache
Hudi"](https://www.youtube.com/watch?v=J1JuHVFdggs) - By Ethan Guo and Vinoth
Chandar. Presto Meetup. Jun 23, 2022
-42. ["PrestoDB and Apache Hudi for the
Lakehouse"](https://www.youtube.com/watch?v=3zQJR-IGH0Y&list=PLJVeO1NMmyqXHoLuUJtulMDU0yBgSL0GH&index=11)
- By Sagar Sumit and Bhavani Sudha Saktheeswaran. PrestoCon Day. Jul 21, 2022
+42. ["PrestoDB and Apache Hudi for the
Lakehouse"](https://www.youtube.com/watch?v=3zQJR-IGH0Y&list=PLJVeO1NMmyqXHoLuUJtulMDU0yBgSL0GH&index=11)
- By Sagar Sumit and Bhavani Sudha Saktheeswaran. PrestoCon Day. Jul 21, 2022
+
+43. ["Petabyte-scale lakehouses with dbt and Apache
Hudi"](https://youtu.be/aTn5dkm6rqQ) - By Vinoth Govindarajan and Vinoth
Chandar. Oct 17, 2022
diff --git
a/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
b/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
new file mode 100644
index 0000000000..d102bf2b33
Binary files /dev/null and
b/website/static/assets/images/blog/2022-10-17-Get_started_with_apache_hudi_using_glue.jpeg
differ
diff --git
a/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
b/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
new file mode 100644
index 0000000000..498fb0f4e7
Binary files /dev/null and
b/website/static/assets/images/blog/2022-11-10_How_to_build_a_cost_optimized_glue_pipeline_with_apache_hudi.png
differ