yihua commented on a change in pull request #3527:
URL: https://github.com/apache/hudi/pull/3527#discussion_r701382951



##########
File path: website/blog/2021-08-18-improving-marker-mechanism.md
##########
@@ -0,0 +1,65 @@
+---
+title: "Improving Marker Mechanism in Apache Hudi"
+excerpt: "We introduce a new marker mechanism leveraging the timeline server 
to address performance bottlenecks due to rate-limiting on cloud storage like 
AWS S3."
+author: yihua
+category: blog
+---
+Write operations in an Apache Hudi table use markers to efficiently identify 
the data files written to the file system.  In this blog, we dive into the 
design of the existing direct marker file mechanism and explain its performance 
problem on cloud storage like AWS S3.  We demonstrate how we improve the write 
performance with timeline-server-based markers.
+
+<!--truncate-->
+
+## What a marker is and why write operations on a Hudi table need it
+ 
+A **marker** in Hudi, such as a marker file with a unique filename, is a label 
to indicate that a corresponding data file exists in the file system.  Each 
marker entry is composed of three parts: the data file name, the marker 
extension (`.marker`), and the I/O type (`CREATE`, `MERGE`, or `APPEND`).  For 
example, the marker 
`91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet.marker.CREATE`
 indicates that the corresponding data file is 
`91245ce3-bb82-4f9f-969e-343364159174-0_140-579-0_20210820173605.parquet` and 
the I/O type is `CREATE`.  Before writing each data file, the Hudi write client 
creates a marker first in the file system.  Markers are persistent in the file 
system unless they are explicitly deleted by the write client.  The write 
client deletes all markers when the commit is successful.

Review comment:
       Reworded a bit in a follow-up PR: 
https://github.com/apache/hudi/pull/3588.  I used "marker" instead of "marker 
file" to be general.
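
The naming scheme described in the quoted paragraph (data file name, then `.marker`, then the I/O type) can be illustrated with a small sketch. This is not Hudi's implementation; the helper names `make_marker_name` and `parse_marker_name` and the `Marker` type are hypothetical, and the sketch only assumes the three-part format and the three I/O types stated in the paragraph.

```python
from typing import NamedTuple

# Values taken from the blog paragraph; everything else here is illustrative.
MARKER_EXTENSION = ".marker"
IO_TYPES = ("CREATE", "MERGE", "APPEND")


class Marker(NamedTuple):
    """Hypothetical container for the two pieces of information in a marker name."""
    data_file_name: str
    io_type: str


def make_marker_name(data_file_name: str, io_type: str) -> str:
    """Build a marker name as <data file name>.marker.<I/O type>."""
    if io_type not in IO_TYPES:
        raise ValueError(f"unknown I/O type: {io_type!r}")
    return f"{data_file_name}{MARKER_EXTENSION}.{io_type}"


def parse_marker_name(marker_name: str) -> Marker:
    """Split a marker name back into the data file name and the I/O type."""
    # The I/O type is whatever follows the last dot.
    prefix, sep, io_type = marker_name.rpartition(".")
    if not sep or io_type not in IO_TYPES:
        raise ValueError(f"missing or unknown I/O type: {marker_name!r}")
    if not prefix.endswith(MARKER_EXTENSION):
        raise ValueError(f"missing marker extension: {marker_name!r}")
    data_file_name = prefix[: -len(MARKER_EXTENSION)]
    if not data_file_name:
        raise ValueError(f"missing data file name: {marker_name!r}")
    return Marker(data_file_name, io_type)
```

Applied to the example from the paragraph, `parse_marker_name` would recover the `.parquet` data file name and the `CREATE` I/O type, and `make_marker_name` would reproduce the original marker name from those two parts.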





