vinothchandar commented on a change in pull request #4344:
URL: https://github.com/apache/hudi/pull/4344#discussion_r770993997
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
Review comment:
I think it has to be "vinoth" to link to my user profile?
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
+category: blog
+---
+
+Transactions on data lakes are now considered a key characteristic of a Lakehouse. But what has actually been accomplished so far? What are the current approaches? How do they fare in real-world scenarios? These questions are the focus of this blog.
+
+<!--truncate-->
+
+Having had the good fortune of working on diverse database projects - an RDBMS
([Oracle](https://www.oracle.com/database/)), a NoSQL key-value store
([Voldemort](https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits)),
a streaming database
([ksqlDB](https://www.confluent.io/blog/ksqldb-pull-queries-high-availability/)),
a closed-source real-time datastore, and of course Apache Hudi, I can safely say that the nature of workloads deeply influences the concurrency control mechanisms adopted in different databases. This blog will also describe how we rethought concurrency control for the data lake in Apache Hudi.
+
+First, let's set the record straight. RDBMSes offer the richest set of transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html)
and maintaining
[indexes](https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html).
NoSQL stores offer dramatically weaker guarantees, like eventual consistency and simple row-level atomicity, in exchange for greater scalability on simpler workloads. Drawing a similar parallel, traditional data warehouses offer more or less the full set of capabilities you would find in an RDBMS, over columnar data, with locking and key constraints [enforced](https://docs.teradata.com/r/a8IdS6iVHR77Z9RrIkmMGg/wFPZS4jwZgSG21GnOIpEsw), whereas cloud data warehouses seem to have focused far more on separating storage and compute in their architecture, while offering fewer isolation levels. As a surprising example, there is [no enforcement](https://docs.snowflake.com/en/sql-reference/constraints-overview.html#supported-constraint-types) of key constraints!
+
+# Pitfalls in Lake Concurrency Control
+
+Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC, jobs take a table-level lock to check if they have impacted overlapping files and, if a conflict exists, they abort their operations completely. Without naming names, the lock is sometimes even just a JVM-level lock held on a single Apache Spark driver node. Once again, this may be okay for lightweight coordination of old-school batch jobs that mostly append files to tables, but it cannot be applied broadly to modern data lake workloads. Such approaches are built with immutable/append-only data models in mind, which are inadequate for incremental data processing or keyed updates/deletes. OCC is very optimistic that real contention never happens. Developer evangelism comparing OCC to the full-fledged transactional capabilities of an RDBMS or a traditional data warehouse is rather misinformed. Quoting Wikipedia directly: "_if contention for data resources is frequent, the cost of repeatedly restarting transactions hurts performance significantly, in which case other_ [_concurrency control_](https://en.wikipedia.org/wiki/Concurrency_control) _methods may be better suited._" When conflicts do occur, they can cause massive resource wastage, since a batch job that ran for a few hours fails at commit time, on every attempt!
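To make the abort-on-overlap behavior concrete, here is a minimal, hypothetical sketch of table-snapshot OCC (not any particular engine's implementation): each writer snapshots file versions when it begins and, at commit time under a table-level lock, aborts if anything it touched has changed.

```python
# Hypothetical sketch of table-snapshot OCC, for illustration only.

class ConflictError(Exception):
    pass

class OccTable:
    def __init__(self):
        self.file_versions = {}  # file name -> committed version number

    def begin(self):
        # Snapshot the state this transaction will validate against.
        return dict(self.file_versions)

    def commit(self, snapshot, touched_files):
        # Pretend the table-level lock is held for this whole method.
        for f in touched_files:
            if self.file_versions.get(f, 0) != snapshot.get(f, 0):
                # Any overlap aborts the WHOLE job, however long it ran.
                raise ConflictError(f"{f} changed since transaction began")
        for f in touched_files:
            self.file_versions[f] = self.file_versions.get(f, 0) + 1

table = OccTable()
snap_a, snap_b = table.begin(), table.begin()
table.commit(snap_a, ["part-0001.parquet"])      # writer A wins the race
try:
    table.commit(snap_b, ["part-0001.parquet"])  # writer B overlapped
except ConflictError:
    pass  # B's hours of work are thrown away and must be redone
```

Note that validation happens only at the very end, which is exactly why a long-running job pays the full cost of its work before discovering the conflict.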
+
+Imagine a real-life scenario of two writer processes: an ingest job producing new data every 30 minutes, and a deletion job enforcing GDPR that takes 2 hours to issue deletes. With random deletes, these are very likely to touch overlapping files, and the deletion job is almost guaranteed to starve and fail to commit each time. In database speak, mixing long-running transactions with optimism leads to disappointment, since the longer the transactions, the higher the probability that they will overlap.
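A back-of-the-envelope calculation shows why starvation is near-certain here; the per-commit overlap probability of 0.9 is an assumed, illustrative figure, not a measurement.

```python
# Rough sketch of the scenario above: a 2-hour deletion job must survive
# every ingest commit that lands while it runs, or redo all of its work.

INGEST_PERIOD_MIN = 30    # ingest writer commits every 30 minutes
DELETE_RUNTIME_MIN = 120  # deletion writer needs 2 hours per attempt

# Ingest commits that land during one attempt of the deletion job.
commits_per_attempt = DELETE_RUNTIME_MIN // INGEST_PERIOD_MIN  # 4

# With random deletes, assume each ingest commit conflicts with the
# deletion job's file set with probability 0.9 (assumed, for illustration).
p_overlap = 0.9
p_success = (1 - p_overlap) ** commits_per_attempt
print(f"{p_success:.6f}")  # about 0.0001 -> near-certain starvation
```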
+
+
+![Concurrency control conflicts](/assets/images/blog/concurrency/ConcurrencyControlConflicts.png)
+
+So, what's the alternative? Locking? Wikipedia also says: "_However, locking-based ("pessimistic") methods also can deliver poor performance because locking can drastically limit effective concurrency even when deadlocks are avoided._" Here is where Hudi takes a different approach, one we believe is more apt for modern lake transactions, which are typically long-running and even continuous. Data lake workloads share more characteristics with high-throughput stream processing jobs than with standard reads/writes from a database, and this is where we build strength. In stream processing, events are serialized into a single ordered log, avoiding any locks/concurrency bottlenecks, and you can continuously process millions of events/sec. Hudi implements a file-level, log-based concurrency control protocol on the Hudi [timeline](https://hudi.apache.org/docs/timeline), which in turn relies on bare-minimum atomic puts to cloud storage. By building on an event log as the central piece for inter-process coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC approaches.
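As a rough illustration of the difference, here is a toy sketch (hypothetical; not Hudi's actual timeline API) of file-level, log-based conflict checking: writers append commit "instants" to an ordered log, and only commits that landed after a writer began *and* touched the same files count as conflicts, so writers on disjoint files proceed concurrently.

```python
# Toy sketch, NOT Hudi's real implementation: an ordered log of commit
# instants, with conflicts checked per file rather than per table.

class Timeline:
    def __init__(self):
        self.clock = 0  # stand-in for monotonically increasing instant times
        self.log = []   # ordered event log: (instant, files_written)

    def _tick(self):
        self.clock += 1
        return self.clock

    def start(self):
        # A writer records the instant at which it began.
        return self._tick()

    def commit(self, begin_instant, files):
        # Only commits that landed AFTER this writer began AND touched the
        # same files are genuine conflicts; everything else proceeds.
        for instant, committed in self.log:
            if instant > begin_instant and committed & files:
                return False  # real overlap -> caller must resolve/retry
        self.log.append((self._tick(), set(files)))
        return True

tl = Timeline()
a = tl.start()
b = tl.start()
assert tl.commit(a, {"part-0001"})  # writer A commits first
assert tl.commit(b, {"part-0002"})  # disjoint files: B commits too, where
                                    # table-snapshot OCC would have aborted it
```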
Review comment:
> over OCC approaches that just track table snapshots?
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
+category: blog
+---
+
+Transactions on data lakes are now considered a key characteristic of a
Lakehouse these days. But what has actually been accomplished so far? What are
the current approaches? How do they fare in real-world scenarios? These
questions are the focus of this blog.
+
+<!--truncate-->
+
+Having had the good fortune of working on diverse database projects - an RDBMS
([Oracle](https://www.oracle.com/database/)), a NoSQL key-value store
([Voldemort](https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits)),
a streaming database
([ksqlDB](https://www.confluent.io/blog/ksqldb-pull-queries-high-availability/)),
a closed-source real-time datastore and of course, Apache Hudi, I can safely
say that the nature of workloads deeply influence the concurrency control
mechanisms adopted in different databases. This blog will also describe how we
rethought concurrency control for the data lake in Apache Hudi.
+
+First, let's set the record straight. RDBMS databases offer the richest set of
transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html)
and maintaining
[indexes](https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html).
NoSQL stores offer dramatically weaker guarantees like eventual-consistency
and simple row level atomicity in exchange for greater scalability for simpler
workloads. Drawing a similar parallel, traditional data warehouses offer more
or less the full set of capabilities that you would find in an RDBMS, over
columnar data, with locking and key constraints [enforce
d](https://docs.teradata.com/r/a8IdS6iVHR77Z9RrIkmMGg/wFPZS4jwZgSG21GnOIpEsw)
whereas cloud data warehouses seem to have focused a lot more on separating the
data and compute in architecture, while offering fewer isolation levels. As a
surprising example, [no
enforcement](https://docs.snowflake.com/en/sql-reference/constraints-overview.html#supported-constraint-types)
of key constraints!
+
+# Pitfalls in Lake Concurrency Control
+
+Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC jobs take a table level lock to check if they have impacted
overlapping files and if a conflict exists, they abort their operations
completely. Without naming names, the lock is sometimes even just a JVM level
lock held on a single Apache Spark driver node. Once again, this may be okay
for lightweight coordination of old school batch jobs that mostly append files
to tables, but cannot be applied broadly to modern data lake workloads. Such
approaches are built with immutable/append-only data models in mind, which are
inadequate for incremental data processing or keyed updates/deletes. OCC is
very optimistic that real contention never happens. Develo
per evangelism comparing OCC to the full fledged transactional capabilities of
an RDBMS or a traditional data warehouse is rather misinformed. Quoting
Wikipedia directly - "_if contention for data resources is frequent, the cost
of repeatedly restarting transactions hurts performance significantly, in which
case other_ [_concurrency
control_](https://en.wikipedia.org/wiki/Concurrency_control) _methods may be
better suited._ " When conflicts do occur, they can cause massive resource
wastage since you have a batch job that fails after it ran for a few hours,
during every attempt!
+
+Imagine a real-life scenario of two writer processes : an ingest writer job
producing new data every 30 minutes and a deletion writer job that is enforcing
GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap
files with random deletes, and the deletion job is almost guaranteed to starve
and fail to commit each time. In database speak, mixing long running
transactions with optimism leads to disappointment, since the longer the
transactions the higher the probability they will overlap.
+
+
+
static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png
+
+So, what's the alternative? Locking? Wikipedia also says - "_However,
locking-based ("pessimistic") methods also can deliver poor performance because
locking can drastically limit effective concurrency even when deadlocks are
avoided."._ Here is where Hudi takes a different approach, that we believe is
more apt for modern lake transactions which are typically long-running and even
continuous. Data lake workloads share more characteristics with high throughput
stream processing jobs, than they do to standard reads/writes from a database
and this is where we build strength. In stream processing events are serialized
into a single ordered log, avoiding any locks/concurrency bottlenecks and you
can continuously process millions of events/sec. Hudi implements a file level,
log based concurrency control protocol on the Hudi
[timeline](https://hudi.apache.org/docs/timeline), which in-turn relies on bare
minimum atomic puts to cloud storage. By building on an event log as the
central piece
for inter process coordination, Hudi is able to offer a few flexible
deployment models that offer greater concurrency over pure OCC approaches.
Review comment:
`where we build strength from` ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]