vinothchandar commented on a change in pull request #4344:
URL: https://github.com/apache/hudi/pull/4344#discussion_r770993997
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
Review comment:
I think it has to be "vinoth" to link to my user profile?
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
+category: blog
+---
+
+Transactions on data lakes are now considered a key characteristic of a Lakehouse. But what has actually been accomplished so far? What are the current approaches? How do they fare in real-world scenarios? These questions are the focus of this blog.
+
+<!--truncate-->
+
+Having had the good fortune of working on diverse database projects - an RDBMS
([Oracle](https://www.oracle.com/database/)), a NoSQL key-value store
([Voldemort](https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits)),
a streaming database
([ksqlDB](https://www.confluent.io/blog/ksqldb-pull-queries-high-availability/)),
a closed-source real-time datastore, and of course Apache Hudi, I can safely say that the nature of workloads deeply influences the concurrency control mechanisms adopted in different databases. This blog will also describe how we rethought concurrency control for the data lake in Apache Hudi.
+
+First, let's set the record straight. RDBMSes offer the richest set of transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html)
and maintaining
[indexes](https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html).
NoSQL stores offer dramatically weaker guarantees, like eventual consistency and simple row-level atomicity, in exchange for greater scalability on simpler workloads. Drawing a similar parallel, traditional data warehouses offer more or less the full set of capabilities you would find in an RDBMS, over columnar data, with locking and key constraints [enforced](https://docs.teradata.com/r/a8IdS6iVHR77Z9RrIkmMGg/wFPZS4jwZgSG21GnOIpEsw), whereas cloud data warehouses seem to have focused far more on separating storage and compute in their architecture, while offering fewer isolation levels. As a surprising example, there is [no enforcement](https://docs.snowflake.com/en/sql-reference/constraints-overview.html#supported-constraint-types) of key constraints!
+
+# Pitfalls in Lake Concurrency Control
+
+Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC, jobs take a table-level lock to check if they have impacted overlapping files and, if a conflict exists, they abort their operations completely. Without naming names, the lock is sometimes even just a JVM-level lock held on a single Apache Spark driver node. Once again, this may be okay for lightweight coordination of old-school batch jobs that mostly append files to tables, but it cannot be applied broadly to modern data lake workloads. Such approaches are built with immutable/append-only data models in mind, which are inadequate for incremental data processing or keyed updates/deletes. OCC is very optimistic that real contention never happens. Developer evangelism comparing OCC to the full-fledged transactional capabilities of an RDBMS or a traditional data warehouse is rather misinformed. Quoting Wikipedia directly: "_if contention for data resources is frequent, the cost of repeatedly restarting transactions hurts performance significantly, in which case other_ [_concurrency control_](https://en.wikipedia.org/wiki/Concurrency_control) _methods may be better suited._" When conflicts do occur, they can cause massive resource wastage, since a batch job that ran for a few hours fails at commit time, on every attempt!
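To make the abort-on-overlap behavior concrete, here is a minimal, hypothetical sketch of table-snapshot OCC (not any particular engine's implementation): each writer snapshots file versions when it begins and, at commit time under a table-level lock, aborts if anything it touched has changed.

```python
# Hypothetical sketch of table-snapshot OCC, for illustration only.

class ConflictError(Exception):
    pass

class OccTable:
    def __init__(self):
        self.file_versions = {}  # file name -> committed version number

    def begin(self):
        # Snapshot the state this transaction will validate against.
        return dict(self.file_versions)

    def commit(self, snapshot, touched_files):
        # Pretend the table-level lock is held for this whole method.
        for f in touched_files:
            if self.file_versions.get(f, 0) != snapshot.get(f, 0):
                # Any overlap aborts the WHOLE job, however long it ran.
                raise ConflictError(f"{f} changed since transaction began")
        for f in touched_files:
            self.file_versions[f] = self.file_versions.get(f, 0) + 1

table = OccTable()
snap_a, snap_b = table.begin(), table.begin()
table.commit(snap_a, ["part-0001.parquet"])      # writer A wins the race
try:
    table.commit(snap_b, ["part-0001.parquet"])  # writer B overlapped
except ConflictError:
    pass  # B's hours of work are thrown away and must be redone
```

Note that validation happens only at the very end, which is exactly why a long-running job pays the full cost of its work before discovering the conflict.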
+
+Imagine a real-life scenario of two writer processes: an ingest job producing new data every 30 minutes, and a deletion job enforcing GDPR that takes 2 hours to issue deletes. With random deletes, these are very likely to touch overlapping files, and the deletion job is almost guaranteed to starve and fail to commit each time. In database speak, mixing long-running transactions with optimism leads to disappointment, since the longer the transactions, the higher the probability that they will overlap.
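A back-of-the-envelope calculation shows why starvation is near-certain here; the per-commit overlap probability of 0.9 is an assumed, illustrative figure, not a measurement.

```python
# Rough sketch of the scenario above: a 2-hour deletion job must survive
# every ingest commit that lands while it runs, or redo all of its work.

INGEST_PERIOD_MIN = 30    # ingest writer commits every 30 minutes
DELETE_RUNTIME_MIN = 120  # deletion writer needs 2 hours per attempt

# Ingest commits that land during one attempt of the deletion job.
commits_per_attempt = DELETE_RUNTIME_MIN // INGEST_PERIOD_MIN  # 4

# With random deletes, assume each ingest commit conflicts with the
# deletion job's file set with probability 0.9 (assumed, for illustration).
p_overlap = 0.9
p_success = (1 - p_overlap) ** commits_per_attempt
print(f"{p_success:.6f}")  # about 0.0001 -> near-certain starvation
```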
+
+
+![Concurrency control conflicts](/assets/images/blog/concurrency/ConcurrencyControlConflicts.png)
+
+So, what's the alternative? Locking? Wikipedia also says: "_However, locking-based ("pessimistic") methods also can deliver poor performance because locking can drastically limit effective concurrency even when deadlocks are avoided._" Here is where Hudi takes a different approach, one we believe is more apt for modern lake transactions, which are typically long-running and even continuous. Data lake workloads share more characteristics with high-throughput stream processing jobs than with standard reads/writes from a database, and this is where we build strength. In stream processing, events are serialized into a single ordered log, avoiding any locks/concurrency bottlenecks, and you can continuously process millions of events/sec. Hudi implements a file-level, log-based concurrency control protocol on the Hudi [timeline](https://hudi.apache.org/docs/timeline), which in turn relies on bare-minimum atomic puts to cloud storage. By building on an event log as the central piece for inter-process coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC approaches.
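As a rough illustration of the difference, here is a toy sketch (hypothetical; not Hudi's actual timeline API) of file-level, log-based conflict checking: writers append commit "instants" to an ordered log, and only commits that landed after a writer began *and* touched the same files count as conflicts, so writers on disjoint files proceed concurrently.

```python
# Toy sketch, NOT Hudi's real implementation: an ordered log of commit
# instants, with conflicts checked per file rather than per table.

class Timeline:
    def __init__(self):
        self.clock = 0  # stand-in for monotonically increasing instant times
        self.log = []   # ordered event log: (instant, files_written)

    def _tick(self):
        self.clock += 1
        return self.clock

    def start(self):
        # A writer records the instant at which it began.
        return self._tick()

    def commit(self, begin_instant, files):
        # Only commits that landed AFTER this writer began AND touched the
        # same files are genuine conflicts; everything else proceeds.
        for instant, committed in self.log:
            if instant > begin_instant and committed & files:
                return False  # real overlap -> caller must resolve/retry
        self.log.append((self._tick(), set(files)))
        return True

tl = Timeline()
a = tl.start()
b = tl.start()
assert tl.commit(a, {"part-0001"})  # writer A commits first
assert tl.commit(b, {"part-0002"})  # disjoint files: B commits too, where
                                    # table-snapshot OCC would have aborted it
```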
Review comment:
> over OCC approaches that just track table snapshots?
##########
File path:
website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md
##########
@@ -0,0 +1,53 @@
+---
+title: "Lakehouse Concurrency Control: Are we too optimistic?"
+excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency
control mechanisms"
+author: Vinoth Chandar
+category: blog
+---
+
+Transactions on data lakes are now considered a key characteristic of a
Lakehouse these days. But what has actually been accomplished so far? What are
the current approaches? How do they fare in real-world scenarios? These
questions are the focus of this blog.
+
+<!--truncate-->
+
+Having had the good fortune of working on diverse database projects - an RDBMS
([Oracle](https://www.oracle.com/database/)), a NoSQL key-value store
([Voldemort](https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits)),
a streaming database
([ksqlDB](https://www.confluent.io/blog/ksqldb-pull-queries-high-availability/)),
a closed-source real-time datastore and of course, Apache Hudi, I can safely
say that the nature of workloads deeply influence the concurrency control
mechanisms adopted in different databases. This blog will also describe how we
rethought concurrency control for the data lake in Apache Hudi.
+
+First, let's set the record straight. RDBMS databases offer the richest set of
transactional capabilities and the widest array of concurrency control
[mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html).
Different isolation levels, fine grained locking, deadlock
detection/avoidance, and more are possible because they have to support
row-level mutations and reads across many tables while enforcing [key
constraints](https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html)
and maintaining
[indexes](https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html).
NoSQL stores offer dramatically weaker guarantees like eventual-consistency
and simple row level atomicity in exchange for greater scalability for simpler
workloads. Drawing a similar parallel, traditional data warehouses offer more
or less the full set of capabilities that you would find in an RDBMS, over
columnar data, with locking and key constraints [enforce
d](https://docs.teradata.com/r/a8IdS6iVHR77Z9RrIkmMGg/wFPZS4jwZgSG21GnOIpEsw)
whereas cloud data warehouses seem to have focused a lot more on separating the
data and compute in architecture, while offering fewer isolation levels. As a
surprising example, [no
enforcement](https://docs.snowflake.com/en/sql-reference/constraints-overview.html#supported-constraint-types)
of key constraints!
+
+# Pitfalls in Lake Concurrency Control
+
+Historically, data lakes have been viewed as batch jobs reading/writing files
on cloud storage and it's interesting to see how most new work extends this
view and implements glorified file version control using some form of
"[**Optimistic concurrency
control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)"
(OCC). With OCC jobs take a table level lock to check if they have impacted
overlapping files and if a conflict exists, they abort their operations
completely. Without naming names, the lock is sometimes even just a JVM level
lock held on a single Apache Spark driver node. Once again, this may be okay
for lightweight coordination of old school batch jobs that mostly append files
to tables, but cannot be applied broadly to modern data lake workloads. Such
approaches are built with immutable/append-only data models in mind, which are
inadequate for incremental data processing or keyed updates/deletes. OCC is
very optimistic that real contention never happens. Develo
per evangelism comparing OCC to the full fledged transactional capabilities of
an RDBMS or a traditional data warehouse is rather misinformed. Quoting
Wikipedia directly - "_if contention for data resources is frequent, the cost
of repeatedly restarting transactions hurts performance significantly, in which
case other_ [_concurrency
control_](https://en.wikipedia.org/wiki/Concurrency_control) _methods may be
better suited._ " When conflicts do occur, they can cause massive resource
wastage since you have a batch job that fails after it ran for a few hours,
during every attempt!
+
+Imagine a real-life scenario of two writer processes : an ingest writer job
producing new data every 30 minutes and a deletion writer job that is enforcing
GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap
files with random deletes, and the deletion job is almost guaranteed to starve
and fail to commit each time. In database speak, mixing long running
transactions with optimism leads to disappointment, since the longer the
transactions the higher the probability they will overlap.
+
+
+
static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png
+
+So, what's the alternative? Locking? Wikipedia also says - "_However,
locking-based ("pessimistic") methods also can deliver poor performance because
locking can drastically limit effective concurrency even when deadlocks are
avoided."._ Here is where Hudi takes a different approach, that we believe is
more apt for modern lake transactions which are typically long-running and even
continuous. Data lake workloads share more characteristics with high throughput
stream processing jobs, than they do to standard reads/writes from a database
and this is where we build strength. In stream processing events are serialized
into a single ordered log, avoiding any locks/concurrency bottlenecks and you
can continuously process millions of events/sec. Hudi implements a file level,
log based concurrency control protocol on the Hudi
[timeline](https://hudi.apache.org/docs/timeline), which in-turn relies on bare
minimum atomic puts to cloud storage. By building on an event log as the
central piece
for inter process coordination, Hudi is able to offer a few flexible
deployment models that offer greater concurrency over pure OCC approaches.
Review comment:
`where we build strength from` ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]