kywe665 commented on a change in pull request #4344: URL: https://github.com/apache/hudi/pull/4344#discussion_r770997494
########## File path: website/blog/2021-12-16-lakehouse-concurrency-control-are-we-too-optimistic.md ########## @@ -0,0 +1,53 @@ +--- +title: "Lakehouse Concurrency Control: Are we too optimistic?" +excerpt: "Vinoth Chandar, Creator of Apache Hudi, dives into concurrency control mechanisms" +author: Vinoth Chandar +category: blog +--- + +Transactions on data lakes are now considered a key characteristic of a Lakehouse these days. But what has actually been accomplished so far? What are the current approaches? How do they fare in real-world scenarios? These questions are the focus of this blog. + +<!--truncate--> + +Having had the good fortune of working on diverse database projects - an RDBMS ([Oracle](https://www.oracle.com/database/)), a NoSQL key-value store ([Voldemort](https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits)), a streaming database ([ksqlDB](https://www.confluent.io/blog/ksqldb-pull-queries-high-availability/)), a closed-source real-time datastore and of course, Apache Hudi, I can safely say that the nature of workloads deeply influence the concurrency control mechanisms adopted in different databases. This blog will also describe how we rethought concurrency control for the data lake in Apache Hudi. + +First, let's set the record straight. RDBMS databases offer the richest set of transactional capabilities and the widest array of concurrency control [mechanisms](https://dev.mysql.com/doc/refman/5.7/en/innodb-locking-transaction-model.html). Different isolation levels, fine grained locking, deadlock detection/avoidance, and more are possible because they have to support row-level mutations and reads across many tables while enforcing [key constraints](https://dev.mysql.com/doc/refman/8.0/en/create-table-foreign-keys.html) and maintaining [indexes](https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html). NoSQL stores offer dramatically weaker guarantees like eventual-consistency and simple row level atomicity in exchange for greater scalability for simpler workloads. Drawing a similar parallel, traditional data warehouses offer more or less the full set of capabilities that you would find in an RDBMS, over columnar data, with locking and key constraints [enforce d](https://docs.teradata.com/r/a8IdS6iVHR77Z9RrIkmMGg/wFPZS4jwZgSG21GnOIpEsw) whereas cloud data warehouses seem to have focused a lot more on separating the data and compute in architecture, while offering fewer isolation levels. As a surprising example, [no enforcement](https://docs.snowflake.com/en/sql-reference/constraints-overview.html#supported-constraint-types) of key constraints! + +# Pitfalls in Lake Concurrency Control + +Historically, data lakes have been viewed as batch jobs reading/writing files on cloud storage and it's interesting to see how most new work extends this view and implements glorified file version control using some form of "[**Optimistic concurrency control**](https://en.wikipedia.org/wiki/Optimistic_concurrency_control)" (OCC). With OCC jobs take a table level lock to check if they have impacted overlapping files and if a conflict exists, they abort their operations completely. Without naming names, the lock is sometimes even just a JVM level lock held on a single Apache Spark driver node. Once again, this may be okay for lightweight coordination of old school batch jobs that mostly append files to tables, but cannot be applied broadly to modern data lake workloads. Such approaches are built with immutable/append-only data models in mind, which are inadequate for incremental data processing or keyed updates/deletes. OCC is very optimistic that real contention never happens. Develo per evangelism comparing OCC to the full fledged transactional capabilities of an RDBMS or a traditional data warehouse is rather misinformed. Quoting Wikipedia directly - "_if contention for data resources is frequent, the cost of repeatedly restarting transactions hurts performance significantly, in which case other_ [_concurrency control_](https://en.wikipedia.org/wiki/Concurrency_control) _methods may be better suited._ " When conflicts do occur, they can cause massive resource wastage since you have a batch job that fails after it ran for a few hours, during every attempt! + +Imagine a real-life scenario of two writer processes : an ingest writer job producing new data every 30 minutes and a deletion writer job that is enforcing GDPR, taking 2 hours to issue deletes. It's very likely for these to overlap files with random deletes, and the deletion job is almost guaranteed to starve and fail to commit each time. In database speak, mixing long running transactions with optimism leads to disappointment, since the longer the transactions the higher the probability they will overlap. + + + static/assets/images/blog/concurrency/ConcurrencyControlConflicts.png + +So, what's the alternative? Locking? Wikipedia also says - "_However, locking-based ("pessimistic") methods also can deliver poor performance because locking can drastically limit effective concurrency even when deadlocks are avoided."._ Here is where Hudi takes a different approach, that we believe is more apt for modern lake transactions which are typically long-running and even continuous. Data lake workloads share more characteristics with high throughput stream processing jobs, than they do to standard reads/writes from a database and this is where we build strength. In stream processing events are serialized into a single ordered log, avoiding any locks/concurrency bottlenecks and you can continuously process millions of events/sec. Hudi implements a file level, log based concurrency control protocol on the Hudi [timeline](https://hudi.apache.org/docs/timeline), which in-turn relies on bare minimum atomic puts to cloud storage. By building on an event log as the central piece for inter process coordination, Hudi is able to offer a few flexible deployment models that offer greater concurrency over pure OCC approaches. Review comment: Cool, I reverted strength back to where it was and added the track table snapshots -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
