hemanth-gowda-12 opened a new issue, #7654: URL: https://github.com/apache/hudi/issues/7654
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I am trying to replicate a distributed system with a test that runs the Hudi Java Client in OCC mode: [link](https://github.com/hemanth-gowda-12/ApacheHudiOccTest/blob/main/occ/src/test/java/org/example/HudiOccTest.java)

With just 3 writers mimicking 3 distributed machines, the writers starve while waiting for locks. The performance doesn't seem practical the way I'm testing it, and I'm trying to understand how to optimize it. The starvation occurs with both the ZooKeeper and FS lock providers, but it is more prominent with ZooKeeper, since there are multiple requests for locks, which results in infinite starvation.

TL;DR: Run the test below. After a few writes, the client goes into a starvation phase, remains idle doing no work, and eventually fails with the following exception:

`org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object`

**To Reproduce**

Run the test [here](https://github.com/hemanth-gowda-12/ApacheHudiOccTest/blob/main/occ/src/test/java/org/example/HudiOccTest.java) and look at the logs and the `occ/tmp/hudiTest` directory for the test table.

Steps to reproduce the behavior:

1. Run the test as-is to reproduce the starvation with the FS lock provider.
2. To reproduce the ZooKeeper starvation scenario, comment out lines 151-156 and uncomment lines 160-168.
3. Delete the `occ/tmp` directory and re-run the test.
4. Install Docker and run `docker run -d --name zookeeper -p 2181:2181 jplock/zookeeper`.
5. The test will hang due to starvation after a few seconds of running. You can inspect the ZooKeeper locks that are held and never released, as shown below.
6. Download the ZooKeeper client and run `sh /opt/zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181`.
7. After the client connects, run `ls /test/test_table`.

**Expected behavior**

The test completes with reasonable performance. The test generates records with keys in the range 0-99, 10 times. Each partition should have 1 insert and 9 updates happening in parallel.

OCC mode should have reasonable performance with the Java Client, supporting high-throughput writes and updates.

**Environment Description**

* Hudi version : 0.12.2
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : Local FS
* Running on Docker? (yes/no) : No

**Additional context**

**Stacktrace**

The client runs for a while and then starves at this log point:

```
2023-01-12 00:59:03,814 [INFO ] ConnectionStateManager - State change: CONNECTED
2023-01-12 00:59:09,199 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 00:59:09,739 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:04,821 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,215 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,756 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:05,839 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,235 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,771 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:06,856 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,255 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,789 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:07,875 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,272 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,802 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
```

It eventually fails with the error `org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object`.
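
For readers trying to reproduce or tune this, here is a minimal sketch of the lock-related writer properties that the ZooKeeper scenario above appears to rely on. It is not taken from the linked test. The class `OccLockConfigSketch` and its method are hypothetical helpers, and the property keys are written as I understand them from the Hudi concurrency-control documentation, so please verify them against 0.12.2. The ZooKeeper host, port, base path, and lock key come from the commands and logs in this report. The wait and retry values are illustrative assumptions, not recommendations or Hudi defaults.

```java
import java.util.Properties;

// Hypothetical helper; not part of Hudi or of the linked HudiOccTest.
class OccLockConfigSketch {

    // Builds writer properties for OCC with the ZooKeeper-based lock provider.
    static Properties zookeeperLockProps() {
        Properties props = new Properties();

        // Enable multi-writer OCC; failed writes are cleaned up lazily.
        props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");

        // Lock provider and ZooKeeper coordinates (values taken from this report).
        props.setProperty("hoodie.write.lock.provider",
                "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        props.setProperty("hoodie.write.lock.zookeeper.url", "127.0.0.1");
        props.setProperty("hoodie.write.lock.zookeeper.port", "2181");
        props.setProperty("hoodie.write.lock.zookeeper.base_path", "/test");
        props.setProperty("hoodie.write.lock.zookeeper.lock_key", "test_table");

        // ASSUMED values, for illustration only: how long one acquire attempt
        // waits, how many attempts are made, and the pause between attempts.
        props.setProperty("hoodie.write.lock.wait_time_ms", "10000");
        props.setProperty("hoodie.write.lock.num_retries", "10");
        props.setProperty("hoodie.write.lock.wait_time_ms_between_retry", "500");

        return props;
    }

    public static void main(String[] args) {
        // Print the properties so they can be compared with the test's writer config.
        zookeeperLockProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Passing the same properties to all three writers keeps them contending on the same lock node (`/test/test_table`), which is the node inspected with `zkCli.sh` in the reproduction steps. Whether different wait and retry values actually relieve the starvation is exactly the open question of this issue.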
