hemanth-gowda-12 opened a new issue, #7654:
URL: https://github.com/apache/hudi/issues/7654

   **Describe the problem you faced**
   I am trying to simulate a distributed system with a test that runs the Hudi 
Java Client in OCC (optimistic concurrency control) mode.
   
[link](https://github.com/hemanth-gowda-12/ApacheHudiOccTest/blob/main/occ/src/test/java/org/example/HudiOccTest.java)
   
   With just 3 writers, mimicking 3 distributed machines, the writers starve 
while waiting for locks. The performance doesn't seem practical the way I'm 
testing it, and I'm trying to understand how to optimize it.
   The starvation occurs with both the ZooKeeper and FS lock providers, but it 
is more prominent with ZooKeeper, where multiple outstanding lock requests 
result in indefinite starvation.
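
For reference, the two lock setups being compared can be sketched as plain `java.util.Properties`. This is a minimal sketch, not the configuration used in the linked test: the keys are Hudi's lock configuration names as I understand them for 0.12.x, the ZooKeeper host and port assume a local instance on 2181, and the timeout and retry values are illustrative assumptions. The base path and lock key match the `/test` and `test_table` values that appear in the ZooKeeper log lines of this report.

```java
import java.util.Properties;

public class OccLockConfigSketch {

    // Settings shared by both lock providers: enable optimistic concurrency control.
    public static Properties occBase() {
        Properties p = new Properties();
        p.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        p.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
        // Lock wait and retry tuning. The values are assumptions for illustration only.
        p.setProperty("hoodie.write.lock.wait_time_ms", "60000");
        p.setProperty("hoodie.write.lock.num_retries", "15");
        p.setProperty("hoodie.write.lock.wait_time_ms_between_retry", "2000");
        return p;
    }

    // ZooKeeper-based lock provider, assuming a local ZooKeeper on port 2181.
    public static Properties zkLockProps() {
        Properties p = occBase();
        p.setProperty("hoodie.write.lock.provider",
                "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        p.setProperty("hoodie.write.lock.zookeeper.url", "127.0.0.1");
        p.setProperty("hoodie.write.lock.zookeeper.port", "2181");
        p.setProperty("hoodie.write.lock.zookeeper.base_path", "/test");
        p.setProperty("hoodie.write.lock.zookeeper.lock_key", "test_table");
        return p;
    }

    // File-system-based lock provider; no external service is needed.
    public static Properties fsLockProps() {
        Properties p = occBase();
        p.setProperty("hoodie.write.lock.provider",
                "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider");
        return p;
    }

    public static void main(String[] args) {
        // Print the ZooKeeper variant so the effective settings can be inspected.
        zkLockProps().list(System.out);
    }
}
```

Either `Properties` object would then be handed to the write config builder of the client under test. Raising the wait time or retry count only lengthens how long a writer waits before giving up; it does not by itself remove contention on the single table-level lock.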
   
   TL;DR: Run the test below. After a few writes, the client enters a 
starvation phase, sits idle doing no work, and eventually fails with the 
exception below:
   `org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object`
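
To illustrate how a failure of this kind can arise, here is a minimal sketch of bounded lock acquisition: a writer makes a fixed number of timed attempts and gives up when all of them time out. This is not Hudi's lock provider code. `java.util.concurrent.Semaphore` stands in for the single table-level lock, and the class name, method name, retry count, and wait time are all hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedLockRetrySketch {

    // Tries to take the single permit, waiting up to waitMs per attempt,
    // for at most (1 + retries) attempts. Returns true if the lock was acquired.
    public static boolean acquireWithRetries(Semaphore lock, int retries, long waitMs)
            throws InterruptedException {
        for (int attempt = 0; attempt <= retries; attempt++) {
            if (lock.tryAcquire(waitMs, TimeUnit.MILLISECONDS)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        Semaphore tableLock = new Semaphore(1); // stand-in for one table-level lock

        // Simulate a writer that holds the lock and does not release it.
        tableLock.acquire();
        boolean acquired = acquireWithRetries(tableLock, 3, 50);
        System.out.println(acquired ? "acquired" : "unable to acquire lock after retries");

        // Once the holder releases, the next attempt succeeds.
        tableLock.release();
        System.out.println(acquireWithRetries(tableLock, 3, 50) ? "acquired" : "still blocked");
    }
}
```

In this model a writer fails only after exhausting every attempt, which would match a client that sits idle for a long stretch before the exception surfaces.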
   
   **To Reproduce**
   Run the test 
[here](https://github.com/hemanth-gowda-12/ApacheHudiOccTest/blob/main/occ/src/test/java/org/example/HudiOccTest.java)
 and look at the logs and the occ/tmp/hudiTest dir for the test table.
   
   Steps to reproduce the behavior:
   1. Just run the test to reproduce the starvation using the FS lock provider.
   2. To reproduce the ZooKeeper starvation scenario, comment out lines 151-156 
and uncomment lines 160-168.
   3. Delete the occ/tmp directory and re-run the test
   4. Install Docker and run `docker run -d --name zookeeper -p 2181:2181 jplock/zookeeper`
   5. The test will hang due to starvation after a few seconds of running. You 
can inspect the ZooKeeper locks that are held and never released, as described 
in the next two steps.
   6. Download the ZooKeeper client and run `sh /opt/zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181`
   7. After the client connects, do `ls /test/test_table`
   
   **Expected behavior**
   The test completes with reasonable performance. It generates records with 
keys in the range 0-99, 10 times over, so each partition should see 1 insert 
and 9 updates happening in parallel.
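
The expected write mix can be checked with a small counting sketch. It assumes that every round writes each key in 0-99 exactly once and that the first write of a key is an insert while later writes are updates; the class and method names are hypothetical and no Hudi code is involved.

```java
import java.util.HashSet;
import java.util.Set;

public class WorkloadShapeSketch {

    // Returns {inserts, updates} for `rounds` passes over keys 0..keyCount-1,
    // treating a key's first write as an insert and later writes as updates.
    public static int[] countWrites(int keyCount, int rounds) {
        Set<Integer> seen = new HashSet<>();
        int inserts = 0;
        int updates = 0;
        for (int round = 0; round < rounds; round++) {
            for (int key = 0; key < keyCount; key++) {
                if (seen.add(key)) {
                    inserts++;   // key not written before
                } else {
                    updates++;   // key already exists
                }
            }
        }
        return new int[] {inserts, updates};
    }

    public static void main(String[] args) {
        int[] counts = countWrites(100, 10);
        // prints "inserts=100 updates=900"
        System.out.println("inserts=" + counts[0] + " updates=" + counts[1]);
    }
}
```

Per key this gives 1 insert and 9 updates, which is the ratio described above.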
   
   OCC mode should perform reasonably with the Java Client so that it can 
support high-throughput writes and updates.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Local FS
   
   * Running on Docker? (yes/no) : No
   
   
   **Stacktrace**
   
   The client runs for a while and then starves at this point in the log:
   `2023-01-12 00:59:03,814 [INFO  ] ConnectionStateManager - State change: CONNECTED
   2023-01-12 00:59:09,199 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 00:59:09,739 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:00:04,821 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:00:10,215 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:00:10,756 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:01:05,839 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:01:11,235 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:01:11,771 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:02:06,856 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:02:12,255 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:02:12,789 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:03:07,875 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:03:13,272 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
   2023-01-12 01:03:13,802 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table`
   
   It eventually fails with this error:
   `org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
