atharvai opened a new issue, #6226:
URL: https://github.com/apache/hudi/issues/6226

   **Describe the problem you faced**
   
   I’m running several workloads in production, and one is a parallel add of 
partitions to a COW Hudi table. I’m managing OCC with DynamoDB, and the 
partition key in DynamoDB is the table name. I’m finding that each parallel 
instance waits for a lock and is blocked even though the partitions being 
updated are different. This compounds as the number of parallel writes/jobs 
increases, and you see things like the screenshot below, where each subsequent 
job takes one minute longer because it is blocked on the lock. (Each Spark job 
takes about 1-2 minutes to run and then waits on the lock until the previous 
job completes, so the majority of the 1 hr duration is just waiting for the 
lock.)
   
   First question: is this designed/intended behaviour?
   Second question: should I be using the table partition key as the lock 
partition key? Currently, per the docs, we use only the table name, not the 
table partition, for the lock.
   
   Environment: Hudi 0.11.0, EMR 6.6.0, Spark 3.2.0
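
   For reference, this is roughly how the DynamoDB-based locking described 
above is configured in Hudi 0.11.x. This is a hedged sketch: the table name, 
partition-key value, and region below are illustrative assumptions, not taken 
from the issue.

```python
# Hedged sketch of the Hudi 0.11.x DynamoDB lock-provider configuration
# described in this issue. Concrete values (lock table, partition key,
# region) are illustrative assumptions.
lock_opts = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",        # illustrative
    "hoodie.write.lock.dynamodb.partition_key": "my_table",  # table name, per docs
    "hoodie.write.lock.dynamodb.region": "us-east-1",        # illustrative
}
```

   Because the DynamoDB partition key is the table name, every writer to the 
same table contends for one lock item, regardless of which Hudi partitions it 
touches.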
   
   
![image](https://user-images.githubusercontent.com/14027470/181298670-fcd51d0a-ba1e-4b24-ae04-932c9dc09615.png)
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Write a Spark job with OCC enabled that writes data to a table on S3.
   2. Run multiple instances of the job with different data being ingested 
into **different** partitions; this can be append-only to new partitions.
   3. The more concurrent jobs and the more data you have, the longer the 
locks are held by each job, and newer jobs wait longer for locks.
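
   The repro above can be sketched as follows. This is a minimal sketch, not 
the actual job: the table name, partition field, and S3 path are illustrative 
assumptions; only the per-job OCC options are shown.

```python
# Hedged sketch of the reproduction: several concurrent Spark jobs, each
# appending to a *different* partition of the same COW table with OCC enabled.
# All names (table, field, path) are illustrative assumptions.

def occ_write_options(table_name: str, partition_field: str) -> dict:
    """Per-job Hudi write options with optimistic concurrency control on."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "insert",
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
    }

# Each concurrent job would then run something like (PySpark, not executed here):
# df.write.format("hudi").options(**occ_write_options("my_table", "dt")) \
#     .mode("append").save("s3://bucket/warehouse/my_table")
```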
   
   
   **Expected behavior**
   
   The lock should be held only briefly when ingestion affects unrelated 
partitions.
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.2.0
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no ([AWS EMR 
6.6.0](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-660-release.html))
   
   
   **Additional context**
   
   Why would the lock object be null? The default timeout is 60 s, but this 
sometimes happens after 20 min, or sometimes after 1 hr.
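
   One possible explanation for the long delay before failure, sketched below: 
if lock acquisition is retried, the worst-case wait is a multiple of the 
per-attempt timeout. The retry count and gap below are illustrative 
assumptions, not Hudi's actual defaults.

```python
# Hedged arithmetic sketch: why a job can block far longer than the 60 s
# acquire timeout before throwing HoodieLockException. If acquisition is
# retried (num_retries attempts, each waiting up to wait_time_ms), the
# worst-case wait multiplies. Retry values here are illustrative assumptions.
wait_time_ms = 60_000   # per-attempt acquire timeout mentioned in the issue
num_retries = 20        # illustrative retry count
retry_gap_ms = 5_000    # illustrative pause between attempts

worst_case_min = num_retries * (wait_time_ms + retry_gap_ms) / 60_000
print(worst_case_min)  # roughly 21.7 minutes of blocking
```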
   
   **Stacktrace**
   
   ```
   ERROR Client: Application diagnostics message: User class threw exception: 
org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock 
object null
        at 
org.apache.hudi.client.transaction.lock.LockManager.lock(LockManager.java:82)
        at 
org.apache.hudi.client.transaction.TransactionManager.beginTransaction(TransactionManager.java:53)
        at 
org.apache.hudi.client.BaseHoodieWriteClient.commitStats(BaseHoodieWriteClient.java:230)
        at 
org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:122)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:650)
        at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:313)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
        at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
        at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
        at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   ```
   
   
   

