vinothchandar commented on code in PR #8562:
URL: https://github.com/apache/hudi/pull/8562#discussion_r1176546145
##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
+Hive [lock manager](https://github.com/apache/hive/blob/954bb49da611b13e689a6922538f54306004c676/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2935)
+, then `hoodie.write.lock.zookeeper.url` and `hoodie.write.lock.zookeeper.port` should also be configured to point to
+the Zookeeper instance. If it is already configured in your Hadoop configuration (`hdfs-site.xml`), then additional
+configuration is not required.
-**`Amazon DynamoDB`** based lock provider
+:::note
+While using `HiveMetastoreBasedLockProvider`, if the pipeline crashed while the lock was acquired, Hive does not
+automatically remove the lock entry from the table. In this case, Hudi writer will simply abort due to transaction timeout.
+One can follow the [debugging tips](https://cwiki.apache.org/confluence/display/hive/locking#Locking-Debugging) provided
+by Hive to show currently acquired locks and remove tne entry from the `HIVE_LOCKS` table in the underlying RDBMS of
````

Review Comment: rename: RDBMS -> database

##########
website/docs/concurrency_control.md
##########

@@ -49,7 +49,7 @@
````diff
 hoodie.write.lock.provider=<lock-provider-classname>
 There are 4 different lock providers that require different configurations to be set.
-**`FileSystem`** based lock provider
+**FileSystem** based lock provider
 FileSystem based lock provider supports multiple writers cross different jobs/applications based on atomic
 create/delete operations of the underlying filesystem.
````

Review Comment: lets clarify where this works and where it does not?

##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
````

Review Comment: do we need the link to Hive code? there ain't no locks?
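Aside on the TOCTOU point: the FileSystem-based lock provider quoted above depends on the filesystem offering an atomic "create if absent" primitive — a writer must fail in the same step that would create the lock file, rather than checking for the file and then creating it. A minimal sketch of that pattern (illustrative only, with made-up function names and lock path; this is not Hudi's actual implementation, and whether the underlying storage — e.g. an object store — actually provides atomic create-if-absent varies):

```python
import os
import tempfile


def try_acquire_lock(lock_path: str) -> bool:
    """Atomically create the lock file; return False if it already exists.

    O_CREAT | O_EXCL makes "create only if absent" a single atomic step,
    so there is no check-then-create (TOCTOU) window in which two writers
    can both believe they hold the lock.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        # Another writer created the lock first; this writer must back off.
        return False


def release_lock(lock_path: str) -> None:
    os.remove(lock_path)
```

With this pattern, the second of two contending writers fails to acquire until the first releases — which is exactly the "one writer should fail if the file was already created" behavior the review asks for, and it holds only if the filesystem implements create-if-absent atomically.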
##########
website/docs/concurrency_control.md
##########

@@ -49,7 +49,7 @@
````diff
 hoodie.write.lock.provider=<lock-provider-classname>
 There are 4 different lock providers that require different configurations to be set.
-**`FileSystem`** based lock provider
+**FileSystem** based lock provider
 FileSystem based lock provider supports multiple writers cross different jobs/applications based on atomic
 create/delete operations of the underlying filesystem.
````

Review Comment: Just atomic create operations is not sufficient to avoid TOCTOU issues. One writer should fail if file was already created.

##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
+Hive [lock manager](https://github.com/apache/hive/blob/954bb49da611b13e689a6922538f54306004c676/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2935)
+, then `hoodie.write.lock.zookeeper.url` and `hoodie.write.lock.zookeeper.port` should also be configured to point to
+the Zookeeper instance. If it is already configured in your Hadoop configuration (`hdfs-site.xml`), then additional
+configuration is not required.
-**`Amazon DynamoDB`** based lock provider
+:::note
+While using `HiveMetastoreBasedLockProvider`, if the pipeline crashed while the lock was acquired, Hive does not
````

Review Comment: rename: pipeline -> writer

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
