vinothchandar commented on code in PR #8562:
URL: https://github.com/apache/hudi/pull/8562#discussion_r1176546145
##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
+Hive [lock manager](https://github.com/apache/hive/blob/954bb49da611b13e689a6922538f54306004c676/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2935)
+, then `hoodie.write.lock.zookeeper.url` and `hoodie.write.lock.zookeeper.port` should also be configured to point to
+the Zookeeper instance. If it is already configured in your Hadoop configuration (`hdfs-site.xml`), then additional
+configuration is not required.
-**`Amazon DynamoDB`** based lock provider
+:::note
+While using `HiveMetastoreBasedLockProvider`, if the pipeline crashed while the lock was acquired, Hive does not
+automatically remove the lock entry from the table. In this case, Hudi writer will simply abort due to transaction timeout.
+One can follow the [debugging tips](https://cwiki.apache.org/confluence/display/hive/locking#Locking-Debugging) provided
+by Hive to show currently acquired locks and remove tne entry from the `HIVE_LOCKS` table in the underlying RDBMS of
````

Review Comment: rename: RDBMS -> database

##########
website/docs/concurrency_control.md
##########

@@ -49,7 +49,7 @@
````diff
 hoodie.write.lock.provider=<lock-provider-classname>
 There are 4 different lock providers that require different configurations to be set.
-**`FileSystem`** based lock provider
+**FileSystem** based lock provider
 FileSystem based lock provider supports multiple writers cross different jobs/applications based on atomic
 create/delete operations of the underlying filesystem.
````

Review Comment: lets clarify where this works and where it does not?

##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
````

Review Comment: do we need the link to Hive code? there ain't no locks?
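Aside on the TOCTOU point: the FileSystem-based lock provider quoted above depends on the filesystem offering an atomic "create if absent" primitive — a writer must fail in the same step that would create the lock file, rather than checking for the file and then creating it. A minimal sketch of that pattern (illustrative only, with made-up function names and lock path; this is not Hudi's actual implementation, and whether the underlying storage — e.g. an object store — actually provides atomic create-if-absent varies):

```python
import os
import tempfile


def try_acquire_lock(lock_path: str) -> bool:
    """Atomically create the lock file; return False if it already exists.

    O_CREAT | O_EXCL makes "create only if absent" a single atomic step,
    so there is no check-then-create (TOCTOU) window in which two writers
    can both believe they hold the lock.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        # Another writer created the lock first; this writer must back off.
        return False


def release_lock(lock_path: str) -> None:
    os.remove(lock_path)
```

With this pattern, the second of two contending writers fails to acquire until the first releases — which is exactly the "one writer should fail if the file was already created" behavior the review asks for, and it holds only if the filesystem implements create-if-absent atomically.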
##########
website/docs/concurrency_control.md
##########

@@ -49,7 +49,7 @@
````diff
 hoodie.write.lock.provider=<lock-provider-classname>
 There are 4 different lock providers that require different configurations to be set.
-**`FileSystem`** based lock provider
+**FileSystem** based lock provider
 FileSystem based lock provider supports multiple writers cross different jobs/applications based on atomic
 create/delete operations of the underlying filesystem.
````

Review Comment: Just atomic create operations is not sufficient to avoid TOCTOU issues. One writer should fail if file was already created.

##########
website/docs/concurrency_control.md
##########

@@ -77,17 +77,31 @@
````diff
 hoodie.write.lock.zookeeper.lock_key
 hoodie.write.lock.zookeeper.base_path
 ```
-**`HiveMetastore`** based lock provider
+**Hive Metastore** based lock provider
+`HiveMetastoreBasedLockProvider` uses the underlying Hive locks to support concurrency control.
 ```
 hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
-hoodie.write.lock.hivemetastore.database
-hoodie.write.lock.hivemetastore.table
+hoodie.write.lock.hivemetastore.database=test_db
+hoodie.write.lock.hivemetastore.table=test_table
 ```
-`The HiveMetastore URI's are picked up from the hadoop configuration file loaded during runtime.`
+HiveMetastore URIs, if not explicitly provided, are picked up from the hadoop configuration file (`hive-site.xml`)
+loaded during runtime. Note that if Zookeeper is being used as the
+Hive [lock manager](https://github.com/apache/hive/blob/954bb49da611b13e689a6922538f54306004c676/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2935)
+, then `hoodie.write.lock.zookeeper.url` and `hoodie.write.lock.zookeeper.port` should also be configured to point to
+the Zookeeper instance. If it is already configured in your Hadoop configuration (`hdfs-site.xml`), then additional
+configuration is not required.
-**`Amazon DynamoDB`** based lock provider
+:::note
+While using `HiveMetastoreBasedLockProvider`, if the pipeline crashed while the lock was acquired, Hive does not
````

Review Comment: rename: pipeline -> writer

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
