n3nash commented on pull request #2374: URL: https://github.com/apache/hudi/pull/2374#issuecomment-797797580
@vinothchandar Addressed your feedback comments. Some more things to note: 1. Added explicit dependency on `curator` to control the versions, earlier brought in through transitive dependency. NOTE that the latest stable version for curator is 4.X but we have to use curator version 2.X since 4.X is not backwards compatible with Zookeeper 2.x and 3.X while curator 2.X is backwards and forward compatible. The latest stable version of curator brings in many new features which are not applicable to us, so it should be OK to use 2.X 2. Added Spark Datasource based nodes to HoodieTestSuite and added a test case to run through optimistic locking enabled with Spark datasource. NOTE that we can only test using `ZookeeperBasedLockProvider` for tests and cannot use `HiveMetastoreBasedLockProvider`. This is because embedded hive metastore only allows one Hive client. When we use spark, spark's HiveExternalCatalog already creates an internal hive client, when hudi tries to create a new hive client through the `HiveMetastoreBasedLockProvider`, it throws underlying derby related exceptions. Some workarounds are to not use derby and are mentioned here -> https://issues.apache.org/jira/browse/HIVE-21302 but this requires a change to the underlying choice of DB for hive metastore itself. Hence, for now, we can only use ZK locks for in-memory unit tests. Ran it against production setup of HiveMetastore to validate this is only a unit test problem. 3. Added TestSimpleConcurrentFileWritesConflictResolutionStrategy to ensure the right candidate streams are always used. 4. Added another section to quickstart on how to enable optimistic locks. This PR -> https://github.com/apache/hudi/pull/2660 has the details. Validated this by turning on INFO logs on docker to confirm that locking works. Some warnings ( I will add these to the docs when I send the PR for documentation) 1. `DeltaStreamer` cannot work with multi-writer. Currently we copy checkpoints from the previous commit to the next - this breaks when we perform multi-writing. 2. Configs for retries and timeouts for locking can vary because of many factors : response times of file system APIs, number of commits to check conflict resolution against etc. We need to document that folks have to choose sane timeouts and retries for their use-case if defaults don't work. 3. Incremental pull using TimelineAPI's will not work for clients. There will be a follow up PR for this in progress. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
