yihua commented on code in PR #7912:
URL: https://github.com/apache/hudi/pull/7912#discussion_r1128765984
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -93,6 +98,7 @@ public class HoodieArchivalConfig extends HoodieConfig {
.withDocumentation("When enable, hoodie will auto merge several small
archive files into larger one. It's"
+ " useful when storage scheme doesn't support append operation.");
+ // Ethan: Once this is well tested, we should remove the feature flag
Review Comment:
@nsivabalan could we remove this config now?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -108,6 +110,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Partitions to run clustering");
+ // Ethan: enum instead of class name to determine strategy type? So this
can be engine-agnostic
Review Comment:
Tracked in HUDI-5892
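For illustration, a rough sketch of an enum-keyed, engine-agnostic strategy
config, reusing the ConfigProperty builder pattern quoted in these diffs; the
ClusteringStrategyType enum, key name, and default are hypothetical, not
existing Hudi API:

  // Hypothetical enum; each engine maps a value to its own implementation
  // class, so users no longer pass an engine-specific class name.
  public enum ClusteringStrategyType {
    SIZE_BASED, SORT_BASED
  }

  public static final ConfigProperty<String> PLAN_STRATEGY_TYPE = ConfigProperty
      .key("hoodie.clustering.plan.strategy.type")
      .defaultValue(ClusteringStrategyType.SIZE_BASED.name())
      .withDocumentation("Type of clustering plan strategy, resolved to an"
          + " engine-specific implementation at runtime.");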
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE =
ConfigProperty
.key("hoodie.cleaner.parallelism")
.defaultValue("200")
.withDocumentation("Parallelism for the cleaning operation. Increase
this if cleaning becomes slow.");
+ // Ethan: wondering if this is still needed.
Review Comment:
@nsivabalan do you think we still need this?
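As a rough sketch of the auto-derivation idea (method and parameter names here
are hypothetical, not Hudi API), the parallelism could be bounded by both the
amount of work and the engine's own parallelism instead of a static 200:

  // Derive cleaner parallelism at runtime instead of using a fixed default.
  static int deriveCleanerParallelism(int numFilesToClean, int engineDefaultParallelism) {
    // Never schedule more tasks than there are files to clean, and never
    // exceed what the engine (e.g. Spark's default parallelism) can run.
    return Math.max(1, Math.min(numFilesToClean, engineDefaultParallelism));
  }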
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -201,6 +210,8 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Determines how to handle updates, deletes to file
groups that are under clustering."
+ " Default strategy just rejects the update");
+ // Ethan: these configs of inline, async, schedule, execute can be merged
into one mode config:
Review Comment:
Tracked in HUDI-5896
##########
hudi-aws/src/main/java/org/apache/hudi/config/HoodieAWSConfig.java:
##########
@@ -41,20 +42,23 @@
*/
@Immutable
@ConfigClassProperty(name = "Amazon Web Services Configs",
- groupName = ConfigGroups.Names.AWS,
- description = "Amazon Web Services configurations to access resources
like Amazon DynamoDB (for locks),"
- + " Amazon CloudWatch (metrics).")
+ groupName = ConfigGroups.Names.AWS,
+ description = "Amazon Web Services configurations to access resources like
Amazon DynamoDB (for locks),"
+ + " Amazon CloudWatch (metrics).")
public class HoodieAWSConfig extends HoodieConfig {
+ // AWS Access Configs
+
+ // Ethan: These configs should only be necessary if the corresponding
environment variables are not set (to confirm with code)
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -196,6 +222,7 @@ public class HoodieCompactionConfig extends HoodieConfig {
.withDocumentation("New optimized scan for log blocks that handles all
multi-writer use-cases while appending to log files. "
+ "It also differentiates original blocks written by ingestion
writers and compacted blocks written log compaction.");
+ // Ethan: These deprecated configs should be removed. Similar in other
classes
Review Comment:
Leave these alone for 0.x.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Class to use for reading the bootstrap dataset
partitions/files, for Bootstrap mode FULL_RECORD");
+ // Ethan: can this be aligned with key gen for write (no need for a separate
bootstrap key gen config)?
public static final ConfigProperty<String> KEYGEN_CLASS_NAME = ConfigProperty
.key("hoodie.bootstrap.keygen.class")
.noDefaultValue()
.sinceVersion("0.6.0")
.withDocumentation("Key generator implementation to be used for
generating keys from the bootstrapped dataset");
+ // Ethan: similar here
public static final ConfigProperty<String> KEYGEN_TYPE = ConfigProperty
.key("hoodie.bootstrap.keygen.type")
.defaultValue(KeyGeneratorType.SIMPLE.name())
.sinceVersion("0.9.0")
.withDocumentation("Type of build-in key generator, currently support
SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
+ // Ethan: better default translator: if seeing hive-style partitioning,
automatically parse the partition value?
public static final ConfigProperty<String>
PARTITION_PATH_TRANSLATOR_CLASS_NAME = ConfigProperty
.key("hoodie.bootstrap.partitionpath.translator.class")
.defaultValue(IdentityBootstrapPartitionPathTranslator.class.getName())
.sinceVersion("0.6.0")
.withDocumentation("Translates the partition paths from the bootstrapped
data into how is laid out as a Hudi table.");
+ // Ethan: can this be aligned with Spark parallelism?
Review Comment:
Tracked in HUDI-5894
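On the "better default translator" idea above, a minimal sketch (hypothetical
helper, not an existing Hudi translator) of detecting hive-style partitioning
and keeping only the partition values:

  // "country=US/dt=2023-01-01" -> "US/2023-01-01";
  // non hive-style segments pass through unchanged.
  static String translateHiveStylePath(String bootstrapPartitionPath) {
    StringBuilder translated = new StringBuilder();
    for (String segment : bootstrapPartitionPath.split("/")) {
      int eq = segment.indexOf('=');
      if (translated.length() > 0) {
        translated.append('/');
      }
      translated.append(eq >= 0 ? segment.substring(eq + 1) : segment);
    }
    return translated.toString();
  }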
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "Consistent hashing supports dynamic resizing of the number of
bucket, solving potential data skew and file size "
+ "issues of the SIMPLE hashing engine. Consistent hashing only
works with MOR tables, only use simple hashing on COW tables.");
+ // Ethan: is the default too large?
Review Comment:
@yuzhaojing @minihippo could you confirm if the default is too large for a
medium-sized table?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -77,6 +82,7 @@ public class HoodieMemoryConfig extends HoodieConfig {
.defaultValue(16 * 1024 * 1024)
.withDocumentation("Property to control the max memory in bytes for dfs
input stream buffer size");
+ // Ethan: auto configured using table base path?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -135,6 +141,7 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("For Hive based lock provider, the Hive metastore URI
to acquire locks against.");
+ // Ethan: set this automatically based on the table base path?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -279,6 +288,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.defaultValue("0")
.withDocumentation("Parallelism used for “delete” operation. Delete
operations also performs shuffles, similar to upsert operation.");
+ // Ethan: does this need to be automatically set based on the input like
other parallelism?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java:
##########
@@ -65,6 +66,7 @@ public class HoodieWriteCommitCallbackConfig extends
HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Http callback API key.
hudi_write_commit_http_callback by default");
+ // Ethan: default is too low? 30s instead?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -329,18 +341,22 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Write status objects hold metadata about a write
(stats, errors), that is not yet committed to storage. "
+ "This controls the how that information is cached for inspection
by clients. We rarely expect this to be changed.");
+ // Ethan: this should be an internal config.
public static final ConfigProperty<String> AUTO_COMMIT_ENABLE =
ConfigProperty
.key("hoodie.auto.commit")
.defaultValue("true")
.withDocumentation("Controls whether a write operation should auto
commit. This can be turned off to perform inspection"
+ " of the uncommitted write before deciding to commit.");
+ // Ethan: we never talk about this in docs :)
public static final ConfigProperty<String> WRITE_STATUS_CLASS_NAME =
ConfigProperty
.key("hoodie.writestatus.class")
.defaultValue(WriteStatus.class.getName())
.withDocumentation("Subclass of " + WriteStatus.class.getName() + " to
be used to collect information about a write. Can be "
+ "overridden to collection additional metrics/statistics about the
data if needed.");
+ // Ethan: similarly, we should revisit all parallelism and see if they can
be auto-configured
+ // instead of a static default.
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -289,6 +299,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.defaultValue(String.valueOf(4 * 1024 * 1024))
.withDocumentation("Size of in-memory buffer used for parallelizing
network reads and lake storage writes.");
+ // Ethan: is 1024B too small as the default?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -501,6 +524,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on
the table and exactly one of them succeed "
+ "if a conflict (writes affect the same file group) is detected.");
+ // Ethan: need to check why we need both "hoodie.avro.schema" and
"hoodie.write.schema".
Review Comment:
Will leave the schema configs alone for now.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
public static final ConfigProperty<Boolean> AUTO_ADJUST_LOCK_CONFIGS =
ConfigProperty
.key("hoodie.auto.adjust.lock.configs")
.defaultValue(false)
.sinceVersion("0.11.0")
.withDocumentation("Auto adjust lock configurations when metadata table
is enabled and for async table services.");
+ // Ethan: should we remove this and make the validation mandatory?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -375,6 +391,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.9.0")
.withDocumentation("The batch interval in milliseconds for marker
creation batch processing");
+ // Ethan: similar here for revisiting parallelism
Review Comment:
Tracked in HUDI-5894
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -40,6 +40,7 @@
+ "This table maintains the metadata about a given Hudi table (e.g
file listings) "
+ " to avoid overhead of accessing cloud storage, during queries.")
public final class HoodieMetadataConfig extends HoodieConfig {
+ //cfg
Review Comment:
Tracked in HUDI-5900
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -162,6 +163,7 @@ public class HoodieTableConfig extends HoodieConfig {
.noDefaultValue()
.withDocumentation("Version of timeline used, by the table.");
+ // Ethan: to be merged with write payload class
Review Comment:
Tracked in HUDI-5892
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity
units when using PROVISIONED billing mode");
+ // Ethan: 10 min. Is this timeout too long?
Review Comment:
Tracked in HUDI-5782
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity
units when using PROVISIONED billing mode");
+ // Ethan: 10 min. Is this timeout too long?
public static final ConfigProperty<String>
DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT = ConfigProperty
.key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table_creation_timeout")
.defaultValue(String.valueOf(10 * 60 * 1000))
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, the maximum number
of milliseconds to wait for creating DynamoDB table");
+ // Ethan: Move the infer logic here
Review Comment:
Tracked in HUDI-5892
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -40,10 +40,13 @@
+ " between writers to a Hudi table. Concurrency between Hudi's own
table services "
+ " are auto managed internally.")
public class DynamoDbBasedLockConfig extends HoodieConfig {
+ // Config for DynamoDB based lock provider
// configs for DynamoDb based locks
public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX
+ "dynamodb.";
+ // Ethan: Can we provide a default value here that is less likely to collide,
+ // so this config also becomes optional?
Review Comment:
Tracked in HUDI-5782
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -78,6 +81,7 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.withDocumentation("For DynamoDB based lock provider, the region used in
endpoint for Amazon DynamoDB service."
+ " Would try to first get it from AWS_REGION
environment variable. If not find, by default use us-east-1");
+ // Ethan: Need to confirm these are good defaults for cost
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
groupName = ConfigGroups.Names.WRITE_CLIENT,
description = "Configurations that control archival.")
public class HoodieArchivalConfig extends HoodieConfig {
+ //cfg
+ // Ethan: Should we remove this? Archive should always be enabled?
public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
.key("hoodie.archive.automatic")
.defaultValue("true")
.withDocumentation("When enabled, the archival table service is invoked
immediately after each commit,"
+ " to archive commits if we cross a maximum value of commits."
+ " It's recommended to enable this, to ensure number of active
commits is bounded.");
+ // Ethan: Is Async archive tested?
Review Comment:
Essential/advanced config tracked in HUDI-5893.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
+ " keep the metadata overhead constant, even as the table size
grows."
+ " This config controls the maximum number of instants to retain in
the active timeline. ");
+ // Ethan: Can we use Spark parallelism or auto-derive the configs and remove
this?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
groupName = ConfigGroups.Names.WRITE_CLIENT,
description = "Configurations that control archival.")
public class HoodieArchivalConfig extends HoodieConfig {
+ //cfg
+ // Ethan: Should we remove this? Archive should always be enabled?
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
+ " keep the metadata overhead constant, even as the table size
grows."
+ " This config controls the maximum number of instants to retain in
the active timeline. ");
+ // Ethan: Can we use Spark parallelism or auto-derive the configs and remove
this?
public static final ConfigProperty<Integer>
DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
.key("hoodie.archive.delete.parallelism")
.defaultValue(100)
.withDocumentation("Parallelism for deleting archived hoodie commits.");
+ // Ethan: pick one of "hoodie.keep.min.commits" and
"hoodie.keep.max.commits" only?
Review Comment:
Agree. We can leave this as is.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Class to use for reading the bootstrap dataset
partitions/files, for Bootstrap mode FULL_RECORD");
+ // Ethan: can this be aligned with key gen for write (no need for a separate
bootstrap key gen config)?
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -58,12 +61,16 @@ public class HoodieCleanConfig extends HoodieConfig {
.withDocumentation("Only applies when " + AUTO_CLEAN.key() + " is turned
on. "
+ "When turned on runs cleaner async with writing, which can speed
up overall write performance.");
+ // Ethan: when either "hoodie.cleaner.commits.retained",
"hoodie.cleaner.hours.retained",
Review Comment:
Yes. Tracked in HUDI-5892.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -84,11 +91,13 @@ public class HoodieCleanConfig extends HoodieConfig {
.withDocumentation("Controls how cleaning is scheduled. Valid options: "
+
Arrays.stream(CleaningTriggerStrategy.values()).map(Enum::name).collect(Collectors.joining(",")));
+ // Ethan: "hoodie.clean.trigger.max.commits" is what this means
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -46,6 +46,7 @@
+ "The bootstrap operation can flexibly avoid copying data over before
you can use Hudi and support running the existing "
+ " writers and new hudi writers in parallel, to validate the
migration.")
public class HoodieBootstrapConfig extends HoodieConfig {
+ // Configs for Bootstrap feature
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -41,10 +42,12 @@
*/
@Immutable
@ConfigClassProperty(name = "Clean Configs",
- groupName = ConfigGroups.Names.WRITE_CLIENT,
- description = "Cleaning (reclamation of older/unused file
groups/slices).")
+ groupName = ConfigGroups.Names.WRITE_CLIENT,
+ description = "Cleaning (reclamation of older/unused file groups/slices).")
public class HoodieCleanConfig extends HoodieConfig {
+ // Configs for Clean action
+ // Ethan: should we standardize on `.enable` instead of `.automatic`?
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE =
ConfigProperty
.key("hoodie.cleaner.parallelism")
.defaultValue("200")
.withDocumentation("Parallelism for the cleaning operation. Increase
this if cleaning becomes slow.");
+ // Ethan: wondering if this is still needed.
public static final ConfigProperty<Boolean> ALLOW_MULTIPLE_CLEANS =
ConfigProperty
.key("hoodie.clean.allow.multiple")
.defaultValue(true)
.sinceVersion("0.11.0")
.withDocumentation("Allows scheduling/executing multiple cleans by
enabling this config. If users prefer to strictly ensure clean requests should
be mutually exclusive, "
+ ".i.e. a 2nd clean will not be scheduled if another clean is not
yet completed to avoid repeat cleaning of same files, they might want to
disable this config.");
+ // Ethan: should the default behavior be deleting original parquet files and
then we can remove this config?
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "DAY_ROLLING: clustering partitions on a rolling basis by the hour
to avoid clustering all partitions each time, "
+ "which strategy sorts the partitions asc and chooses the partition
of which index is divided by 24 and the remainder is equal to the current
hour.");
+ // Ethan: all these fine-tuning knobs can be simplified if we have a config
to indicate the purpose:
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -131,18 +134,21 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Turn on inline clustering - clustering will be run
after each write operation is complete")
.withAlternatives("hoodie.datasource.clustering.inline.enable");
+ // Ethan: this can apply to both inline and async clustering. Rename it for
better readability.
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -47,6 +47,7 @@
description = "Configurations that control the clustering table service in
hudi, "
+ "which optimizes the storage layout for better query performance by
sorting and sizing data files.")
public class HoodieClusteringConfig extends HoodieConfig {
+ // Configs for Clustering operations
Review Comment:
Tracked in HUDI-5897
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -218,6 +229,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Enable running of clustering service, asynchronously
as inserts happen on the table.")
.withAlternatives("hoodie.datasource.clustering.async.enable");
+ // Ethan: this should be removed. User should not tweak this to corrupt the
hoodie_commit_time meta field.
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
description = "Configurations that control compaction "
+ "(merging of log files onto a new base files).")
public class HoodieCompactionConfig extends HoodieConfig {
-
+ // Configs for compaction
+
+ // Ethan: Similar to clustering, the compaction configs should be
straightened up to be more concise
+ // Something more intuitive like:
+ // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to
schedule compaction plan
+ // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute
compaction plan
+ // // compaction parameters should be independent of inline/async
+ // hoodie.compact.trigger.strategy
+ // hoodie.compact.max.delta.seconds
+ // hoodie.compact.max.delta.commits
+ // ...
+ // instead of many configs that overlap with each other
+ // hoodie.compact.schedule.inline
+ // hoodie.compact.inline
+ // hoodie.datasource.compaction.async.enable
+ // // these are reused for inline/async
+ // hoodie.compact.inline.trigger.strategy
+ // hoodie.compact.inline.max.delta.seconds
+ // hoodie.compact.inline.max.delta.commits
public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
.key("hoodie.compact.inline")
.defaultValue("false")
Review Comment:
Tracked in HUDI-5896
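To make the proposal concrete, a rough sketch of the first suggested key,
following the ConfigProperty pattern used throughout these classes (the key,
enum and default are the comment's proposal, not a shipped config):

  // Hypothetical mode enum for the consolidated scheduling config.
  public enum CompactionRunMode {
    INLINE, ASYNC, DISABLED
  }

  public static final ConfigProperty<String> COMPACT_SCHEDULE_MODE = ConfigProperty
      .key("hoodie.compact.schedule.mode")
      .defaultValue(CompactionRunMode.DISABLED.name())
      .withDocumentation("How the compaction plan is scheduled: INLINE, ASYNC or DISABLED."
          + " Intended to replace the overlapping hoodie.compact.schedule.inline,"
          + " hoodie.compact.inline and hoodie.datasource.compaction.async.enable flags.");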
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
.withDocumentation("Only applies if index type is BUCKET. Determine the
number of buckets in the hudi table, "
+ "and each partition is divided to N buckets.");
+ // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets":
e.g., min=0.5*num, max=4*num?
Review Comment:
Tracked in HUDI-5782
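For reference, the suggested derivation works out as below (a sketch under the
0.5x/4x factors from the comment, not a shipped default):

  // Derive min/max bucket defaults from hoodie.bucket.index.num.buckets.
  static int defaultMinBuckets(int numBuckets) {
    return Math.max(1, numBuckets / 2);  // min = 0.5 * num
  }

  static int defaultMaxBuckets(int numBuckets) {
    return 4 * numBuckets;               // max = 4 * num
  }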
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
description = "Configurations that control compaction "
+ "(merging of log files onto a new base files).")
public class HoodieCompactionConfig extends HoodieConfig {
-
+ // Configs for compaction
+
+ // Ethan: Similar to clustering, the compaction configs should be
straightened up to be more concise
+ // Something more intuitive like:
+ // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to
schedule compaction plan
+ // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute
compaction plan
+ // // compaction parameters should be independent of inline/async
+ // hoodie.compact.trigger.strategy
+ // hoodie.compact.max.delta.seconds
+ // hoodie.compact.max.delta.commits
+ // ...
+ // instead of many configs that overlap with each other
Review Comment:
Tracked in HUDI-5898
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -175,6 +198,9 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ " record sizes. It's recommended to keep this turned on, since
hand tuning is otherwise extremely"
+ " cumbersome.");
+ // Ethan: Should this be removed? This should only be used by the first
batch. For first batch,
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
.withDocumentation("Only applies if index type is BUCKET. Determine the
number of buckets in the hudi table, "
+ "and each partition is divided to N buckets.");
+ // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets":
e.g., min=0.5*num, max=4*num?
Review Comment:
@yuzhaojing @minihippo any suggestion on the default value here?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -132,13 +151,15 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "compaction during each compaction run. By default. Hudi picks the
log file "
+ "with most accumulated unmerged data");
+ // Ethan: this seems to be internal
public static final ConfigProperty<String> COMPACTION_LAZY_BLOCK_READ_ENABLE
= ConfigProperty
.key("hoodie.compaction.lazy.block.read")
.defaultValue("true")
.withDocumentation("When merging the delta log files, this config helps
to choose whether the log blocks "
+ "should be read lazily or not. Choose true to use lazy block
reading (low memory usage, but incurs seeks to each block"
+ " header) or false for immediate block read (higher memory
usage)");
+ // Ethan: this seems to be internal
Review Comment:
The reverse log traversal was introduced by #331. @n3nash could you provide
context around this? Is it still useful? How stable has it been in any
previous production rollout?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieHBaseIndexConfig.java:
##########
@@ -36,6 +36,7 @@
+ "(when HBase based indexing is enabled), which tags incoming "
+ "records as either inserts or updates to older records.")
public class HoodieHBaseIndexConfig extends HoodieConfig {
+ // Configs for HBase Index
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -177,6 +179,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "When true, the incoming writes will cached to speed up index
lookup by reducing IO "
+ "for computing parallelism or affected partitions");
+ // Ethan: should the parallelism below be dynamically determined by Spark?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
* Configs/params used for internal purposes.
*/
public class HoodieInternalConfig extends HoodieConfig {
+ // Some internal configs (?)
private static final long serialVersionUID = 0L;
+ // Ethan: why do we need this? BulkInsertPartitioner can already indicate
`arePartitionRecordsSorted()`
public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED =
"hoodie.bulkinsert.are.partitioner.records.sorted";
public static final Boolean
DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
+ // Ethan: why do we need this?
Review Comment:
@nsivabalan I see this config was introduced by #2328 for bulk insert, which
you reviewed. Do you know if this is still needed?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java:
##########
@@ -38,13 +38,15 @@
groupName = ConfigGroups.Names.SPARK_DATASOURCE,
description = "The following set of configurations help validate new data
before commits.")
public class HoodiePreCommitValidatorConfig extends HoodieConfig {
+ // Configs for precommit validator
public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES =
ConfigProperty
.key("hoodie.precommit.validators")
.defaultValue("")
.withDocumentation("Comma separated list of class names that can be
invoked to validate commit");
public static final String VALIDATOR_TABLE_VARIABLE = "<TABLE_NAME>";
+ // Ethan: it seems that these SQL query configs are not used? Also, could
we have one config for all SQL queries?
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -177,6 +181,9 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "this queue has no need for additional memory and cpu resources
due to lock or multithreading, but also lost some benefits such as speed limit.
"
+ "Although DISRUPTOR is still experimental.");
+ // Ethan: Should this be automatically determined based on the partition and
record key column(s)?
+ // e.g., single column -> simple key gen, multiple columns -> complex key gen
Review Comment:
Tracked in HUDI-5892
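A rough sketch of the auto-selection idea (the helper is hypothetical; the
KeyGeneratorType values are the ones quoted elsewhere in this review):

  // Infer the key generator type from the shape of the record key field config.
  static KeyGeneratorType inferKeyGeneratorType(String recordKeyFields) {
    // A comma-separated list implies a composite key -> COMPLEX;
    // a single field -> SIMPLE.
    return recordKeyFields.contains(",") ? KeyGeneratorType.COMPLEX : KeyGeneratorType.SIMPLE;
  }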
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -449,6 +469,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Table services such as compaction and clustering can
fail and prevent syncing to "
+ "the metaclient. Set this to true to fail writes when table
services fail");
+ // Ethan: do we still need all of the consistency check logic given major
cloud storage are now strongly consistent?
Review Comment:
Tracked in HUDI-5899
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -154,6 +161,7 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("Timeout in ms, to wait for establishing connection
with Zookeeper.");
+ // Ethan: remove ZK_PORT so that ZK_CONNECT_URL also includes the port
number?
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When true, spins up an instance of the timeline
server (meta server that serves cached file listings, statistics),"
+ "running on each writer's driver process, accepting requests
during the write from executors.");
+ // Ethan: is this still useful?
Review Comment:
Tracked in HUDI-5892 for now
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLayoutConfig.java:
##########
@@ -37,12 +38,15 @@
description = "Configurations that control storage layout and data
distribution, "
+ "which defines how the files are organized within a table.")
public class HoodieLayoutConfig extends HoodieConfig {
+ // Configs for Hudi storage layer
+ // Ethan: should this be a table config? I assume this cannot be changed
once determined
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -139,6 +141,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Field used in preCombining before actual write. When
two records have the same key value, "
+ "we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)");
+ // Ethan: regarding the namespace, should we use "hoodie.write." instead of
"hoodie.datasource.write." as the prefix?
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -102,6 +111,7 @@ public class HoodieCleanConfig extends HoodieConfig {
+ " in the timeline, since the last cleaner run. This is much more
efficient than obtaining listings for the full"
+ " table for each planning (even with a metadata table).");
+ // Ethan: this should be an internal config, based on concurrency control
mode. Should not be tweaked by user.
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -111,11 +116,12 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("For DFS based lock providers, path to store the
locks under. use Table's meta path as default");
+ // Ethan: should the default also allow expiration? Need to dig into the
code to understand what this controls.
public static final ConfigProperty<Integer> FILESYSTEM_LOCK_EXPIRE =
ConfigProperty
- .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
- .defaultValue(0)
- .sinceVersion("0.12.0")
- .withDocumentation("For DFS based lock providers, expire time in
minutes, must be a nonnegative number, default means no expire");
+ .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
+ .defaultValue(0)
+ .sinceVersion("0.12.0")
+ .withDocumentation("For DFS based lock providers, expire time in
minutes, must be a nonnegative number, default means no expire");
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
* Configs/params used for internal purposes.
*/
public class HoodieInternalConfig extends HoodieConfig {
+ // Some internal configs (?)
private static final long serialVersionUID = 0L;
+ // Ethan: why do we need this? BulkInsertPartitioner can already indicate
`arePartitionRecordsSorted()`
public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED =
"hoodie.bulkinsert.are.partitioner.records.sorted";
public static final Boolean
DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
+ // Ethan: why do we need this?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -62,11 +64,14 @@ public class HoodieMemoryConfig extends HoodieConfig {
// Minimum memory size (100MB) for the spillable map.
public static final long DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES = 100
* 1024 * 1024L;
+ // Ethan: Should this be automatically configured as a fraction of memory
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
Review Comment:
Tracked in HUDI-5892 for now
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When true, spins up an instance of the timeline
server (meta server that serves cached file listings, statistics),"
+ "running on each writer's driver process, accepting requests
during the write from executors.");
+ // Ethan: is this still useful?
Review Comment:
@bvaradar should we remove all relevant logic on reusing the timeline server?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
public static final ConfigProperty<String>
MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
.key("hoodie.merge.allow.duplicate.on.inserts")
.defaultValue("false")
.withDocumentation("When enabled, we allow duplicate keys even if
inserts are routed to merge with an existing file (for ensuring file sizing)."
+ " This is only relevant for insert operation, since upsert, delete
operations will ensure unique key constraints are maintained.");
+ // Ethan: should we make the default 0, to reduce the write amplification
and make MOR writes fast?
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -212,6 +220,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "Hudi stores all the main meta-data about commits, savepoints,
cleaning audit logs "
+ "etc in .hoodie directory under this base path directory.");
+ // Ethan: is this still necessary?
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
Review Comment:
@nsivabalan let's discuss this too
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
Review Comment:
I'll go through the code. @nsivabalan do you have any concern around this?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -66,14 +66,18 @@
areCommonConfigs = true,
description = "")
public class HoodieLockConfig extends HoodieConfig {
+ // Configs for common and legacy lock providers
+ // Ethan: nit: this is the initial delay
public static final ConfigProperty<String>
LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS = ConfigProperty
.key(LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY)
.defaultValue(DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS)
.sinceVersion("0.8.0")
.withDocumentation("Initial amount of time to wait between retries to
acquire locks, "
+ " subsequent retries will exponentially backoff.");
+ // Ethan: should the default be larger, like 16s? We already have exponential
backoff, something like 1 -> 2 -> 4 -> 8 -> 16 -> 16 ...
+ // Currently, the retry count is 15. The change increases the retry period
from ~1min to ~4min.
Review Comment:
Tracked in HUDI-5782
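The timing in the comment roughly checks out; a small worked sketch
(standalone helper, assuming a 1s initial wait doubling up to a 16s cap over
15 retries, as suggested):

  // Sum the waits of a capped exponential backoff: 1s, 2s, 4s, 8s, then 16s flat.
  static long totalBackoffMillis(long initialMs, long capMs, int retries) {
    long total = 0;
    long wait = initialMs;
    for (int i = 0; i < retries; i++) {
      total += wait;
      wait = Math.min(wait * 2, capMs);
    }
    return total;
  }
  // totalBackoffMillis(1000, 16000, 15) == 191000 ms, just over 3 minutes,
  // consistent with the ~1min -> ~4min change described above.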
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "**Note** This is being actively worked on. Please use "
+ "`hoodie.datasource.write.keygenerator.class` instead.");
+ // Ethan: looks like we can remove this as it is mature?
Review Comment:
Tracked in HUDI-5782. Likely we'll leave this alone.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -53,12 +56,13 @@ public class HoodiePayloadConfig extends HoodieConfig {
.withDocumentation("Table column/field name to derive timestamp
associated with the records. This can"
+ "be useful for e.g, determining the freshness of the table.");
+ // Ethan: remove this and align with the write payload class
("hoodie.datasource.write.payload.class")?
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When inserted records share same key, controls
whether they should be first combined (i.e de-duplicated) before"
+ " writing to storage.");
+ // Ethan: should we have one "hoodie.combine.before.write" for all change
operations instead of one config per operation?
Review Comment:
Sounds good. Keep them the same.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
description = "Payload related configs, that can be leveraged to "
+ "control merges based on specific business fields in the data.")
public class HoodiePayloadConfig extends HoodieConfig {
+ // Configs for Hudi payload class
+ // Ethan: shall we use preCombine
("hoodie.datasource.write.precombine.field") only and remove this?
+ // It serves the same purpose as the precombine field. This confuses users
too.
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "Consistent hashing supports dynamic resizing of the number of
bucket, solving potential data skew and file size "
+ "issues of the SIMPLE hashing engine. Consistent hashing only
works with MOR tables, only use simple hashing on COW tables.");
+ // Ethan: is the default too large?
Review Comment:
Tracked in HUDI-5782