[GitHub] [hudi] vinothchandar commented on a diff in pull request #7912: [HUDI-5739] Review Hudi Config Defaults

via GitHub Wed, 01 Mar 2023 17:26:05 -0800


vinothchandar commented on code in PR #7912:
URL: https://github.com/apache/hudi/pull/7912#discussion_r1122450879



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -40,10 +40,13 @@
         + " between writers to a Hudi table. Concurrency between Hudi's own 
table services "
         + " are auto managed internally.")
 public class DynamoDbBasedLockConfig extends HoodieConfig {
+  // Config for DynamoDB based lock provider
 
   // configs for DynamoDb based locks
   public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX 
+ "dynamodb.";
 
+  // Ethan: Can we provide a default value here that is less likely to collide,
+  // So this config also becomes optional?

Review Comment:
   the lock provider should do use sth like `hudi_locks` as the default table 
name? then the user can override if needed.



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity 
units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?

Review Comment:
   shorten to 2?



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity 
units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?
   public static final ConfigProperty<String> 
DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT = ConfigProperty
       .key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table_creation_timeout")
       .defaultValue(String.valueOf(10 * 60 * 1000))
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, the maximum number 
of milliseconds to wait for creating DynamoDB table");
 
+  // Ethan: Move the infer logic here

Review Comment:
   +1 for sane default



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?
   public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
       .key("hoodie.archive.automatic")
       .defaultValue("true")
       .withDocumentation("When enabled, the archival table service is invoked 
immediately after each commit,"
           + " to archive commits if we cross a maximum value of commits."
           + " It's recommended to enable this, to ensure number of active 
commits is bounded.");
 
+  // Ethan: Is Async archive tested?

Review Comment:
   We should ensure this is safe.. for now advanced config



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size 
grows."
           + " This config controls the maximum number of instants to retain in 
the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove 
this?
   public static final ConfigProperty<Integer> 
DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.archive.delete.parallelism")
       .defaultValue(100)
       .withDocumentation("Parallelism for deleting archived hoodie commits.");
 
+  // Ethan: pick one of "hoodie.keep.min.commits" and 
"hoodie.keep.max.commits" only?

Review Comment:
   min/max was added to batch up the archived files. we leave this alone IMO>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?

Review Comment:
   should this become an unsafe/advanced config instead?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -47,6 +47,7 @@
     description = "Configurations that control the clustering table service in 
hudi, "
         + "which optimizes the storage layout for better query performance by 
sorting and sizing data files.")
 public class HoodieClusteringConfig extends HoodieConfig {
+  // Configs for Clustering operations

Review Comment:
   revisit all docs here, and make it very user friendly



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -102,6 +111,7 @@ public class HoodieCleanConfig extends HoodieConfig {
           + " in the timeline, since the last cleaner run. This is much more 
efficient than obtaining listings for the full"
           + " table for each planning (even with a metadata table).");
 
+  // Ethan: this should be an internal config, based on concurrency control 
mode.  Should not be tweaked by user.

Review Comment:
   make this advanced and move on.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour 
to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition 
of which index is divided by 24 and the remainder is equal to the current 
hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config 
to indicate the purpose:

Review Comment:
   we should simplify this. esp the default. IIRC we only cluster within the 
group? so picking group size is important?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be 
straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to 
schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute 
compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other
+  // hoodie.compact.schedule.inline
+  // hoodie.compact.inline
+  // hoodie.datasource.compaction.async.enable
+  // // these are reused for inline/async
+  // hoodie.compact.inline.trigger.strategy
+  // hoodie.compact.inline.max.delta.seconds
+  // hoodie.compact.inline.max.delta.commits
   public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
       .key("hoodie.compact.inline")
       .defaultValue("false")

Review Comment:
   why is default false? surprised me.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size 
grows."
           + " This config controls the maximum number of instants to retain in 
the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove 
this?

Review Comment:
   idk if there is a way to do it more intelligently, workload based. i.e based 
on number of isntants deleted or sth?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -132,13 +151,15 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "compaction during each compaction run. By default. Hudi picks the 
log file "
           + "with most accumulated unmerged data");
 
+  // Ethan: this seems to be internal
   public static final ConfigProperty<String> COMPACTION_LAZY_BLOCK_READ_ENABLE 
= ConfigProperty
       .key("hoodie.compaction.lazy.block.read")
       .defaultValue("true")
       .withDocumentation("When merging the delta log files, this config helps 
to choose whether the log blocks "
           + "should be read lazily or not. Choose true to use lazy block 
reading (low memory usage, but incurs seeks to each block"
           + " header) or false for immediate block read (higher memory 
usage)");
 
+  // Ethan: this seems to be internal

Review Comment:
   who uses this code path?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -175,6 +198,9 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + " record sizes. It's recommended to keep this turned on, since 
hand tuning is otherwise extremely"
           + " cumbersome.");
 
+  // Ethan: Should this be removed?  This should only be used by the first 
batch.  For first batch,

Review Comment:
   dont we have this in writecofnig as well. I also don't like that this is COW 
specific



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -78,6 +81,7 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .withDocumentation("For DynamoDB based lock provider, the region used in 
endpoint for Amazon DynamoDB service."
                          + " Would try to first get it from AWS_REGION 
environment variable. If not find, by default use us-east-1");
 
+  // Ethan: Need to confirm these are good defaults for cost

Review Comment:
   I think this is fine.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Class to use for reading the bootstrap dataset 
partitions/files, for Bootstrap mode FULL_RECORD");
 
+  // Ethan: can this be aligned with key gen for write (no need for a separate 
bootstrap key gen config)?

Review Comment:
   +1 sane defaults



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -99,6 +103,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("Maximum number of times to retry to acquire lock 
additionally from the lock manager.");
 
+  // Ethan: is this too large? what about 30s?

Review Comment:
   idk



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -111,11 +116,12 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("For DFS based lock providers, path to store the 
locks under. use Table's meta path as default");
 
+  // Ethan: should the default also allow expiration? Need to dig into the 
code to understand what this controls.
   public static final ConfigProperty<Integer> FILESYSTEM_LOCK_EXPIRE = 
ConfigProperty
-        .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
-        .defaultValue(0)
-        .sinceVersion("0.12.0")
-        .withDocumentation("For DFS based lock providers, expire time in 
minutes, must be a nonnegative number, default means no expire");
+      .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
+      .defaultValue(0)
+      .sinceVersion("0.12.0")
+      .withDocumentation("For DFS based lock providers, expire time in 
minutes, must be a nonnegative number, default means no expire");

Review Comment:
   DFS lock providers, real?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -139,6 +141,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Field used in preCombining before actual write. When 
two records have the same key value, "
           + "we will pick the one with the largest value for the precombine 
field, determined by Object.compareTo(..)");
 
+  // Ethan: regarding the namespace, should we use "hoodie.write." instead of 
"hoodie.datasource.write." as the prefix?

Review Comment:
   payload should be common? lets see how the new merger stuff is named and fix 
there? idk how painful this migration will be. But agree on `hoodie.write..` 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
     description = "Payload related configs, that can be leveraged to "
         + "control merges based on specific business fields in the data.")
 public class HoodiePayloadConfig extends HoodieConfig {
+  // Configs for Hudi payload class
 
+  // Ethan: shall we use preCombine 
("hoodie.datasource.write.precombine.field") only and remove this?
+  // It acts as the same purpose as the precombine field.  This confuses user 
too.

Review Comment:
   what about callers of RDDClient e.g DeltaStreamer. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -46,6 +46,7 @@
         + "The bootstrap operation can flexibly avoid copying data over before 
you can use Hudi and support running the existing "
         + " writers and new hudi writers in parallel, to validate the 
migration.")
 public class HoodieBootstrapConfig extends HoodieConfig {
+  // Configs for Bootstrap feature

Review Comment:
   a lot of them may overlap with write configs. good low hanging fruits here.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -62,11 +64,14 @@ public class HoodieMemoryConfig extends HoodieConfig {
   // Minimum memory size (100MB) for the spillable map.
   public static final long DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES = 100 
* 1024 * 1024L;
 
+  // Ethan: Should this be automatically configured as a fraction of memory

Review Comment:
   leave alone?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -84,11 +91,13 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Controls how cleaning is scheduled. Valid options: "
           + 
Arrays.stream(CleaningTriggerStrategy.values()).map(Enum::name).collect(Collectors.joining(",")));
 
+  // Ethan: "hoodie.clean.trigger.max.commits" is what this means

Review Comment:
   +1 again on aligining with how we talk about scheduling and executing table 
services. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
           + "failed writes to re-claim space. Choose to perform this rollback 
of failed writes eagerly before "
           + "every writer starts (only supported for single writer) or lazily 
by the cleaner (required for multi-writers)");
 
+  // Ethan: same here.  Can we reuse Spark parallelism or derive automatically?
   public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE = 
ConfigProperty
       .key("hoodie.cleaner.parallelism")
       .defaultValue("200")
       .withDocumentation("Parallelism for the cleaning operation. Increase 
this if cleaning becomes slow.");
 
+  // Ethan: wondering if this is still needed.
   public static final ConfigProperty<Boolean> ALLOW_MULTIPLE_CLEANS = 
ConfigProperty
       .key("hoodie.clean.allow.multiple")
       .defaultValue(true)
       .sinceVersion("0.11.0")
       .withDocumentation("Allows scheduling/executing multiple cleans by 
enabling this config. If users prefer to strictly ensure clean requests should 
be mutually exclusive, "
           + ".i.e. a 2nd clean will not be scheduled if another clean is not 
yet completed to avoid repeat cleaning of same files, they might want to 
disable this config.");
 
+  // Ethan: should the default behavior be deleting original parquet files and 
then we can remove this config?

Review Comment:
   we should ask Udit, why this is the default



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -41,10 +42,12 @@
  */
 @Immutable
 @ConfigClassProperty(name = "Clean Configs",
-        groupName = ConfigGroups.Names.WRITE_CLIENT,
-        description = "Cleaning (reclamation of older/unused file 
groups/slices).")
+    groupName = ConfigGroups.Names.WRITE_CLIENT,
+    description = "Cleaning (reclamation of older/unused file groups/slices).")
 public class HoodieCleanConfig extends HoodieConfig {
+  // Configs for Clean action
 
+  // Ethan: should we standardize on `.enable` instead of `.automatic`?

Review Comment:
   +1 as long as its backwards compatible.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -131,18 +134,21 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Turn on inline clustering - clustering will be run 
after each write operation is complete")
       .withAlternatives("hoodie.datasource.clustering.inline.enable");
 
+  // Ethan: this can apply to both inline or aysnc clustering.  Rename it for 
better readability.

Review Comment:
   +1



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -58,12 +61,16 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Only applies when " + AUTO_CLEAN.key() + " is turned 
on. "
           + "When turned on runs cleaner async with writing, which can speed 
up overall write performance.");
 
+  // Ethan: when either "hoodie.cleaner.commits.retained", 
"hoodie.cleaner.hours.retained",

Review Comment:
   so, we infer the cleaning policy and throw errors if multiple of them are 
set?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -53,12 +56,13 @@ public class HoodiePayloadConfig extends HoodieConfig {
       .withDocumentation("Table column/field name to derive timestamp 
associated with the records. This can"
           + "be useful for e.g, determining the freshness of the table.");
 
+  // Ethan: remove this and align with the write payload class 
("hoodie.datasource.write.payload.class")?

Review Comment:
   yes we should infer from table props?>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  // Ethan: look like we can remove this as it is mature?

Review Comment:
   same for mdt as well?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When inserted records share same key, controls 
whether they should be first combined (i.e de-duplicated) before"
           + " writing to storage.");
 
+  // Ethan: should we have one "hoodie.combine.before.write" for all change 
operations instead of one config per operation?

Review Comment:
   write will include insert?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When true, spins up an instance of the timeline 
server (meta server that serves cached file listings, statistics),"
           + "running on each writer's driver process, accepting requests 
during the write from executors.");
 
+  // Ethan: is this still useful?

Review Comment:
   reuse is not that stable/provabley?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -218,6 +229,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Enable running of clustering service, asynchronously 
as inserts happen on the table.")
       .withAlternatives("hoodie.datasource.clustering.async.enable");
 
+  // Ethan: this should be removed.  User should not tweak this to corrupt the 
hoodie_commit_time meta field.

Review Comment:
   user should not have control. we should preserve commit_time for internal 
write operations (compaction, clustering.. ) . Think about any historical debt, 
before making final call.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -90,6 +91,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("End partition used to filter partition (inclusive), 
only effective when the filter mode '"
           + PLAN_PARTITION_FILTER_MODE + "' is " + 
ClusteringPlanPartitionFilterMode.SELECTED_PARTITIONS.name());
 
+  // Ethan: Should this be aligned with default parquet base file size?

Review Comment:
   we could infer this from base file size. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed 
during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this 
one intends to control as well?

Review Comment:
   logically makes sense. but I would check the code for side effects.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be 
straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to 
schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute 
compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other

Review Comment:
   whatever we do here, we do across the board for all services? We need to 
factor in the TMS rfcs as well



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -449,6 +469,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Table services such as compaction and clustering can 
fail and prevent syncing to "
           + "the metaclient. Set this to true to fail writes when table 
services fail");
 
+  // Ethan: do we still need all of the consistency check logic given major 
cloud storage are now strongly consistent?

Review Comment:
   leave it in, but make sure its not affecting code paths by default?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -177,6 +179,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "When true, the incoming writes will cached to speed up index 
lookup by reducing IO "
           + "for computing parallelism or affected partitions");
 
+  // Ethan: should the parallelim below be dynamically determined by Spark?

Review Comment:
   probably makes sense to do this. but wondering if the index implementations 
will already do this.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -329,18 +341,22 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Write status objects hold metadata about a write 
(stats, errors), that is not yet committed to storage. "
           + "This controls the how that information is cached for inspection 
by clients. We rarely expect this to be changed.");
 
+  // Ethan: this should be an internal config.
   public static final ConfigProperty<String> AUTO_COMMIT_ENABLE = 
ConfigProperty
       .key("hoodie.auto.commit")
       .defaultValue("true")
       .withDocumentation("Controls whether a write operation should auto 
commit. This can be turned off to perform inspection"
           + " of the uncommitted write before deciding to commit.");
 
+  // Ethan: we never talk about this in docs :)

Review Comment:
   yes that coz Uber extends this :)



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -211,6 +214,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies when #simpleIndexUseCaching is set. 
Determine what level of persistence is used to cache input RDDs. "
           + "Refer to org.apache.spark.storage.StorageLevel for different 
values");
 
+  // Ethan: should we remove these two and avoid data quality issues?

Review Comment:
   yes.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -521,11 +545,13 @@ public class HoodieWriteConfig extends HoodieConfig {
    * Given the importance of supporting such cases for the user's migration to 
0.5.x, we are proposing a safety flag
    * (disabled by default) which will allow this old behavior.
    */
+  // Ethan: is this still needed?

Review Comment:
   this is used by uber. we should confirm with them.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?
   public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
       .key("hoodie.archive.automatic")
       .defaultValue("true")
       .withDocumentation("When enabled, the archival table service is invoked 
immediately after each commit,"
           + " to archive commits if we cross a maximum value of commits."
           + " It's recommended to enable this, to ensure number of active 
commits is bounded.");
 
+  // Ethan: Is Async archive tested?

Review Comment:
   mark for table service reliability efforts. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
  * Configs/params used for internal purposes.
  */
 public class HoodieInternalConfig extends HoodieConfig {
+  // Some internal configs (?)
 
   private static final long serialVersionUID = 0L;
 
+  // Ethan: why do we need this? BulkInsertPartitioner can already indicate 
`arePartitionRecordsSorted()`
   public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = 
"hoodie.bulkinsert.are.partitioner.records.sorted";
   public static final Boolean 
DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
 
+  // Ethan: why do we need this?

Review Comment:
   we should try and remove this.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieHBaseIndexConfig.java:
##########
@@ -36,6 +36,7 @@
         + "(when HBase based indexing is enabled), which tags incoming "
         + "records as either inserts or updates to older records.")
 public class HoodieHBaseIndexConfig extends HoodieConfig {
+  // Configs for HBase Index

Review Comment:
   lets leave this class alone.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -539,6 +565,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Whether to allow generation of empty commits, even 
if no data was written in the commit. "
           + "It's useful in cases where extra metadata needs to be published 
regardless e.g tracking source offsets when ingesting data");
 
+  // Ethan: maybe this can auto turned on for CDC?

Review Comment:
   idk if we add this for CDC. do we?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -148,6 +149,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "When true, interval tree based file pruning optimization is 
enabled. "
           + "This mode speeds-up file-pruning based on key ranges when 
compared with the brute-force mode");
 
+  // Ethan: remove this config and clean up relevant logic?

Review Comment:
   advanced



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed 
during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this 
one intends to control as well?
   public static final ConfigProperty<String> 
MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
       .key("hoodie.merge.allow.duplicate.on.inserts")
       .defaultValue("false")
       .withDocumentation("When enabled, we allow duplicate keys even if 
inserts are routed to merge with an existing file (for ensuring file sizing)."
           + " This is only relevant for insert operation, since upsert, delete 
operations will ensure unique key constraints are maintained.");
 
+  // Ethan: should we make the default 0, to reduce the write amplication and 
make MOR write fast?

Review Comment:
   leave this alone.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size 
grows."
           + " This config controls the maximum number of instants to retain in 
the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove 
this?

Review Comment:
   tbh idk if this is a real problem to go after.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour 
to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition 
of which index is divided by 24 and the remainder is equal to the current 
hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config 
to indicate the purpose:

Review Comment:
   30 is kind of arbitrary.. We should rethink this fundamentally.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Control to enable release all persist rdds when the 
spark job finish.");
 
+  // Ethan: shall we remove this and do auto adjustment everywhere?

Review Comment:
   we should to the extent that its reliable.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -90,6 +91,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("End partition used to filter partition (inclusive), 
only effective when the filter mode '"
           + PLAN_PARTITION_FILTER_MODE + "' is " + 
ClusteringPlanPartitionFilterMode.SELECTED_PARTITIONS.name());
 
+  // Ethan: Should this be aligned with default parquet base file size?

Review Comment:
   I would read the code though to understand if it has side effects i.e the 
Hudi will always leave 100MB files, so will clustering ever kick in? we need to 
trigger clustering based on col stats and how clustered the table is. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -66,14 +66,18 @@
     areCommonConfigs = true,
     description = "")
 public class HoodieLockConfig extends HoodieConfig {
+  // Configs for common and legacy lock providers
 
+  // Ethan: nit: this is initial delay
   public static final ConfigProperty<String> 
LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS = ConfigProperty
       .key(LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY)
       .defaultValue(DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS)
       .sinceVersion("0.8.0")
       .withDocumentation("Initial amount of time to wait between retries to 
acquire locks, "
           + " subsequent retries will exponentially backoff.");
 
+  // Ethan: default should be larger? like 16s?  we already have exponential 
backoff. sth like 1 -> 2 -> 4 -> 8 -> 16 -> 16 ...
+  // currently, retry times is 15.  The change increases the retry period from 
~1min to ~4min.

Review Comment:
   +1



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When inserted records share same key, controls 
whether they should be first combined (i.e de-duplicated) before"
           + " writing to storage.");
 
+  // Ethan: should we have one "hoodie.combine.before.write" for all change 
operations instead of one config per operation?

Review Comment:
   leave this alone?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
     description = "Payload related configs, that can be leveraged to "
         + "control merges based on specific business fields in the data.")
 public class HoodiePayloadConfig extends HoodieConfig {
+  // Configs for Hudi payload class
 
+  // Ethan: shall we use preCombine 
("hoodie.datasource.write.precombine.field") only and remove this?
+  // It acts as the same purpose as the precombine field.  This confuses user 
too.

Review Comment:
   could we infer this ?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be 
straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to 
schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute 
compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other
+  // hoodie.compact.schedule.inline
+  // hoodie.compact.inline
+  // hoodie.datasource.compaction.async.enable
+  // // these are reused for inline/async
+  // hoodie.compact.inline.trigger.strategy
+  // hoodie.compact.inline.max.delta.seconds
+  // hoodie.compact.inline.max.delta.commits
   public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
       .key("hoodie.compact.inline")
       .defaultValue("false")

Review Comment:
   `"hoodie.compact.schedule.inline"` can be redone similar to how cleaner, 
others are. Smaller bandaid?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies if index type is BUCKET. Determine the 
number of buckets in the hudi table, "
           + "and each partition is divided to N buckets.");
 
+  // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets": 
e.g., min=0.5*num, max=4*num?

Review Comment:
   yeah having some default, but overridde-able is ok



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLayoutConfig.java:
##########
@@ -37,12 +38,15 @@
     description = "Configurations that control storage layout and data 
distribution, "
         + "which defines how the files are organized within a table.")
 public class HoodieLayoutConfig extends HoodieConfig {
+  // Configs for Hudi storage layer
 
+  // Ethan: should this be a table config? I assume this cannot be changed 
once determined

Review Comment:
   yes. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour 
to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition 
of which index is divided by 24 and the remainder is equal to the current 
hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config 
to indicate the purpose:

Review Comment:
   These groups were added to run clustering in smaller rounds, for large scale 
use-cases. We should make it simple for out of box medium scale data



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java:
##########
@@ -38,13 +38,15 @@
     groupName = ConfigGroups.Names.SPARK_DATASOURCE,
     description = "The following set of configurations help validate new data 
before commits.")
 public class HoodiePreCommitValidatorConfig extends HoodieConfig {
+  // Configs for precommit validator
 
   public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES = 
ConfigProperty
       .key("hoodie.precommit.validators")
       .defaultValue("")
       .withDocumentation("Comma separated list of class names that can be 
invoked to validate commit");
   public static final String VALIDATOR_TABLE_VARIABLE = "<TABLE_NAME>";
 
+  // Ethan: it seems that these SQL query configs are not used?  ALso, could 
we have one config for all SQL queries?

Review Comment:
   +1 for simplification



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  // Ethan: look like we can remove this as it is mature?

Review Comment:
   restore does not use this. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -154,6 +161,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("Timeout in ms, to wait for establishing connection 
with Zookeeper.");
 
+  // Ethan: remove ZK_PORT so that ZK_CONNECT_URL also includes the port 
number?

Review Comment:
   I would leave this alone. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -212,6 +220,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "Hudi stores all the main meta-data about commits, savepoints, 
cleaning audit logs "
           + "etc in .hoodie directory under this base path directory.");
 
+  // Ethan: is this still necessary?

Review Comment:
   advanced config. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -177,6 +181,9 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "this queue has no need for additional memory and cpu resources 
due to lock or multithreading, but also lost some benefits such as speed limit. 
"
           + "Although DISRUPTOR is still experimental.");
 
+  // Ethan: Should this be automatically determined based on the partition and 
record key column(s)?
+  // e.g., single column -> simple key gen, multiple column -> complex key gen

Review Comment:
   +1.. other cases, users can explicitly configure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] vinothchandar commented on a diff in pull request #7912: [HUDI-5739] Review Hudi Config Defaults

Reply via email to