yihua commented on code in PR #7912:
URL: https://github.com/apache/hudi/pull/7912#discussion_r1128765984
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -93,6 +98,7 @@ public class HoodieArchivalConfig extends HoodieConfig {
.withDocumentation("When enable, hoodie will auto merge several small
archive files into larger one. It's"
+ " useful when storage scheme doesn't support append operation.");
+ // Ethan: Once this is well tested, we should remove the feature flag
Review Comment:
@nsivabalan could we remove this config now?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -108,6 +110,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Partitions to run clustering");
+ // Ethan: enum instead of class name to determine strategy type? So this
can be engine-agnostic
Review Comment:
Tracked in HUDI-5892
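For illustration, a rough sketch of an enum-keyed, engine-agnostic strategy
config, reusing the ConfigProperty builder pattern quoted in these diffs; the
ClusteringStrategyType enum, key name, and default are hypothetical, not
existing Hudi API:

  // Hypothetical enum; each engine maps a value to its own implementation
  // class, so users no longer pass an engine-specific class name.
  public enum ClusteringStrategyType {
    SIZE_BASED, SORT_BASED
  }

  public static final ConfigProperty<String> PLAN_STRATEGY_TYPE = ConfigProperty
      .key("hoodie.clustering.plan.strategy.type")
      .defaultValue(ClusteringStrategyType.SIZE_BASED.name())
      .withDocumentation("Type of clustering plan strategy, resolved to an"
          + " engine-specific implementation at runtime.");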
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE =
ConfigProperty
.key("hoodie.cleaner.parallelism")
.defaultValue("200")
.withDocumentation("Parallelism for the cleaning operation. Increase
this if cleaning becomes slow.");
+ // Ethan: wondering if this is still needed.
Review Comment:
@nsivabalan do you think we still need this?
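As a rough sketch of the auto-derivation idea (method and parameter names here
are hypothetical, not Hudi API), the parallelism could be bounded by both the
amount of work and the engine's own parallelism instead of a static 200:

  // Derive cleaner parallelism at runtime instead of using a fixed default.
  static int deriveCleanerParallelism(int numFilesToClean, int engineDefaultParallelism) {
    // Never schedule more tasks than there are files to clean, and never
    // exceed what the engine (e.g. Spark's default parallelism) can run.
    return Math.max(1, Math.min(numFilesToClean, engineDefaultParallelism));
  }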
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -201,6 +210,8 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Determines how to handle updates, deletes to file
groups that are under clustering."
+ " Default strategy just rejects the update");
+ // Ethan: these configs of inline, async, schedule, execute can be merged
into one mode config:
Review Comment:
Tracked in HUDI-5896
##########
hudi-aws/src/main/java/org/apache/hudi/config/HoodieAWSConfig.java:
##########
@@ -41,20 +42,23 @@
*/
@Immutable
@ConfigClassProperty(name = "Amazon Web Services Configs",
- groupName = ConfigGroups.Names.AWS,
- description = "Amazon Web Services configurations to access resources
like Amazon DynamoDB (for locks),"
- + " Amazon CloudWatch (metrics).")
+ groupName = ConfigGroups.Names.AWS,
+ description = "Amazon Web Services configurations to access resources like
Amazon DynamoDB (for locks),"
+ + " Amazon CloudWatch (metrics).")
public class HoodieAWSConfig extends HoodieConfig {
+ // AWS Access Configs
+
+ // Ethan: These configs should only be necessary if the corresponding
environment variables are not set (to confirm with code)
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -196,6 +222,7 @@ public class HoodieCompactionConfig extends HoodieConfig {
.withDocumentation("New optimized scan for log blocks that handles all
multi-writer use-cases while appending to log files. "
+ "It also differentiates original blocks written by ingestion
writers and compacted blocks written log compaction.");
+ // Ethan: These deprecated configs should be removed. Similar in other
classes
Review Comment:
Leave these alone for 0.x.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Class to use for reading the bootstrap dataset
partitions/files, for Bootstrap mode FULL_RECORD");
+ // Ethan: can this be aligned with key gen for write (no need for a separate
bootstrap key gen config)?
public static final ConfigProperty<String> KEYGEN_CLASS_NAME = ConfigProperty
.key("hoodie.bootstrap.keygen.class")
.noDefaultValue()
.sinceVersion("0.6.0")
.withDocumentation("Key generator implementation to be used for
generating keys from the bootstrapped dataset");
+ // Ethan: similar here
public static final ConfigProperty<String> KEYGEN_TYPE = ConfigProperty
.key("hoodie.bootstrap.keygen.type")
.defaultValue(KeyGeneratorType.SIMPLE.name())
.sinceVersion("0.9.0")
.withDocumentation("Type of build-in key generator, currently support
SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
+ // Ethan: better default translator: if seeing hive-style partitioning,
automatically parse the partition value?
public static final ConfigProperty<String>
PARTITION_PATH_TRANSLATOR_CLASS_NAME = ConfigProperty
.key("hoodie.bootstrap.partitionpath.translator.class")
.defaultValue(IdentityBootstrapPartitionPathTranslator.class.getName())
.sinceVersion("0.6.0")
.withDocumentation("Translates the partition paths from the bootstrapped
data into how is laid out as a Hudi table.");
+ // Ethan: can this be aligned with Spark parallelism?
Review Comment:
Tracked in HUDI-5894
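On the "better default translator" idea above, a minimal sketch (hypothetical
helper, not an existing Hudi translator) of detecting hive-style partitioning
and keeping only the partition values:

  // "country=US/dt=2023-01-01" -> "US/2023-01-01";
  // non hive-style segments pass through unchanged.
  static String translateHiveStylePath(String bootstrapPartitionPath) {
    StringBuilder translated = new StringBuilder();
    for (String segment : bootstrapPartitionPath.split("/")) {
      int eq = segment.indexOf('=');
      if (translated.length() > 0) {
        translated.append('/');
      }
      translated.append(eq >= 0 ? segment.substring(eq + 1) : segment);
    }
    return translated.toString();
  }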
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "Consistent hashing supports dynamic resizing of the number of
bucket, solving potential data skew and file size "
+ "issues of the SIMPLE hashing engine. Consistent hashing only
works with MOR tables, only use simple hashing on COW tables.");
+ // Ethan: is the default too large?
Review Comment:
@yuzhaojing @minihippo could you confirm if the default is too large for a
medium-sized table?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -77,6 +82,7 @@ public class HoodieMemoryConfig extends HoodieConfig {
.defaultValue(16 * 1024 * 1024)
.withDocumentation("Property to control the max memory in bytes for dfs
input stream buffer size");
+ // Ethan: auto configured using table base path?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -135,6 +141,7 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("For Hive based lock provider, the Hive metastore URI
to acquire locks against.");
+ // Ethan: set this automatically based on the table base path?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -279,6 +288,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.defaultValue("0")
.withDocumentation("Parallelism used for “delete” operation. Delete
operations also performs shuffles, similar to upsert operation.");
+ // Ethan: does this need to be automatically set based on the input like
other parallelism?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java:
##########
@@ -65,6 +66,7 @@ public class HoodieWriteCommitCallbackConfig extends
HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Http callback API key.
hudi_write_commit_http_callback by default");
+ // Ethan: default is too low? 30s instead?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -329,18 +341,22 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Write status objects hold metadata about a write
(stats, errors), that is not yet committed to storage. "
+ "This controls the how that information is cached for inspection
by clients. We rarely expect this to be changed.");
+ // Ethan: this should be an internal config.
public static final ConfigProperty<String> AUTO_COMMIT_ENABLE =
ConfigProperty
.key("hoodie.auto.commit")
.defaultValue("true")
.withDocumentation("Controls whether a write operation should auto
commit. This can be turned off to perform inspection"
+ " of the uncommitted write before deciding to commit.");
+ // Ethan: we never talk about this in docs :)
public static final ConfigProperty<String> WRITE_STATUS_CLASS_NAME =
ConfigProperty
.key("hoodie.writestatus.class")
.defaultValue(WriteStatus.class.getName())
.withDocumentation("Subclass of " + WriteStatus.class.getName() + " to
be used to collect information about a write. Can be "
+ "overridden to collection additional metrics/statistics about the
data if needed.");
+ // Ethan: similarly, we should revisit all parallelism and see if they can
be auto-configured
+ // instead of a static default.
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -289,6 +299,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.defaultValue(String.valueOf(4 * 1024 * 1024))
.withDocumentation("Size of in-memory buffer used for parallelizing
network reads and lake storage writes.");
+ // Ethan: is 1024B too small as the default?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -501,6 +524,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on
the table and exactly one of them succeed "
+ "if a conflict (writes affect the same file group) is detected.");
+ // Ethan: need to check why we need both "hoodie.avro.schema" and
"hoodie.write.schema".
Review Comment:
Will leave the schema configs alone for now.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
public static final ConfigProperty<Boolean> AUTO_ADJUST_LOCK_CONFIGS =
ConfigProperty
.key("hoodie.auto.adjust.lock.configs")
.defaultValue(false)
.sinceVersion("0.11.0")
.withDocumentation("Auto adjust lock configurations when metadata table
is enabled and for async table services.");
+ // Ethan: should we remove this and make the validation mandatory?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -375,6 +391,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.9.0")
.withDocumentation("The batch interval in milliseconds for marker
creation batch processing");
+ // Ethan: similar here for revisiting parallelism
Review Comment:
Tracked in HUDI-5894
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -40,6 +40,7 @@
+ "This table maintains the metadata about a given Hudi table (e.g
file listings) "
+ " to avoid overhead of accessing cloud storage, during queries.")
public final class HoodieMetadataConfig extends HoodieConfig {
+ //cfg
Review Comment:
Tracked in HUDI-5900
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -162,6 +163,7 @@ public class HoodieTableConfig extends HoodieConfig {
.noDefaultValue()
.withDocumentation("Version of timeline used, by the table.");
+ // Ethan: to be merged with write payload class
Review Comment:
Tracked in HUDI-5892
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity
units when using PROVISIONED billing mode");
+ // Ethan: 10 min. Is this timeout too long?
Review Comment:
Tracked in HUDI-5782
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity
units when using PROVISIONED billing mode");
+ // Ethan: 10 min. Is this timeout too long?
public static final ConfigProperty<String>
DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT = ConfigProperty
.key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table_creation_timeout")
.defaultValue(String.valueOf(10 * 60 * 1000))
.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, the maximum number
of milliseconds to wait for creating DynamoDB table");
+ // Ethan: Move the infer logic here
Review Comment:
Tracked in HUDI-5892
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -40,10 +40,13 @@
+ " between writers to a Hudi table. Concurrency between Hudi's own
table services "
+ " are auto managed internally.")
public class DynamoDbBasedLockConfig extends HoodieConfig {
+ // Config for DynamoDB based lock provider
// configs for DynamoDb based locks
public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX
+ "dynamodb.";
+ // Ethan: Can we provide a default value here that is less likely to collide,
+ // so this config also becomes optional?
Review Comment:
Tracked in HUDI-5782
##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -78,6 +81,7 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
.withDocumentation("For DynamoDB based lock provider, the region used in
endpoint for Amazon DynamoDB service."
+ " Would try to first get it from AWS_REGION
environment variable. If not find, by default use us-east-1");
+ // Ethan: Need to confirm these are good defaults for cost
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
groupName = ConfigGroups.Names.WRITE_CLIENT,
description = "Configurations that control archival.")
public class HoodieArchivalConfig extends HoodieConfig {
+ //cfg
+ // Ethan: Should we remove this? Archive should always be enabled?
public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
.key("hoodie.archive.automatic")
.defaultValue("true")
.withDocumentation("When enabled, the archival table service is invoked
immediately after each commit,"
+ " to archive commits if we cross a maximum value of commits."
+ " It's recommended to enable this, to ensure number of active
commits is bounded.");
+ // Ethan: Is Async archive tested?
Review Comment:
Essential/advanced config tracked in HUDI-5893.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
+ " keep the metadata overhead constant, even as the table size
grows."
+ " This config controls the maximum number of instants to retain in
the active timeline. ");
+ // Ethan: Can we use Spark parallelism or auto-derive the configs and remove
this?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
groupName = ConfigGroups.Names.WRITE_CLIENT,
description = "Configurations that control archival.")
public class HoodieArchivalConfig extends HoodieConfig {
+ //cfg
+ // Ethan: Should we remove this? Archive should always be enabled?
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
+ " keep the metadata overhead constant, even as the table size
grows."
+ " This config controls the maximum number of instants to retain in
the active timeline. ");
+ // Ethan: Can we use Spark parallelism or auto-derive the configs and remove
this?
public static final ConfigProperty<Integer>
DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
.key("hoodie.archive.delete.parallelism")
.defaultValue(100)
.withDocumentation("Parallelism for deleting archived hoodie commits.");
+ // Ethan: pick one of "hoodie.keep.min.commits" and
"hoodie.keep.max.commits" only?
Review Comment:
Agree. We can leave this as is.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
.sinceVersion("0.6.0")
.withDocumentation("Class to use for reading the bootstrap dataset
partitions/files, for Bootstrap mode FULL_RECORD");
+ // Ethan: can this be aligned with key gen for write (no need for a separate
bootstrap key gen config)?
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -58,12 +61,16 @@ public class HoodieCleanConfig extends HoodieConfig {
.withDocumentation("Only applies when " + AUTO_CLEAN.key() + " is turned
on. "
+ "When turned on runs cleaner async with writing, which can speed
up overall write performance.");
+ // Ethan: when either "hoodie.cleaner.commits.retained",
"hoodie.cleaner.hours.retained",
Review Comment:
Yes. Tracked in HUDI-5892.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -84,11 +91,13 @@ public class HoodieCleanConfig extends HoodieConfig {
.withDocumentation("Controls how cleaning is scheduled. Valid options: "
+
Arrays.stream(CleaningTriggerStrategy.values()).map(Enum::name).collect(Collectors.joining(",")));
+ // Ethan: "hoodie.clean.trigger.max.commits" is what this means
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -46,6 +46,7 @@
+ "The bootstrap operation can flexibly avoid copying data over before
you can use Hudi and support running the existing "
+ " writers and new hudi writers in parallel, to validate the
migration.")
public class HoodieBootstrapConfig extends HoodieConfig {
+ // Configs for Bootstrap feature
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -41,10 +42,12 @@
*/
@Immutable
@ConfigClassProperty(name = "Clean Configs",
- groupName = ConfigGroups.Names.WRITE_CLIENT,
- description = "Cleaning (reclamation of older/unused file
groups/slices).")
+ groupName = ConfigGroups.Names.WRITE_CLIENT,
+ description = "Cleaning (reclamation of older/unused file groups/slices).")
public class HoodieCleanConfig extends HoodieConfig {
+ // Configs for Clean action
+ // Ethan: should we standardize on `.enable` instead of `.automatic`?
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
+ "failed writes to re-claim space. Choose to perform this rollback
of failed writes eagerly before "
+ "every writer starts (only supported for single writer) or lazily
by the cleaner (required for multi-writers)");
+ // Ethan: same here. Can we reuse Spark parallelism or derive automatically?
public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE =
ConfigProperty
.key("hoodie.cleaner.parallelism")
.defaultValue("200")
.withDocumentation("Parallelism for the cleaning operation. Increase
this if cleaning becomes slow.");
+ // Ethan: wondering if this is still needed.
public static final ConfigProperty<Boolean> ALLOW_MULTIPLE_CLEANS =
ConfigProperty
.key("hoodie.clean.allow.multiple")
.defaultValue(true)
.sinceVersion("0.11.0")
.withDocumentation("Allows scheduling/executing multiple cleans by
enabling this config. If users prefer to strictly ensure clean requests should
be mutually exclusive, "
+ ".i.e. a 2nd clean will not be scheduled if another clean is not
yet completed to avoid repeat cleaning of same files, they might want to
disable this config.");
+ // Ethan: should the default behavior be deleting original parquet files and
then we can remove this config?
Review Comment:
Tracked in HUDI-5895
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "DAY_ROLLING: clustering partitions on a rolling basis by the hour
to avoid clustering all partitions each time, "
+ "which strategy sorts the partitions asc and chooses the partition
of which index is divided by 24 and the remainder is equal to the current
hour.");
+ // Ethan: all these fine-tuning knobs can be simplified if we have a config
to indicate the purpose:
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -131,18 +134,21 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Turn on inline clustering - clustering will be run
after each write operation is complete")
.withAlternatives("hoodie.datasource.clustering.inline.enable");
+ // Ethan: this can apply to both inline and async clustering. Rename it for
better readability.
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -47,6 +47,7 @@
description = "Configurations that control the clustering table service in
hudi, "
+ "which optimizes the storage layout for better query performance by
sorting and sizing data files.")
public class HoodieClusteringConfig extends HoodieConfig {
+ // Configs for Clustering operations
Review Comment:
Tracked in HUDI-5897
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -218,6 +229,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
.withDocumentation("Enable running of clustering service, asynchronously
as inserts happen on the table.")
.withAlternatives("hoodie.datasource.clustering.async.enable");
+ // Ethan: this should be removed. User should not tweak this to corrupt the
hoodie_commit_time meta field.
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
description = "Configurations that control compaction "
+ "(merging of log files onto a new base files).")
public class HoodieCompactionConfig extends HoodieConfig {
-
+ // Configs for compaction
+
+ // Ethan: Similar to clustering, the compaction configs should be
straightened up to be more concise
+ // Something more intuitive like:
+ // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to
schedule compaction plan
+ // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute
compaction plan
+ // // compaction parameters should be independent of inline/async
+ // hoodie.compact.trigger.strategy
+ // hoodie.compact.max.delta.seconds
+ // hoodie.compact.max.delta.commits
+ // ...
+ // instead of many configs that overlap with each other
+ // hoodie.compact.schedule.inline
+ // hoodie.compact.inline
+ // hoodie.datasource.compaction.async.enable
+ // // these are reused for inline/async
+ // hoodie.compact.inline.trigger.strategy
+ // hoodie.compact.inline.max.delta.seconds
+ // hoodie.compact.inline.max.delta.commits
public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
.key("hoodie.compact.inline")
.defaultValue("false")
Review Comment:
Tracked in HUDI-5896
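To make the proposal concrete, a rough sketch of the first suggested key,
following the ConfigProperty pattern used throughout these classes (the key,
enum and default are the comment's proposal, not a shipped config):

  // Hypothetical mode enum for the consolidated scheduling config.
  public enum CompactionRunMode {
    INLINE, ASYNC, DISABLED
  }

  public static final ConfigProperty<String> COMPACT_SCHEDULE_MODE = ConfigProperty
      .key("hoodie.compact.schedule.mode")
      .defaultValue(CompactionRunMode.DISABLED.name())
      .withDocumentation("How the compaction plan is scheduled: INLINE, ASYNC or DISABLED."
          + " Intended to replace the overlapping hoodie.compact.schedule.inline,"
          + " hoodie.compact.inline and hoodie.datasource.compaction.async.enable flags.");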
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
.withDocumentation("Only applies if index type is BUCKET. Determine the
number of buckets in the hudi table, "
+ "and each partition is divided to N buckets.");
+ // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets":
e.g., min=0.5*num, max=4*num?
Review Comment:
Tracked in HUDI-5782
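For reference, the suggested derivation works out as below (a sketch under the
0.5x/4x factors from the comment, not a shipped default):

  // Derive min/max bucket defaults from hoodie.bucket.index.num.buckets.
  static int defaultMinBuckets(int numBuckets) {
    return Math.max(1, numBuckets / 2);  // min = 0.5 * num
  }

  static int defaultMaxBuckets(int numBuckets) {
    return 4 * numBuckets;               // max = 4 * num
  }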
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
description = "Configurations that control compaction "
+ "(merging of log files onto a new base files).")
public class HoodieCompactionConfig extends HoodieConfig {
-
+ // Configs for compaction
+
+ // Ethan: Similar to clustering, the compaction configs should be
straightened up to be more concise
+ // Something more intuitive like:
+ // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to
schedule compaction plan
+ // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute
compaction plan
+ // // compaction parameters should be independent of inline/async
+ // hoodie.compact.trigger.strategy
+ // hoodie.compact.max.delta.seconds
+ // hoodie.compact.max.delta.commits
+ // ...
+ // instead of many configs that overlap with each other
Review Comment:
Tracked in HUDI-5898
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -175,6 +198,9 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ " record sizes. It's recommended to keep this turned on, since
hand tuning is otherwise extremely"
+ " cumbersome.");
+ // Ethan: Should this be removed? This should only be used by the first
batch. For first batch,
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
.withDocumentation("Only applies if index type is BUCKET. Determine the
number of buckets in the hudi table, "
+ "and each partition is divided to N buckets.");
+ // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets":
e.g., min=0.5*num, max=4*num?
Review Comment:
@yuzhaojing @minihippo any suggestion on the default value here?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -132,13 +151,15 @@ public class HoodieCompactionConfig extends HoodieConfig {
+ "compaction during each compaction run. By default. Hudi picks the
log file "
+ "with most accumulated unmerged data");
+ // Ethan: this seems to be internal
public static final ConfigProperty<String> COMPACTION_LAZY_BLOCK_READ_ENABLE
= ConfigProperty
.key("hoodie.compaction.lazy.block.read")
.defaultValue("true")
.withDocumentation("When merging the delta log files, this config helps
to choose whether the log blocks "
+ "should be read lazily or not. Choose true to use lazy block
reading (low memory usage, but incurs seeks to each block"
+ " header) or false for immediate block read (higher memory
usage)");
+ // Ethan: this seems to be internal
Review Comment:
The reverse log traversal was introduced by #331. @n3nash could you provide
context around this? Is it still useful? How stable has it been in any
previous production rollout?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieHBaseIndexConfig.java:
##########
@@ -36,6 +36,7 @@
+ "(when HBase based indexing is enabled), which tags incoming "
+ "records as either inserts or updates to older records.")
public class HoodieHBaseIndexConfig extends HoodieConfig {
+ // Configs for HBase Index
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -177,6 +179,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "When true, the incoming writes will cached to speed up index
lookup by reducing IO "
+ "for computing parallelism or affected partitions");
+ // Ethan: should the parallelism below be dynamically determined by Spark?
Review Comment:
Tracked in HUDI-5894
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
* Configs/params used for internal purposes.
*/
public class HoodieInternalConfig extends HoodieConfig {
+ // Some internal configs (?)
private static final long serialVersionUID = 0L;
+ // Ethan: why do we need this? BulkInsertPartitioner can already indicate
`arePartitionRecordsSorted()`
public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED =
"hoodie.bulkinsert.are.partitioner.records.sorted";
public static final Boolean
DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
+ // Ethan: why do we need this?
Review Comment:
@nsivabalan I see this config was introduced by #2328 for bulk insert, which
you reviewed. Do you know if this is still needed?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java:
##########
@@ -38,13 +38,15 @@
groupName = ConfigGroups.Names.SPARK_DATASOURCE,
description = "The following set of configurations help validate new data
before commits.")
public class HoodiePreCommitValidatorConfig extends HoodieConfig {
+ // Configs for precommit validator
public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES =
ConfigProperty
.key("hoodie.precommit.validators")
.defaultValue("")
.withDocumentation("Comma separated list of class names that can be
invoked to validate commit");
public static final String VALIDATOR_TABLE_VARIABLE = "<TABLE_NAME>";
+ // Ethan: it seems that these SQL query configs are not used? Also, could
we have one config for all SQL queries?
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -177,6 +181,9 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "this queue has no need for additional memory and cpu resources
due to lock or multithreading, but also lost some benefits such as speed limit.
"
+ "Although DISRUPTOR is still experimental.");
+ // Ethan: Should this be automatically determined based on the partition and
record key column(s)?
+ // e.g., single column -> simple key gen, multiple columns -> complex key gen
Review Comment:
Tracked in HUDI-5892
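A rough sketch of the auto-selection idea (the helper is hypothetical; the
KeyGeneratorType values are the ones quoted elsewhere in this review):

  // Infer the key generator type from the shape of the record key field config.
  static KeyGeneratorType inferKeyGeneratorType(String recordKeyFields) {
    // A comma-separated list implies a composite key -> COMPLEX;
    // a single field -> SIMPLE.
    return recordKeyFields.contains(",") ? KeyGeneratorType.COMPLEX : KeyGeneratorType.SIMPLE;
  }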
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -449,6 +469,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Table services such as compaction and clustering can
fail and prevent syncing to "
+ "the metaclient. Set this to true to fail writes when table
services fail");
+ // Ethan: do we still need all of the consistency check logic given major
cloud storage are now strongly consistent?
Review Comment:
Tracked in HUDI-5899
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -154,6 +161,7 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("Timeout in ms, to wait for establishing connection
with Zookeeper.");
+ // Ethan: remove ZK_PORT so that ZK_CONNECT_URL also includes the port
number?
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When true, spins up an instance of the timeline
server (meta server that serves cached file listings, statistics),"
+ "running on each writer's driver process, accepting requests
during the write from executors.");
+ // Ethan: is this still useful?
Review Comment:
Tracked in HUDI-5892 for now
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLayoutConfig.java:
##########
@@ -37,12 +38,15 @@
description = "Configurations that control storage layout and data
distribution, "
+ "which defines how the files are organized within a table.")
public class HoodieLayoutConfig extends HoodieConfig {
+ // Configs for Hudi storage layer
+ // Ethan: should this be a table config? I assume this cannot be changed
once determined
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -139,6 +141,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("Field used in preCombining before actual write. When
two records have the same key value, "
+ "we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)");
+ // Ethan: regarding the namespace, should we use "hoodie.write." instead of
"hoodie.datasource.write." as the prefix?
Review Comment:
Tracked in HUDI-5896
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -102,6 +111,7 @@ public class HoodieCleanConfig extends HoodieConfig {
+ " in the timeline, since the last cleaner run. This is much more
efficient than obtaining listings for the full"
+ " table for each planning (even with a metadata table).");
+ // Ethan: this should be an internal config, based on concurrency control
mode. Should not be tweaked by user.
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -111,11 +116,12 @@ public class HoodieLockConfig extends HoodieConfig {
.sinceVersion("0.8.0")
.withDocumentation("For DFS based lock providers, path to store the
locks under. use Table's meta path as default");
+ // Ethan: should the default also allow expiration? Need to dig into the
code to understand what this controls.
public static final ConfigProperty<Integer> FILESYSTEM_LOCK_EXPIRE =
ConfigProperty
- .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
- .defaultValue(0)
- .sinceVersion("0.12.0")
- .withDocumentation("For DFS based lock providers, expire time in
minutes, must be a nonnegative number, default means no expire");
+ .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
+ .defaultValue(0)
+ .sinceVersion("0.12.0")
+ .withDocumentation("For DFS based lock providers, expire time in
minutes, must be a nonnegative number, default means no expire");
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
* Configs/params used for internal purposes.
*/
public class HoodieInternalConfig extends HoodieConfig {
+ // Some internal configs (?)
private static final long serialVersionUID = 0L;
+ // Ethan: why do we need this? BulkInsertPartitioner can already indicate
`arePartitionRecordsSorted()`
public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED =
"hoodie.bulkinsert.are.partitioner.records.sorted";
public static final Boolean
DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
+ // Ethan: why do we need this?
Review Comment:
Tracked in HUDI-5782
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -62,11 +64,14 @@ public class HoodieMemoryConfig extends HoodieConfig {
// Minimum memory size (100MB) for the spillable map.
public static final long DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES = 100
* 1024 * 1024L;
+ // Ethan: Should this be automatically configured as a fraction of memory
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
Review Comment:
Tracked in HUDI-5892 for now
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When true, spins up an instance of the timeline
server (meta server that serves cached file listings, statistics),"
+ "running on each writer's driver process, accepting requests
during the write from executors.");
+ // Ethan: is this still useful?
Review Comment:
@bvaradar should we remove all relevant logic on reusing the timeline server?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
public static final ConfigProperty<String>
MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
.key("hoodie.merge.allow.duplicate.on.inserts")
.defaultValue("false")
.withDocumentation("When enabled, we allow duplicate keys even if
inserts are routed to merge with an existing file (for ensuring file sizing)."
+ " This is only relevant for insert operation, since upsert, delete
operations will ensure unique key constraints are maintained.");
+ // Ethan: should we make the default 0, to reduce the write amplification
and make MOR writes fast?
Review Comment:
Sounds good.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -212,6 +220,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "Hudi stores all the main meta-data about commits, savepoints,
cleaning audit logs "
+ "etc in .hoodie directory under this base path directory.");
+ // Ethan: is this still necessary?
Review Comment:
Tracked in HUDI-5893
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the
spark job finish.");
+ // Ethan: shall we remove this and do auto adjustment everywhere?
Review Comment:
@nsivabalan let's discuss this too
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When enabled, data validation checks are performed
during merges to ensure expected "
+ "number of records after merge operation.");
+ // Ethan: should "hoodie.combine.before.insert" controls the behavior this
one intends to control as well?
Review Comment:
I'll go through the code. @nsivabalan do you have any concern around this?
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -66,14 +66,18 @@
areCommonConfigs = true,
description = "")
public class HoodieLockConfig extends HoodieConfig {
+ // Configs for common and legacy lock providers
+ // Ethan: nit: this is the initial delay
public static final ConfigProperty<String>
LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS = ConfigProperty
.key(LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY)
.defaultValue(DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS)
.sinceVersion("0.8.0")
.withDocumentation("Initial amount of time to wait between retries to
acquire locks, "
+ " subsequent retries will exponentially backoff.");
+ // Ethan: should the default be larger, like 16s? We already have exponential
backoff, something like 1 -> 2 -> 4 -> 8 -> 16 -> 16 ...
+ // Currently, the retry count is 15. The change increases the retry period
from ~1min to ~4min.
Review Comment:
Tracked in HUDI-5782
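The timing in the comment roughly checks out; a small worked sketch
(standalone helper, assuming a 1s initial wait doubling up to a 16s cap over
15 retries, as suggested):

  // Sum the waits of a capped exponential backoff: 1s, 2s, 4s, 8s, then 16s flat.
  static long totalBackoffMillis(long initialMs, long capMs, int retries) {
    long total = 0;
    long wait = initialMs;
    for (int i = 0; i < retries; i++) {
      total += wait;
      wait = Math.min(wait * 2, capMs);
    }
    return total;
  }
  // totalBackoffMillis(1000, 16000, 15) == 191000 ms, just over 3 minutes,
  // consistent with the ~1min -> ~4min change described above.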
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
+ "**Note** This is being actively worked on. Please use "
+ "`hoodie.datasource.write.keygenerator.class` instead.");
+ // Ethan: looks like we can remove this as it is mature?
Review Comment:
Tracked in HUDI-5782. Likely we'll leave this alone.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -53,12 +56,13 @@ public class HoodiePayloadConfig extends HoodieConfig {
.withDocumentation("Table column/field name to derive timestamp
associated with the records. This can"
+ "be useful for e.g, determining the freshness of the table.");
+ // Ethan: remove this and align with the write payload class
("hoodie.datasource.write.payload.class")?
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
.withDocumentation("When inserted records share same key, controls
whether they should be first combined (i.e de-duplicated) before"
+ " writing to storage.");
+ // Ethan: should we have one "hoodie.combine.before.write" for all change
operations instead of one config per operation?
Review Comment:
Sounds good. Keep them the same.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
description = "Payload related configs, that can be leveraged to "
+ "control merges based on specific business fields in the data.")
public class HoodiePayloadConfig extends HoodieConfig {
+ // Configs for Hudi payload class
+ // Ethan: shall we use preCombine
("hoodie.datasource.write.precombine.field") only and remove this?
+ // It serves the same purpose as the precombine field. This confuses users
too.
Review Comment:
Tracked in HUDI-5892
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
+ "Consistent hashing supports dynamic resizing of the number of
bucket, solving potential data skew and file size "
+ "issues of the SIMPLE hashing engine. Consistent hashing only
works with MOR tables, only use simple hashing on COW tables.");
+ // Ethan: is the default too large?
Review Comment:
Tracked in HUDI-5782