vinothchandar commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r619575338
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##########
@@ -59,6 +60,8 @@
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field";
   public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns";
+  public static final String HOODIE_TABLE_ROWKEY_FIELDS = "hoodie.table.rowkey.fields";
+  public static final String HOODIE_TABLE_SCHEMA = "hoodie.table.schema";
Review comment:
rename: `hoodie.table.create.schema` to be clear that this is the schema
at creation.
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##########
@@ -591,6 +591,8 @@ public static PropertyBuilder withPropertyBuilder() {
private HoodieTableType tableType;
private String tableName;
+ private String tableSchema;
Review comment:
Just calling out that these fields are only set and used by the SQL tables.
If someone has an existing Hudi table that they now want to operate on using
SQL, do we need some code in `UpgradeDowngrade`? Something to think about,
and at least file a new JIRA for.
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##########
@@ -174,7 +174,8 @@ public boolean commitStats(String instantTime, List<HoodieWriteStat> stats, Opti
       String commitActionType, Map<String, List<String>> partitionToReplaceFileIds) {
     // Create a Hoodie table which encapsulated the commits and files visible
     HoodieTable table = createTable(config, hadoopConf);
-    HoodieCommitMetadata metadata = CommitUtils.buildMetadata(stats, partitionToReplaceFileIds, extraMetadata, operationType, config.getSchema(), commitActionType);
+    HoodieCommitMetadata metadata = CommitUtils.buildMetadata(stats, partitionToReplaceFileIds,
+        extraMetadata, operationType, config.getWriteSchema(), commitActionType);
Review comment:
Below you mentioned that `writeSchema` will not be the same as
`inputSchema`, right? I think `inputSchema` is what will be equal to the
table schema, no?
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##########
@@ -59,6 +60,8 @@
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field";
   public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns";
+  public static final String HOODIE_TABLE_ROWKEY_FIELDS = "hoodie.table.rowkey.fields";
Review comment:
rename: `hoodie.table.recordkey.fields` to be consistent with `HoodieKey`
terminology.
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/HoodieLazyInsertIterable.java
##########
@@ -42,6 +43,8 @@
public abstract class HoodieLazyInsertIterable<T extends HoodieRecordPayload>
extends LazyIterableIterator<HoodieRecord<T>, List<WriteStatus>> {
+ private static final Properties EMPTY_PROPERTIES = new Properties();
Review comment:
Why don't we actually use `Option.empty()`? If you really want to do this,
can you move this to `CollectionUtils`?
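If the shared constant is kept, the hoisting suggested above might look like the following sketch (hypothetical class name and placement; the actual contents of Hudi's `CollectionUtils` are not shown in this diff). Note that `java.util.Properties` is mutable, which is one argument for preferring `Option.empty()`:

```java
import java.util.Properties;

// Hypothetical sketch of the suggested hoisting; not Hudi's actual CollectionUtils.
public final class CollectionUtilsSketch {
    private CollectionUtilsSketch() {
        // utility class, no instances
    }

    // One shared instance instead of a per-class copy. Properties is mutable,
    // so every caller must treat this as read-only.
    public static final Properties EMPTY_PROPERTIES = new Properties();
}
```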
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
##########
@@ -192,7 +193,7 @@ protected void initializeIncomingRecordsMap() {
     long memoryForMerge = IOUtils.getMaxMemoryPerPartitionMerge(taskContextSupplier, config.getProps());
     LOG.info("MaxMemoryPerPartitionMerge => " + memoryForMerge);
     this.keyToNewRecords = new ExternalSpillableMap<>(memoryForMerge, config.getSpillableMapBasePath(),
-        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(writerSchema));
+        new DefaultSizeEstimator(), new HoodieRecordSizeEstimator(inputSchema));
Review comment:
Understood the need for `writeSchema`; we are on the same page there.
On `inputSchema`: is this always equal to the table's schema? Right now,
the table's schema is persisted in each commit metadata file, but during each
write the user can also supply a schema using `hoodie.avro.schema`. Is
`hoodie.avro.schema` what you are calling `inputSchema`?
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java
##########
@@ -212,7 +230,11 @@ public Schema getTableAvroSchemaWithoutMetadataFields(HoodieInstant instant) thr
    */
   private Option<Schema> getTableSchemaFromCommitMetadata(boolean includeMetadataFields) {
     HoodieTimeline timeline = metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();
-    return getTableSchemaFromCommitMetadata(timeline.lastInstant().get(), includeMetadataFields);
+    if (timeline.lastInstant().isPresent()) {
Review comment:
just `return timeline.lastInstant().map(i ->
getTableSchemaFromCommitMetadata(i, includeMetadataFields))` ?
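For illustration, here is the same refactor using `java.util.Optional` (Hudi ships its own `org.apache.hudi.common.util.Option`, so the names below are stand-ins): when the mapped function itself returns an `Optional` — as `getTableSchemaFromCommitMetadata(instant, ...)` returns an `Option<Schema>` — `flatMap` is needed to avoid a doubly-wrapped result:

```java
import java.util.Optional;

// Illustrative sketch with java.util.Optional, not Hudi's Option type.
public class OptionalMapDemo {
    // A mapper that itself returns an Optional, analogous to
    // getTableSchemaFromCommitMetadata(instant, ...) returning Option<Schema>.
    static Optional<Integer> parse(String s) {
        try {
            return Optional.of(Integer.parseInt(s));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    static Optional<Integer> lastAsInt(Optional<String> last) {
        // flatMap collapses Optional<Optional<Integer>> to Optional<Integer>,
        // replacing an explicit isPresent()/get() branch.
        return last.flatMap(OptionalMapDemo::parse);
    }
}
```

If Hudi's `Option` lacks a `flatMap`, the explicit `isPresent()` branch in the diff is a reasonable fallback.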
##########
File path:
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/UuidKeyGenerator.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.UUID;
+import java.util.stream.Collectors;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+
+/**
+ * A KeyGenerator which uses a UUID as the record key.
+ */
+public class UuidKeyGenerator extends BuiltinKeyGenerator {
Review comment:
Could we make this key be prefixed by the commit time? This would result in
ordered keys, which will improve range/index lookups for updates/deletes down
the line. Random UUID ranges will always overlap each other, so given a key
`K`, it is very likely that `minKey < K < maxKey` holds for almost all files
written this way.
Actually, `_hoodie_commit_seqno` would be a good key, for example, for
insert-only datasets. Can we generate something like that instead of a UUID,
which has poor ordering?
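A minimal sketch of the idea suggested above (the class name and key format are assumptions for illustration, not Hudi's API): prefix each key with the commit instant time plus a zero-padded sequence number, so keys from one write sort together and later writes sort after earlier ones, much like `_hoodie_commit_seqno`:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: commit-time-prefixed keys instead of random UUIDs.
// "instantTime" stands in for Hudi's commit timestamp (e.g. "20210423153000").
public class TimePrefixedKeyGen {
    private final String instantTime;
    private final AtomicLong seq = new AtomicLong();

    public TimePrefixedKeyGen(String instantTime) {
        this.instantTime = instantTime;
    }

    public String nextKey() {
        // Zero-pad the sequence so lexicographic order matches numeric order;
        // keys then sort by (commit time, write order), giving tight,
        // non-overlapping [minKey, maxKey] ranges per file.
        return String.format("%s_%010d", instantTime, seq.getAndIncrement());
    }
}
```

With random UUIDs, every file's key range spans essentially the whole key space; with this scheme, a file's range is bounded by the commits it contains, so a range/bloom lookup can prune most files.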
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##########
@@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) {
return this;
}
+ public PropertyBuilder setTableSchema(String tableSchema) {
Review comment:
Okay got it. makes sense.
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java
##########
@@ -97,4 +86,22 @@ public DefaultHoodieRecordPayload(Option<GenericRecord> record) {
}
return metadata.isEmpty() ? Option.empty() : Option.of(metadata);
}
+
+ protected boolean needUpdatePersistedRecord(IndexedRecord currentValue,
Review comment:
rename: `needUpdatingPersistedRecord()`
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##########
@@ -617,6 +619,16 @@ public PropertyBuilder setTableName(String tableName) {
return this;
}
+ public PropertyBuilder setTableSchema(String tableSchema) {
+ this.tableSchema = tableSchema;
+ return this;
+ }
+
+ public PropertyBuilder setRowKeyFields(String rowKeyFields) {
Review comment:
You are right. What I meant was whether we can persist the key generator
class used, in general. But I think this is okay.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]