[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358057459
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/TimelineLayout.java
 ##
 @@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table;
+
+import org.apache.hudi.common.model.TimelineLayoutVersion;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+
+import java.io.Serializable;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+
+/**
+ * Timeline Layout responsible for applying specific filters when generating 
timeline instants.
+ */
+public abstract class TimelineLayout implements Serializable {
+
+  private static final Map LAYOUT_MAP = 
new HashMap<>();
+
+  static {
+LAYOUT_MAP.put(new TimelineLayoutVersion(TimelineLayoutVersion.VERSION_0), 
new TimelineLayoutV0());
+LAYOUT_MAP.put(new TimelineLayoutVersion(TimelineLayoutVersion.VERSION_1), 
new TimelineLayoutV1());
+  }
+
+  public static TimelineLayout getLayout(TimelineLayoutVersion version) {
+return LAYOUT_MAP.get(version);
+  }
+
+  public abstract Stream 
filterHoodieInstants(Stream instantStream);
+
+  /**
+   * Table Layout where state transitions are managed by renaming files.
+   */
+  private static class TimelineLayoutV0 extends TimelineLayout {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358011854
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -414,23 +436,41 @@ public String getCommitActionType() {
   /**
* Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects.
*
-   * @param fs FileSystem
-   * @param metaPath Meta Path where hoodie instants are present
* @param includedExtensions Included hoodie extensions
+   * @param applyLayoutVersionFilters Depending on Timeline layout version, if 
there are multiple states for the same
+   * action instant, only include the highest state
* @return List of Hoodie Instants generated
* @throws IOException in case of failure
*/
-  public static List 
scanHoodieInstantsFromFileSystem(FileSystem fs, Path metaPath,
-  Set includedExtensions) throws IOException {
-return Arrays.stream(HoodieTableMetaClient.scanFiles(fs, metaPath, path -> 
{
-  // Include only the meta files with extensions that needs to be included
-  String extension = FSUtils.getFileExtension(path.getName());
-  return includedExtensions.contains(extension);
-})).sorted(Comparator.comparing(
-// Sort the meta-data by the instant time (first part of the file name)
-fileStatus -> FSUtils.getInstantTime(fileStatus.getPath().getName(
-// create HoodieInstantMarkers from FileStatus, which extracts 
properties
-.map(HoodieInstant::new).collect(Collectors.toList());
+  public List scanHoodieInstantsFromFileSystem(Set 
includedExtensions,
+  boolean applyLayoutVersionFilters) throws IOException {
+return scanHoodieInstantsFromFileSystem(new Path(metaPath), 
includedExtensions, applyLayoutVersionFilters);
+  }
+
+  /**
+   * Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects.
+   *
+   * @param timelinePath MetaPath where instant files are stored
+   * @param includedExtensions Included hoodie extensions
+   * @param applyLayoutVersionFilters Depending on Timeline layout version, if 
there are multiple states for the same
+   * action instant, only include the highest state
+   * @return List of Hoodie Instants generated
+   * @throws IOException in case of failure
+   */
+  public List scanHoodieInstantsFromFileSystem(Path 
timelinePath, Set includedExtensions,
+  boolean applyLayoutVersionFilters) throws IOException {
+Stream instantStream = Arrays.stream(
+HoodieTableMetaClient
+.scanFiles(getFs(), timelinePath, path -> {
+  // Include only the meta files with extensions that needs to be 
included
+  String extension = FSUtils.getFileExtension(path.getName());
+  return includedExtensions.contains(extension);
+})).map(HoodieInstant::new);
+
+if (applyLayoutVersionFilters) {
+  instantStream = 
TimelineLayout.getLayout(getTimelineLayoutVersion()).filterHoodieInstants(instantStream);
 
 Review comment:
   The key case here is for archival where we need all instants without 
filtering. May be introduce couple of factory methods which instantiate 
HoodieActiveTimeline w/o filtering ?  HUDI-414


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358009213
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
 ##
 @@ -112,6 +116,10 @@ public static void createHoodieProperties(FileSystem fs, 
Path metadataFolder, Pr
   if (!properties.containsKey(HOODIE_ARCHIVELOG_FOLDER_PROP_NAME)) {
 properties.setProperty(HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
DEFAULT_ARCHIVELOG_FOLDER);
   }
+  if (!properties.containsKey(HOODIE_TIMELINE_LAYOUT_VERSION)) {
 
 Review comment:
   The version in hoodie.properties is treated as default configuration. 
MetaClient Writers override this config with version-1 (unless they opt out). 
MetaClient readers will always use latest version (version - 1) which is 
backwards compatible. Keeping a version in hoodie.properties would also 
hopefully be useful in future migrations


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007869
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -274,25 +286,35 @@ public synchronized HoodieArchivedTimeline 
getArchivedTimeline() {
*/
   public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath, String tableType,
   String tableName, String archiveLogFolder) throws IOException {
-HoodieTableType type = HoodieTableType.valueOf(tableType);
-Properties properties = new Properties();
-properties.put(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, tableName);
-properties.put(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, type.name());
-properties.put(HoodieTableConfig.HOODIE_ARCHIVELOG_FOLDER_PROP_NAME, 
archiveLogFolder);
-return HoodieTableMetaClient.initDatasetAndGetMetaClient(hadoopConf, 
basePath, properties);
+return initTableType(hadoopConf, basePath, 
HoodieTableType.valueOf(tableType), tableName,
+archiveLogFolder, null, null);
   }
 
   /**
* Helper method to initialize a given path, as a given storage type and 
table name.
*/
   public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath,
   HoodieTableType tableType, String tableName, String payloadClassName) 
throws IOException {
+return initTableType(hadoopConf, basePath, tableType, tableName, null, 
payloadClassName, null);
+  }
+
+  public static HoodieTableMetaClient initTableType(Configuration hadoopConf, 
String basePath,
+  HoodieTableType tableType, String tableName, String archiveLogFolder, 
String payloadClassName,
+  Integer timelineLayoutVersion) throws IOException {
 Properties properties = new Properties();
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_NAME_PROP_NAME, 
tableName);
 properties.setProperty(HoodieTableConfig.HOODIE_TABLE_TYPE_PROP_NAME, 
tableType.name());
-if (tableType == HoodieTableType.MERGE_ON_READ) {
+if ((tableType == HoodieTableType.MERGE_ON_READ) && (payloadClassName != 
null)) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007840
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
 ##
 @@ -37,6 +37,7 @@
  "default": null
   },
   {
+ /** DEPRECATED **/
 
 Review comment:
   Yeah, This is was intended to be equivalent of HoodieCommitMetadata but 
since we are using HoodieCommitMetadata for storing results of compaction, this 
is never used. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007852
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TimelineLayoutVersion.java
 ##
 @@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.google.common.base.Preconditions;
+
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Metadata Layout Version. Add new version when timeline format changes
+ */
+public class TimelineLayoutVersion implements Serializable, 
Comparable {
+
+  public static final Integer VERSION_0 = 0; // pre 0.5.1  version format
+  public static final Integer VERSION_1 = 1; // current version with no renames
+
+  public static final Integer CURR_VERSION = VERSION_1;
+  public static final TimelineLayoutVersion LATEST_TIMELINE_LAYOUT_VERSION = 
new TimelineLayoutVersion(CURR_VERSION);
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007673
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
 ##
 @@ -74,6 +75,27 @@
  "name":"version",
  "type":["int", "null"],
  "default": 1
+  },
+  {
+ "name":"hoodieCompactionPlan",
+ "type":[
+"null",
+"HoodieCompactionPlan"
+ ],
+ "default": null
+  },
+  {
+ "name":"hoodieCleanerPlan",
+ "type":[
+"null",
+"HoodieCleanerPlan"
+ ],
+ "default": null
+  },
+  {
+ "name":"actionState",
 
 Review comment:
   Yes, both actionType and instantTime are already available. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007647
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -733,6 +746,11 @@ public HoodieWriteConfig build() {
   setDefaultOnCondition(props, !isConsistencyGuardSet,
   ConsistencyGuardConfig.newBuilder().fromProperties(props).build());
 
+  String layoutVersion = 
props.getProperty(OVERRIDDEN_TIMELINE_LAYOUT_VERSION);
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007621
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -48,6 +49,7 @@
 public class HoodieWriteConfig extends DefaultHoodieConfig {
 
   public static final String TABLE_NAME = "hoodie.table.name";
+  private static final String OVERRIDDEN_TIMELINE_LAYOUT_VERSION = 
"hoodie.timeline.layout.version";
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007569
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
 ##
 @@ -32,7 +35,24 @@
  *
  * @see HoodieTimeline
  */
-public class HoodieInstant implements Serializable {
+public class HoodieInstant implements Serializable, Comparable {
+
+  /**
+   * A COMPACTION action eventually becomes COMMIT when completed. So, when 
grouping instants
+   * for state transitions, this needs to be taken into account
+   */
+  private static final Map COMPARABLE_ACTIONS = new 
ImmutableMap.Builder()
+  .put(HoodieTimeline.COMPACTION_ACTION, 
HoodieTimeline.COMMIT_ACTION).build();
+
+  public static final Comparator ACTION_COMPARATOR =
+  Comparator.comparing(instant -> 
getCompatibleAction(instant.getAction()));
+
+  public static final Comparator COMPARATOR = 
Comparator.comparing(HoodieInstant::getTimestamp)
+  .thenComparing(ACTION_COMPARATOR).thenComparing(HoodieInstant::getState);
+
+  public static final String getCompatibleAction(String action) {
 
 Review comment:
   Picked Comparable. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007553
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -792,23 +798,25 @@ public void restoreToInstant(final String instantTime) 
throws HoodieRollbackExce
 .filter(instant -> 
HoodieActiveTimeline.GREATER.test(instant.getTimestamp(), instantTime))
 .collect(Collectors.toList());
 // Start a rollback instant for all commits to be rolled back
-String startRollbackInstant = startInstant();
+String startRollbackInstant = HoodieActiveTimeline.createNewCommitTime();
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007581
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -61,11 +62,11 @@
 
   public static final SimpleDateFormat COMMIT_FORMATTER = new 
SimpleDateFormat("MMddHHmmss");
 
-  public static final Set VALID_EXTENSIONS_IN_ACTIVE_TIMELINE =
-  new HashSet<>(Arrays.asList(new String[] {COMMIT_EXTENSION, 
INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION,
-  INFLIGHT_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, 
INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION,
-  INFLIGHT_CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, 
INFLIGHT_COMPACTION_EXTENSION,
-  REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, 
RESTORE_EXTENSION}));
+  public static final Set VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(
+  new String[]{COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, 
REQUESTED_COMMIT_EXTENTSION, DELTA_COMMIT_EXTENSION,
+  INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENTSION, 
SAVEPOINT_EXTENSION,
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007546
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/DatasetsCommand.java
 ##
 @@ -85,6 +86,8 @@ public String createTable(
   @CliOption(key = {"tableName"}, mandatory = true, help = "Hoodie Table 
Name") final String name,
   @CliOption(key = {"tableType"}, unspecifiedDefaultValue = 
"COPY_ON_WRITE",
   help = "Hoodie Table Type. Must be one of : COPY_ON_WRITE or 
MERGE_ON_READ") final String tableTypeStr,
+  @CliOption(key = {"archiveLogFolder"}, help = "Folder Name for storing 
archived timeline") String archiveFolder,
+  @CliOption(key = {"layoutVersion"}, help = "Specific Layouyt Version to 
use") Integer layoutVersion,
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-15 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r358007597
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -61,11 +62,11 @@
 
   public static final SimpleDateFormat COMMIT_FORMATTER = new 
SimpleDateFormat("MMddHHmmss");
 
-  public static final Set VALID_EXTENSIONS_IN_ACTIVE_TIMELINE =
-  new HashSet<>(Arrays.asList(new String[] {COMMIT_EXTENSION, 
INFLIGHT_COMMIT_EXTENSION, DELTA_COMMIT_EXTENSION,
-  INFLIGHT_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, 
INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION,
-  INFLIGHT_CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, 
INFLIGHT_COMPACTION_EXTENSION,
-  REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, 
RESTORE_EXTENSION}));
+  public static final Set VALID_EXTENSIONS_IN_ACTIVE_TIMELINE = new 
HashSet<>(Arrays.asList(
+  new String[]{COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, 
REQUESTED_COMMIT_EXTENTSION, DELTA_COMMIT_EXTENSION,
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-10 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r356340283
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -824,14 +843,14 @@ private String startInstant() {
   "Found commits after time :" + lastCommit + ", please rollback 
greater commits first");
 }
 
-List inflights =
-
inflightCommitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList());
+List inflights = 
inflightAndRequestedCommitTimeline.getInstants().map(HoodieInstant::getTimestamp)
 
 Review comment:
   Yes, it should work fine. For restoring (reverting completed) compactions, 
we would still delete log files as part of first rolling-back the inflight 
state of the compaction before we try to rollback requested state. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354245375
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
 ##
 @@ -30,7 +34,23 @@
  *
  * @see HoodieTimeline
  */
-public class HoodieInstant implements Serializable {
+public class HoodieInstant implements Serializable, Comparable {
+
+  /**
+   * A COMPACTION action eventually becomes COMMIT when completed. So, when 
grouping instants
+   * for state transitions, this needs to be taken into account
+   */
+  private static final Set comparableActions = Arrays.stream(new 
String[] { HoodieTimeline.COMMIT_ACTION,
 
 Review comment:
   Added a jira to track this : https://jira.apache.org/jira/browse/HUDI-384


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354243394
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -312,8 +332,11 @@ public HoodieInstant 
transitionCompactionInflightToComplete(HoodieInstant inflig
   }
 
   private void createFileInAuxiliaryFolder(HoodieInstant instant, 
Option data) {
-Path fullPath = new Path(metaClient.getMetaAuxiliaryPath(), 
instant.getFileName());
-createFileInPath(fullPath, data);
+if (metaClient.getMetadataVersion().isNullVersion()) {
+  // No need for auxiliary folder for newer versions
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354241852
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -212,25 +224,24 @@ public void saveAsComplete(HoodieInstant instant, 
Option data) {
 log.info("Completed " + instant);
   }
 
-  public void revertToInflight(HoodieInstant instant) {
-log.info("Reverting " + instant + " to inflight ");
-revertStateTransition(instant, HoodieTimeline.getInflightInstant(instant));
-log.info("Reverted " + instant + " to inflight");
-  }
-
-  public HoodieInstant revertToRequested(HoodieInstant instant) {
-log.warn("Reverting " + instant + " to requested ");
-HoodieInstant requestedInstant = 
HoodieTimeline.getRequestedInstant(instant);
-revertStateTransition(instant, 
HoodieTimeline.getRequestedInstant(instant));
-log.warn("Reverted " + instant + " to requested");
-return requestedInstant;
+  public HoodieInstant revertToInflight(HoodieInstant instant) {
+log.info("Reverting instant to inflight " + instant);
+HoodieInstant inflight = HoodieTimeline.getInflightInstant(instant, 
metaClient.getTableType());
+revertCompleteToInflight(instant, inflight);
 
 Review comment:
   Deletes are atomic in GCS but not so in S3. Thought about this. It shouldn't 
bring in new failure scenarios here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354242118
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -260,8 +271,13 @@ private void deleteInstantFile(HoodieInstant instant) {
 
   /** BEGIN - COMPACTION RELATED META-DATA MANAGEMENT **/
 
-  public Option getInstantAuxiliaryDetails(HoodieInstant instant) {
-Path detailPath = new Path(metaClient.getMetaAuxiliaryPath(), 
instant.getFileName());
+  public Option getPlanDetailsInBytes(HoodieInstant instant) {
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354240466
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTimeline.java
 ##
 @@ -250,7 +253,10 @@ static HoodieInstant getCompactionInflightInstant(final 
String timestamp) {
 return new HoodieInstant(State.INFLIGHT, COMPACTION_ACTION, timestamp);
   }
 
-  static HoodieInstant getInflightInstant(final HoodieInstant instant) {
+  static HoodieInstant getInflightInstant(final HoodieInstant instant, final 
HoodieTableType tableType) {
+if ((tableType == HoodieTableType.MERGE_ON_READ) && 
instant.getAction().equals(COMMIT_ACTION)) {
 
 Review comment:
   Added


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354240802
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -151,17 +174,6 @@ public HoodieTimeline getDeltaCommitTimeline() {
 (Function> & Serializable) 
this::getInstantDetails);
   }
 
-  /**
 
 Review comment:
   Reverted.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354236756
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -420,17 +427,50 @@ public String getCommitActionType() {
* @return List of Hoodie Instants generated
* @throws IOException in case of failure
*/
-  public static List 
scanHoodieInstantsFromFileSystem(FileSystem fs, Path metaPath,
-  Set includedExtensions) throws IOException {
-return Arrays.stream(HoodieTableMetaClient.scanFiles(fs, metaPath, path -> 
{
-  // Include only the meta files with extensions that needs to be included
-  String extension = FSUtils.getFileExtension(path.getName());
-  return includedExtensions.contains(extension);
-})).sorted(Comparator.comparing(
-// Sort the meta-data by the instant time (first part of the file name)
-fileStatus -> FSUtils.getInstantTime(fileStatus.getPath().getName(
-// create HoodieInstantMarkers from FileStatus, which extracts 
properties
-.map(HoodieInstant::new).collect(Collectors.toList());
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions) throws 
IOException {
+return scanHoodieInstantsFromFileSystem(fs, metaPath, includedExtensions, 
true);
+  }
+
+  /**
+   * Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects
+   *
+   * @param fs FileSystem
+   * @param metaPath   Meta Path where hoodie instants are present
+   * @param includedExtensions Included hoodie extensions
+   * @param excludeIntermediateStates If there are multiple states for the 
same action instant,
+   *  only include the highest state
+   * @return List of Hoodie Instants generated
+   * @throws IOException in case of failure
+   */
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions, boolean 
excludeIntermediateStates)
+  throws IOException {
+Stream instantStream = Arrays.stream(
+HoodieTableMetaClient
+.scanFiles(fs, metaPath, path -> {
+  // Include only the meta files with extensions that needs to be 
included
+  String extension = FSUtils.getFileExtension(path.getName());
+  return includedExtensions.contains(extension);
+})).map(HoodieInstant::new);
+
+if (excludeIntermediateStates) {
+  // Remove intermediate states for each (ts,action) pair
+  instantStream = dedupeInstants(instantStream);
+}
+return instantStream.sorted().collect(Collectors.toList());
+  }
+
+  public static Stream dedupeInstants(Stream 
instantStream) {
+return instantStream.collect(Collectors.groupingBy(x -> 
Pair.of(x.getTimestamp(),
+x.getAction().equals(HoodieTimeline.COMPACTION_ACTION) ? 
HoodieTimeline.COMMIT_ACTION : x.getAction(
+.entrySet().stream().map(e -> e.getValue().stream().reduce((x, y) -> {
 
 Review comment:
   Fixed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354235768
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -420,17 +427,50 @@ public String getCommitActionType() {
* @return List of Hoodie Instants generated
* @throws IOException in case of failure
*/
-  public static List 
scanHoodieInstantsFromFileSystem(FileSystem fs, Path metaPath,
-  Set includedExtensions) throws IOException {
-return Arrays.stream(HoodieTableMetaClient.scanFiles(fs, metaPath, path -> 
{
-  // Include only the meta files with extensions that needs to be included
-  String extension = FSUtils.getFileExtension(path.getName());
-  return includedExtensions.contains(extension);
-})).sorted(Comparator.comparing(
-// Sort the meta-data by the instant time (first part of the file name)
-fileStatus -> FSUtils.getInstantTime(fileStatus.getPath().getName(
-// create HoodieInstantMarkers from FileStatus, which extracts 
properties
-.map(HoodieInstant::new).collect(Collectors.toList());
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions) throws 
IOException {
+return scanHoodieInstantsFromFileSystem(fs, metaPath, includedExtensions, 
true);
+  }
+
+  /**
+   * Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects
+   *
+   * @param fs FileSystem
+   * @param metaPath   Meta Path where hoodie instants are present
+   * @param includedExtensions Included hoodie extensions
+   * @param excludeIntermediateStates If there are multiple states for the 
same action instant,
+   *  only include the highest state
+   * @return List of Hoodie Instants generated
+   * @throws IOException in case of failure
+   */
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions, boolean 
excludeIntermediateStates)
+  throws IOException {
+Stream instantStream = Arrays.stream(
+HoodieTableMetaClient
+.scanFiles(fs, metaPath, path -> {
+  // Include only the meta files with extensions that needs to be 
included
+  String extension = FSUtils.getFileExtension(path.getName());
+  return includedExtensions.contains(extension);
+})).map(HoodieInstant::new);
+
+if (excludeIntermediateStates) {
+  // Remove intermediate states for each (ts,action) pair
+  instantStream = dedupeInstants(instantStream);
+}
+return instantStream.sorted().collect(Collectors.toList());
+  }
+
+  public static Stream dedupeInstants(Stream 
instantStream) {
 
 Review comment:
   I moved this logic to TableLayout. Let me know if interface needs more 
changes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354232466
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieMetadataVersion.java
 ##
 @@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.google.common.base.Preconditions;
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Metadata Layout Version. Add new version when timeline format changes
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354232343
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -108,16 +110,25 @@ private void close() {
*/
   public boolean archiveIfRequired(final JavaSparkContext jsc) throws 
IOException {
 try {
-  List instantsToArchive = 
getInstantsToArchive(jsc).collect(Collectors.toList());
+  List instantsToDelete = 
getInstantsToArchive(jsc).collect(Collectors.toList());
+  List instantsToArchive = instantsToDelete.stream()
+  .filter(HoodieInstant::isCompleted).collect(Collectors.toList());
+
+  Preconditions.checkArgument(instantsToArchive.isEmpty() == 
instantsToDelete.isEmpty(),
 
 Review comment:
   Made code changes to allow pending instants also be archived. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354231831
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -342,60 +365,94 @@ public HoodieInstant 
transitionCleanInflightToComplete(HoodieInstant inflightIns
* Transition Clean State from requested to inflight
*
* @param requestedInstant requested instant
+   * @param data Optional data to be stored
* @return commit instant
*/
-  public HoodieInstant transitionCleanRequestedToInflight(HoodieInstant 
requestedInstant) {
+  public HoodieInstant transitionCleanRequestedToInflight(HoodieInstant 
requestedInstant, Option data) {
 
Preconditions.checkArgument(requestedInstant.getAction().equals(HoodieTimeline.CLEAN_ACTION));
 Preconditions.checkArgument(requestedInstant.isRequested());
 HoodieInstant inflight = new HoodieInstant(State.INFLIGHT, CLEAN_ACTION, 
requestedInstant.getTimestamp());
-transitionState(requestedInstant, inflight, Option.empty());
+transitionState(requestedInstant, inflight, data);
 return inflight;
   }
 
 
   private void transitionState(HoodieInstant fromInstant, HoodieInstant 
toInstant, Option data) {
 
Preconditions.checkArgument(fromInstant.getTimestamp().equals(toInstant.getTimestamp()));
-Path commitFilePath = new Path(metaClient.getMetaPath(), 
toInstant.getFileName());
 try {
-  // Re-create the .inflight file by opening a new file and write the 
commit metadata in
-  Path inflightCommitFile = new Path(metaClient.getMetaPath(), 
fromInstant.getFileName());
-  createFileInMetaPath(fromInstant.getFileName(), data);
-  boolean success = metaClient.getFs().rename(inflightCommitFile, 
commitFilePath);
-  if (!success) {
-throw new HoodieIOException("Could not rename " + inflightCommitFile + 
" to " + commitFilePath);
+  if (metaClient.getMetadataVersion().isNullVersion()) {
+// Re-create the .inflight file by opening a new file and write the 
commit metadata in
+createFileInMetaPath(fromInstant.getFileName(), data, false);
+Path fromInstantPath = new Path(metaClient.getMetaPath(), 
fromInstant.getFileName());
+Path toInstantPath = new Path(metaClient.getMetaPath(), 
toInstant.getFileName());
+boolean success = metaClient.getFs().rename(fromInstantPath, 
toInstantPath);
+if (!success) {
+  throw new HoodieIOException("Could not rename " + fromInstantPath + 
" to " + toInstantPath);
+}
+  } else {
+// Ensures old state exists in timeline
+System.out.println("Checking for file exists ?" + new 
Path(metaClient.getMetaPath(),
+fromInstant.getFileName()));
+Preconditions.checkArgument(metaClient.getFs().exists(new 
Path(metaClient.getMetaPath(),
+fromInstant.getFileName(;
+// Use Write Once to create Target File
+writeFileOnceInPath(new Path(metaClient.getMetaPath(), 
toInstant.getFileName()), data);
+System.out.println("Create new file for toInstant ?" + new 
Path(metaClient.getMetaPath(), toInstant.getFileName()));
   }
 } catch (IOException e) {
   throw new HoodieIOException("Could not complete " + fromInstant, e);
 }
   }
 
-  private void revertStateTransition(HoodieInstant curr, HoodieInstant revert) 
{
-
Preconditions.checkArgument(curr.getTimestamp().equals(revert.getTimestamp()));
-Path revertFilePath = new Path(metaClient.getMetaPath(), 
revert.getFileName());
+  private void revertCompleteToInflight(HoodieInstant completed, HoodieInstant 
inflight) {
+
Preconditions.checkArgument(completed.getTimestamp().equals(inflight.getTimestamp()));
+Path inFlightCommitFilePath = new Path(metaClient.getMetaPath(), 
inflight.getFileName());
+Path commitFilePath = new Path(metaClient.getMetaPath(), 
completed.getFileName());
 try {
-  if (!metaClient.getFs().exists(revertFilePath)) {
-Path currFilePath = new Path(metaClient.getMetaPath(), 
curr.getFileName());
-boolean success = metaClient.getFs().rename(currFilePath, 
revertFilePath);
-if (!success) {
-  throw new HoodieIOException("Could not rename " + currFilePath + " 
to " + revertFilePath);
+  if (metaClient.getMetadataVersion().isNullVersion()) {
+if (!metaClient.getFs().exists(inFlightCommitFilePath)) {
+  boolean success = metaClient.getFs().rename(commitFilePath, 
inFlightCommitFilePath);
+  if (!success) {
+throw new HoodieIOException(
+"Could not rename " + commitFilePath + " to " + 
inFlightCommitFilePath);
+  }
 }
-log.info("Renamed " + currFilePath + " to " + revertFilePath);
+  } else {
+Path requestedInstantFilePath = new 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354230170
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -433,6 +494,32 @@ private void createFileInPath(Path fullPath, 
Option content) {
 }
   }
 
+  /**
+   * Creates a new file in timeline with overwrite set to false. This ensures
+   * files are created only once and never rewritten
+   * @param fullPath File Path
+   * @param content Content to be stored
+   */
+  private void writeFileOnceInPath(Path fullPath, Option content) {
 
 Review comment:
   @vinothchandar : Currently, we employ overwrite semantics for inflight 
commit/delta-commit files. We first create an empty inflight file and then 
overwrite it later with workload profile data. I can try cleaning that up in 
hoodie write client.  Let me know your thoughts ?
   
   @n3nash : Renamed the method. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354216985
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -342,60 +365,94 @@ public HoodieInstant 
transitionCleanInflightToComplete(HoodieInstant inflightIns
* Transition Clean State from requested to inflight
*
* @param requestedInstant requested instant
+   * @param data Optional data to be stored
* @return commit instant
*/
-  public HoodieInstant transitionCleanRequestedToInflight(HoodieInstant 
requestedInstant) {
+  public HoodieInstant transitionCleanRequestedToInflight(HoodieInstant 
requestedInstant, Option data) {
 
Preconditions.checkArgument(requestedInstant.getAction().equals(HoodieTimeline.CLEAN_ACTION));
 Preconditions.checkArgument(requestedInstant.isRequested());
 HoodieInstant inflight = new HoodieInstant(State.INFLIGHT, CLEAN_ACTION, 
requestedInstant.getTimestamp());
-transitionState(requestedInstant, inflight, Option.empty());
+transitionState(requestedInstant, inflight, data);
 return inflight;
   }
 
 
   private void transitionState(HoodieInstant fromInstant, HoodieInstant 
toInstant, Option data) {
 
Preconditions.checkArgument(fromInstant.getTimestamp().equals(toInstant.getTimestamp()));
-Path commitFilePath = new Path(metaClient.getMetaPath(), 
toInstant.getFileName());
 try {
-  // Re-create the .inflight file by opening a new file and write the 
commit metadata in
-  Path inflightCommitFile = new Path(metaClient.getMetaPath(), 
fromInstant.getFileName());
-  createFileInMetaPath(fromInstant.getFileName(), data);
-  boolean success = metaClient.getFs().rename(inflightCommitFile, 
commitFilePath);
-  if (!success) {
-throw new HoodieIOException("Could not rename " + inflightCommitFile + 
" to " + commitFilePath);
+  if (metaClient.getMetadataVersion().isNullVersion()) {
+// Re-create the .inflight file by opening a new file and write the 
commit metadata in
+createFileInMetaPath(fromInstant.getFileName(), data, false);
+Path fromInstantPath = new Path(metaClient.getMetaPath(), 
fromInstant.getFileName());
+Path toInstantPath = new Path(metaClient.getMetaPath(), 
toInstant.getFileName());
+boolean success = metaClient.getFs().rename(fromInstantPath, 
toInstantPath);
+if (!success) {
+  throw new HoodieIOException("Could not rename " + fromInstantPath + 
" to " + toInstantPath);
+}
+  } else {
+// Ensures old state exists in timeline
+System.out.println("Checking for file exists ?" + new 
Path(metaClient.getMetaPath(),
 
 Review comment:
   Thanks. Fixed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354216781
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -420,17 +427,50 @@ public String getCommitActionType() {
* @return List of Hoodie Instants generated
* @throws IOException in case of failure
*/
-  public static List 
scanHoodieInstantsFromFileSystem(FileSystem fs, Path metaPath,
-  Set includedExtensions) throws IOException {
-return Arrays.stream(HoodieTableMetaClient.scanFiles(fs, metaPath, path -> 
{
-  // Include only the meta files with extensions that needs to be included
-  String extension = FSUtils.getFileExtension(path.getName());
-  return includedExtensions.contains(extension);
-})).sorted(Comparator.comparing(
-// Sort the meta-data by the instant time (first part of the file name)
-fileStatus -> FSUtils.getInstantTime(fileStatus.getPath().getName(
-// create HoodieInstantMarkers from FileStatus, which extracts 
properties
-.map(HoodieInstant::new).collect(Collectors.toList());
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions) throws 
IOException {
+return scanHoodieInstantsFromFileSystem(fs, metaPath, includedExtensions, 
true);
+  }
+
+  /**
+   * Helper method to scan all hoodie-instant metafiles and construct 
HoodieInstant objects
+   *
+   * @param fs FileSystem
+   * @param metaPath   Meta Path where hoodie instants are present
+   * @param includedExtensions Included hoodie extensions
+   * @param excludeIntermediateStates If there are multiple states for the 
same action instant,
+   *  only include the highest state
+   * @return List of Hoodie Instants generated
+   * @throws IOException in case of failure
+   */
+  public static List scanHoodieInstantsFromFileSystem(
+  FileSystem fs, Path metaPath, Set includedExtensions, boolean 
excludeIntermediateStates)
+  throws IOException {
+Stream instantStream = Arrays.stream(
+HoodieTableMetaClient
+.scanFiles(fs, metaPath, path -> {
+  // Include only the meta files with extensions that needs to be 
included
+  String extension = FSUtils.getFileExtension(path.getName());
+  return includedExtensions.contains(extension);
+})).map(HoodieInstant::new);
+
+if (excludeIntermediateStates) {
+  // Remove intermediate states for each (ts,action) pair
+  instantStream = dedupeInstants(instantStream);
+}
+return instantStream.sorted().collect(Collectors.toList());
+  }
+
+  public static Stream dedupeInstants(Stream 
instantStream) {
+return instantStream.collect(Collectors.groupingBy(x -> 
Pair.of(x.getTimestamp(),
+x.getAction().equals(HoodieTimeline.COMPACTION_ACTION) ? 
HoodieTimeline.COMMIT_ACTION : x.getAction(
 
 Review comment:
   Refactored to reuse this logic.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350891309
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -731,6 +739,12 @@ public boolean rollbackToSavepoint(String savepointTime) {
* file,
*/
   public boolean rollback(final String commitTime) throws 
HoodieRollbackException {
+try {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351017008
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieMetadataVersion.java
 ##
 @@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.google.common.base.Preconditions;
+import java.io.Serializable;
+import java.util.Objects;
+
+/**
+ * Metadata Layout Version. Add new version when timeline format changes
+ */
+public class HoodieMetadataVersion implements Serializable, 
Comparable {
+
+  public static Integer VERSION_0 = 0; // pre 0.5.1  version format
+  public static Integer VERSION_1 = 1; // current version with no renames
+
+  public static Integer CURR_VERSION = VERSION_1;
+
+  private Integer version;
+
+  public HoodieMetadataVersion(Integer version) {
+Preconditions.checkArgument(version <= CURR_VERSION);
+Preconditions.checkArgument(version >= VERSION_0);
+this.version = version;
+  }
+
+  /**
+   * For Pre 0.5.1 release, there was no metadata version. This method is used 
to detect
+   * this case.
+   * @return
+   */
+  public boolean isNullVersion() {
+return Objects.equals(version, VERSION_0);
+  }
+
+  public Integer getVersion() {
+return version;
+  }
+
+  @Override
+  public boolean equals(Object o) {
+if (this == o) {
 
 Review comment:
   If all the callers do use Objects.equals(), then we should be ok. But, then 
there are still equals calls in Collections (e:g HashMap) where they use 
obj1.equals(obj2).  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354204782
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTimeline.java
 ##
 @@ -250,7 +253,10 @@ static HoodieInstant getCompactionInflightInstant(final 
String timestamp) {
 return new HoodieInstant(State.INFLIGHT, COMPACTION_ACTION, timestamp);
   }
 
-  static HoodieInstant getInflightInstant(final HoodieInstant instant) {
+  static HoodieInstant getInflightInstant(final HoodieInstant instant, final 
HoodieTableType tableType) {
 
 Review comment:
   Thought about keeping in metaclient but the current structure is such that 
all file-name generation depending on state is present in HoodieTimeline. 
Metaclient APIs are currently  one level higher dealing with timelines, table 
and others. So, leaving it this way.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350897448
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -192,6 +210,7 @@ private boolean deleteArchivedInstants(List 
archivedInstants) thr
   return i.isCompleted() && 
(i.getAction().equals(HoodieTimeline.COMMIT_ACTION)
   || (i.getAction().equals(HoodieTimeline.DELTA_COMMIT_ACTION)));
 }).max(Comparator.comparing(HoodieInstant::getTimestamp)));
+log.info("Last Committed Instant =" + latestCommitted);
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351014321
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/TestClientRollback.java
 ##
 @@ -217,10 +217,12 @@ public void testRollbackCommit() throws Exception {
 
   // simulate partial failure, where .inflight was not deleted, but data 
files were.
   HoodieTestUtils.createInflightCommitFiles(basePath, commitTime3);
+  Thread.sleep(1000);
 
 Review comment:
   Modified rollback() code to use the non-conflicting creatNewCommitTime API 
and also removed these lines


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354196674
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -345,6 +346,9 @@ public static SparkConf registerClasses(SparkConf conf) {
 final List fileIDPrefixes =
 IntStream.range(0, parallelism).mapToObj(i -> 
FSUtils.createNewFileIdPfx()).collect(Collectors.toList());
 
+table.getActiveTimeline().transitionRequestedToInflight(new 
HoodieInstant(State.REQUESTED,
 
 Review comment:
   Agree. This would make state transitions more manageable. Filed a jira to 
track this as a follow-up : https://jira.apache.org/jira/browse/HUDI-383


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351022082
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
 ##
 @@ -30,7 +34,23 @@
  *
  * @see HoodieTimeline
  */
-public class HoodieInstant implements Serializable {
+public class HoodieInstant implements Serializable, Comparable {
+
+  /**
+   * A COMPACTION action eventually becomes COMMIT when completed. So, when 
grouping instants
+   * for state transitions, this needs to be taken into account
+   */
+  private static final Set comparableActions = Arrays.stream(new 
String[] { HoodieTimeline.COMMIT_ACTION,
+  HoodieTimeline.COMPACTION_ACTION}).collect(Collectors.toSet());
+
+  public static final Comparator comparator = 
Comparator.comparing(HoodieInstant::getTimestamp)
 
 Review comment:
   Enums have a natural ordering defined by their ordinal. In our case, State 
is an enum and Requested < Inflight < Completed 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354200362
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -896,17 +919,22 @@ private void finishRestore(final Timer.Context context, 
Map stats = doRollbackAndGetStats(commitToRollback);
-  Map> statToCommit = new HashMap<>();
-  finishRollback(context, stats, Arrays.asList(commitToRollback), 
startRollbackTime);
+  // Create a Hoodie table which encapsulated the commits and files visible
+  HoodieTable table = HoodieTable.getHoodieTable(
+  createMetaClient(true), config, jsc);
+  List rollbackInstants = 
table.getActiveTimeline().getCommitsTimeline().getInstants()
+  .filter(instant -> 
HoodieActiveTimeline.EQUAL.test(instant.getTimestamp(), commitToRollback))
+  .collect(Collectors.toList());
+  if (!rollbackInstants.isEmpty()) {
+Preconditions.checkArgument(rollbackInstants.size() == 1);
+List stats = 
doRollbackAndGetStats(rollbackInstants.get(0));
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350889000
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieCleanClient.java
 ##
 @@ -92,7 +92,7 @@ protected HoodieCleanMetadata clean(String startCleanTime) 
throws HoodieIOExcept
   if ((cleanerPlan.getFilesToBeDeletedPerPartition() != null)
   && !cleanerPlan.getFilesToBeDeletedPerPartition().isEmpty()) {
 final HoodieTable hoodieTable = 
HoodieTable.getHoodieTable(createMetaClient(true), config, jsc);
-return runClean(hoodieTable, startCleanTime);
+return runClean(hoodieTable, 
HoodieTimeline.getCleanRequestedInstant(startCleanTime), cleanerPlan);
 
 Review comment:
   No, it only creates a Requested instant. The loop earlier in the same 
function will take care of inflight clean instants


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351019450
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
 ##
 @@ -125,6 +133,13 @@ public HoodieTableType getTableType() {
 return DEFAULT_TABLE_TYPE;
   }
 
+  public HoodieMetadataVersion getTableVersion() {
+if (props.containsKey(HOODIE_TABLE_VERSION_PROP_NAME)) {
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351017850
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
 ##
 @@ -52,12 +53,15 @@
   public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type";
   public static final String HOODIE_RO_FILE_FORMAT_PROP_NAME = 
"hoodie.table.ro.file.format";
   public static final String HOODIE_RT_FILE_FORMAT_PROP_NAME = 
"hoodie.table.rt.file.format";
+  public static final String HOODIE_TABLE_VERSION_PROP_NAME = 
"hoodie.table.version";
   public static final String HOODIE_PAYLOAD_CLASS_PROP_NAME = 
"hoodie.compaction.payload.class";
   public static final String HOODIE_ARCHIVELOG_FOLDER_PROP_NAME = 
"hoodie.archivelog.folder";
 
   public static final HoodieTableType DEFAULT_TABLE_TYPE = 
HoodieTableType.COPY_ON_WRITE;
   public static final HoodieFileFormat DEFAULT_RO_FILE_FORMAT = 
HoodieFileFormat.PARQUET;
   public static final HoodieFileFormat DEFAULT_RT_FILE_FORMAT = 
HoodieFileFormat.HOODIE_LOG;
+  public static final Integer DEFAULT_TABLE_VERSION = 
HoodieMetadataVersion.VERSION_0;
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351017736
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
 ##
 @@ -52,12 +53,15 @@
   public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type";
   public static final String HOODIE_RO_FILE_FORMAT_PROP_NAME = 
"hoodie.table.ro.file.format";
   public static final String HOODIE_RT_FILE_FORMAT_PROP_NAME = 
"hoodie.table.rt.file.format";
+  public static final String HOODIE_TABLE_VERSION_PROP_NAME = 
"hoodie.table.version";
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r354211539
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
 ##
 @@ -92,15 +91,15 @@ public HoodieTimeline filterInflightsAndRequested() {
   }
 
   @Override
-  public HoodieTimeline filterInflightsExcludingCompaction() {
+  public HoodieTimeline filterPendingExcludingCompaction() {
 return new HoodieDefaultTimeline(instants.stream().filter(instant -> {
-  return instant.isInflight() && 
(!instant.getAction().equals(HoodieTimeline.COMPACTION_ACTION));
+  return (!instant.isCompleted()) && 
(!instant.getAction().equals(HoodieTimeline.COMPACTION_ACTION));
 
 Review comment:
   Yes, that was the intention. One of the places where we are using is rolling 
back pending commits


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351012476
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -169,7 +180,14 @@ public boolean archiveIfRequired(final JavaSparkContext 
jsc) throws IOException
   }).limit(commitTimeline.countInstants() - minCommitsToKeep));
 }
 
-return instants;
+// For archiving and cleaning instants, we need to include intermediate 
state files if they exist
+HoodieActiveTimeline rawActiveTimeline = new 
HoodieActiveTimeline(metaClient, false);
+Map, List> groupByTsAction = 
rawActiveTimeline.getInstants()
+.collect(Collectors.groupingBy(x -> Pair.of(x.getTimestamp(),
+x.getAction().equals(HoodieTimeline.COMPACTION_ACTION) ? 
HoodieTimeline.COMMIT_ACTION : x.getAction(;
+
+return instants.flatMap(hoodieInstant ->
 
 Review comment:
   we would want to archive all types of actions eventually. So, keeping it 
this way.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351012174
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -169,7 +180,14 @@ public boolean archiveIfRequired(final JavaSparkContext 
jsc) throws IOException
   }).limit(commitTimeline.countInstants() - minCommitsToKeep));
 }
 
-return instants;
+// For archiving and cleaning instants, we need to include intermediate 
state files if they exist
+HoodieActiveTimeline rawActiveTimeline = new 
HoodieActiveTimeline(metaClient, false);
+Map, List> groupByTsAction = 
rawActiveTimeline.getInstants()
+.collect(Collectors.groupingBy(x -> Pair.of(x.getTimestamp(),
+x.getAction().equals(HoodieTimeline.COMPACTION_ACTION) ? 
HoodieTimeline.COMMIT_ACTION : x.getAction(;
 
 Review comment:
   refactored a bit to be used here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350895583
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -169,7 +180,14 @@ public boolean archiveIfRequired(final JavaSparkContext 
jsc) throws IOException
   }).limit(commitTimeline.countInstants() - minCommitsToKeep));
 }
 
-return instants;
+// For archiving and cleaning instants, we need to include intermediate 
state files if they exist
+HoodieActiveTimeline rawActiveTimeline = new 
HoodieActiveTimeline(metaClient, false);
+Map, List> groupByTsAction = 
rawActiveTimeline.getInstants()
+.collect(Collectors.groupingBy(x -> Pair.of(x.getTimestamp(),
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r351015157
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/index/TestHbaseIndex.java
 ##
 @@ -168,13 +169,19 @@ public void testTagLocationAndDuplicateUpdate() throws 
Exception {
 HoodieWriteConfig config = getConfig();
 HBaseIndex index = new HBaseIndex(config);
 HoodieWriteClient writeClient = new HoodieWriteClient(jsc, config);
-writeClient.startCommit();
+writeClient.startCommitWithTime(newCommitTime);
 metaClient = HoodieTableMetaClient.reload(metaClient);
 HoodieTable hoodieTable = HoodieTable.getHoodieTable(metaClient, config, 
jsc);
 
 JavaRDD writeStatues = writeClient.upsert(writeRecords, 
newCommitTime);
 JavaRDD javaRDD1 = index.tagLocation(writeRecords, jsc, 
hoodieTable);
+
 // Duplicate upsert and ensure correctness is maintained
+// We are trying to approximately imitate the case when the RRD is 
recomputed. For RRD creating, driver code is not
 
 Review comment:
   Thanks @vinothchandar  Linkedin days :) 
   
   @n3nash : Sure, We can discuss this f2f
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350888183
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CompactionCommand.java
 ##
 @@ -96,14 +96,15 @@ public String compactionsAll(
   if (!instant.getAction().equals(HoodieTimeline.COMPACTION_ACTION)) {
 try {
   // This could be a completed compaction. Assume a compaction request 
file is present but skip if fails
-  workload = AvroUtils.deserializeCompactionPlan(activeTimeline
-  
.getInstantAuxiliaryDetails(HoodieTimeline.getCompactionRequestedInstant(instant.getTimestamp())).get());
+  workload = AvroUtils.deserializeCompactionPlan(
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1009: [HUDI-308] Avoid Renames for tracking state transitions of all actions on dataset

2019-12-05 Thread GitBox
bvaradar commented on a change in pull request #1009:  [HUDI-308] Avoid Renames 
for tracking state transitions of all actions on dataset
URL: https://github.com/apache/incubator-hudi/pull/1009#discussion_r350892589
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -824,14 +843,14 @@ private String startInstant() {
   "Found commits after time :" + lastCommit + ", please rollback 
greater commits first");
 }
 
-List inflights =
-
inflightCommitTimeline.getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList());
+List inflights = 
inflightAndRequestedCommitTimeline.getInstants().map(HoodieInstant::getTimestamp)
 
 Review comment:
   Yes, For requested instant, the rollback is essentially essentially removing 
the instant file as no side-effect has happened


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services