[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r260172472 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5225,7 +5226,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des for (String dbName : Utils.matchesDb(hive, dbNameOrPattern)) { Database database = hive.getMSC().getDatabase(dbName); parameters = database.getParameters(); -if (ReplUtils.isFirstIncPending(parameters)) { +String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null; +if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) { Review comment: changed to ReplRemoveFirstIncLoadPendFlagDesc ..now just remove if the property is set This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r260134323 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5216,7 +5216,8 @@ private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc des for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) { org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName); parameters = tbl.getParameters(); -if (ReplUtils.isFirstIncPending(parameters)) { +String incPendPara = parameters != null ? parameters.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG) : null; +if (incPendPara != null && (!flag.equalsIgnoreCase(incPendPara))) { Review comment: no ..as per current code ..we change this value only we are going to change it to other one ..else no need to change ..as this is called only after incremental is done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259820997 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Task preCursor) { HashMap mapProp = new HashMap<>(); -mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); -AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState)); +// if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check. +ReplicationSpec replicationSpec = null; +if (incLoadPendFlag == null) { + mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); + replicationSpec = new ReplicationSpec(replState, replState); +} else { + assert replState == null; + mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag); Review comment: done ..added check same as ckpt flag This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259804317 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) { return retval; } + private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259803151 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1147,6 +1145,12 @@ private static void createReplImportTasks( if (!waitOnPrecursor){ throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName())); } + // For warehouse level replication, if the database itself is getting created in this load, then no need to + // check for duplicate copy. Check HIVE-21197 for more detail. + firstIncPending = false; +} else { + // For database replication, get the flag from database parameter. Check HIVE-21197 for more detail. + firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters()); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259803182 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1164,6 +1168,9 @@ private static void createReplImportTasks( if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) { dropTblTask = dropTableTask(table, x, replicationSpec); table = null; + } else if (!firstIncPending) { +// For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail. +firstIncPending = ReplUtils.isFirstIncPending(table.getParameters()); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259803223 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -48,13 +48,15 @@ private final DatabaseEvent event; private final String dbNameToLoadIn; + private final boolean isTableLevelLoad; - public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, + public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn, Review comment: yes This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259802388 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1164,6 +1168,9 @@ private static void createReplImportTasks( if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) { dropTblTask = dropTableTask(table, x, replicationSpec); table = null; + } else if (!firstIncPending) { +// For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail. +firstIncPending = ReplUtils.isFirstIncPending(table.getParameters()); Review comment: changed the logic to avoid duplicate check in case number of base directory is more than one This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259802015 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -5199,6 +5205,35 @@ public static boolean doesTableNeedLocation(Table tbl) { return retval; } + private int updateFirstIncPendingFlag(Hive hive, ReplSetFirstIncLoadFlagDesc desc) throws HiveException, TException { +String dbNameOrPattern = desc.getDatabaseName(); +String tableNameOrPattern = desc.getTableName(); +String flag = desc.getIncLoadPendingFlag() ? "true" : "false"; +Map parameters; +// For database level load tableNameOrPattern will be null. Flag is set only in database for db level load. +if (tableNameOrPattern != null && !tableNameOrPattern.isEmpty()) { + // For table level load, dbNameOrPattern is db name and not a pattern. + for (String tableName : Utils.matchesTbl(hive, dbNameOrPattern, tableNameOrPattern)) { +org.apache.hadoop.hive.metastore.api.Table tbl = hive.getMSC().getTable(dbNameOrPattern, tableName); +parameters = tbl.getParameters(); +if (ReplUtils.isFirstIncPending(parameters)) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259800518 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java ## @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) { LOG.debug( "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri()); + Table table; + try { +table = hiveDb.getTable(dbName, tblName); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259800145 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259800197 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncPending(Map parameter) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259799886 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + boolean incLoadPendingFlag; + + /** + * For serialization only. + */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) { +super(); +this.databaseName = databaseName; +this.tableName = tableName; +this.incLoadPendingFlag = incLoadPendingFlag; + } + + @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getDatabaseName() { +return databaseName; + } + + public void setDatabaseName(String databaseName) { +this.databaseName = databaseName; + } + + @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getTableName() { +return tableName; + } + + public void setTableName(String tableName) { +this.tableName = tableName; + } + + @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) Review comment: no change This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259799813 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259778574 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java ## @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) { LOG.debug( "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri()); + Table table; + try { +table = hiveDb.getTable(dbName, tblName); Review comment: "These table level checks can be done only if work.tableNameOrPattern is valid." with this change ..the call will be only once This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259778574 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java ## @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) { LOG.debug( "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri()); + Table table; + try { +table = hiveDb.getTable(dbName, tblName); Review comment: "We invoke getTable twice. Can we reuse tableTuple that we get below inside try block." with this change ..the call will be only once This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259759960 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,273 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { +HiveConf conf = new HiveConf(TestReplicationWithTableMigrationEx.class); +conf.set("dfs.client.use.datanode.hostname", "true"); +conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + ".hosts", "*"); +MiniDFSCluster miniDFSCluster = +new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build(); +final DistributedFileSystem fs = miniDFSCluster.getFileSystem(); +HashMap hiveConfigs = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.support.concurrency", "true"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager"); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.strict.managed.tables", "true"); + put("hive.metastore.transactional.event.listeners", ""); +}}; +replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs); + +HashMap configsForPrimary = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.support.concurrency", "false"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager"); + put("hive.strict.managed.tables", "false"); +}}; +configsForPrimary.putAll(overrideConfigs); +primary = new WarehouseInstance(LOG, miniDFSCluster,
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259759565 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplDumpTask.java ## @@ -274,6 +283,19 @@ Long bootStrapDump(Path dumpRoot, DumpMetaData dmd, Path cmRoot, Hive hiveDb) th for (String tblName : Utils.matchesTbl(hiveDb, dbName, work.tableNameOrPattern)) { LOG.debug( "analyzeReplDump dumping table: " + tblName + " to db root " + dbRoot.toUri()); + Table table; + try { +table = hiveDb.getTable(dbName, tblName); Review comment: anyways just extra check ..so does not impact perf This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259758817 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,14 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncPending(Map parameter) { Review comment: parameter is generic enough ..i think keeping it is not an issue This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259758025 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259757727 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: i think it wont make much diff This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259757650 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + boolean incLoadPendingFlag; + + /** + * For serialization only. + */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName, boolean incLoadPendingFlag) { +super(); +this.databaseName = databaseName; +this.tableName = tableName; +this.incLoadPendingFlag = incLoadPendingFlag; + } + + @Explain(displayName="db_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getDatabaseName() { +return databaseName; + } + + public void setDatabaseName(String databaseName) { +this.databaseName = databaseName; + } + + @Explain(displayName="table_name", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) + public String getTableName() { +return tableName; + } + + public void setTableName(String tableName) { +this.tableName = tableName; + } + + @Explain(displayName="inc load pending flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) Review comment: can be used for explain repl load ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259756959 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1164,6 +1168,9 @@ private static void createReplImportTasks( if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) { dropTblTask = dropTableTask(table, x, replicationSpec); table = null; + } else if (!firstIncPending) { +// For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail. +firstIncPending = ReplUtils.isFirstIncPending(table.getParameters()); Review comment: table level replication ...create table event should not come This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259756728 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1164,6 +1168,9 @@ private static void createReplImportTasks( if (x.getEventType() == DumpType.EVENT_CREATE_TABLE) { dropTblTask = dropTableTask(table, x, replicationSpec); table = null; + } else if (!firstIncPending) { +// For table level replication, get the flag from table parameter. Check HIVE-21197 for more detail. +firstIncPending = ReplUtils.isFirstIncPending(table.getParameters()); Review comment: !firstIncPending will make sure that its obtained for table level only This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259756217 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -48,13 +48,15 @@ private final DatabaseEvent event; private final String dbNameToLoadIn; + private final boolean isTableLevelLoad; - public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, + public LoadDatabase(Context context, DatabaseEvent event, String dbNameToLoadIn, String tblNameToLoadIn, Review comment: its causing null pointer exception if db load is not done. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259756442 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1147,6 +1145,12 @@ private static void createReplImportTasks( if (!waitOnPrecursor){ throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName())); } + // For warehouse level replication, if the database itself is getting created in this load, then no need to + // check for duplicate copy. Check HIVE-21197 for more detail. + firstIncPending = false; +} else { + // For database replication, get the flag from database parameter. Check HIVE-21197 for more detail. + firstIncPending = ReplUtils.isFirstIncPending(parentDb.getParameters()); Review comment: anyways it parent db it will not be set ..so automatically it will fetch from table This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622991 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java ## @@ -112,6 +118,12 @@ public void run() { continue; } + if (replIsCompactionDisabledForTable(t)) { Review comment: not done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622969 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) { if (work.getAlterMaterializedViewDesc() != null) { return alterMaterializedView(db, work.getAlterMaterializedViewDesc()); } + + if (work.getReplSetFirstIncLoadFlagDesc() != null) { Review comment: its done in a separate task to make it simpler This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622407 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -164,6 +165,12 @@ public IncrementalLoadTasksBuilder(String dbName, String tableName, String loadP lastEventid); } } + + ReplSetFirstIncLoadFlagDesc desc = new ReplSetFirstIncLoadFlagDesc(dbName, tableName); Review comment: new ddl task is created to make it simpler with table level and warehouse level support This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622413 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622357 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -135,7 +135,8 @@ private boolean isDbEmpty(String dbName) throws HiveException { } private Task alterDbTask(Database dbObj) { -return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory), context.hiveConf); +return alterDbTask(dbObj.getName(), updateDbProps(dbObj, context.dumpDirectory, false), Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622274 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622236 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622204 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: table level is taken care of now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622210 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCompactor.java ## @@ -1536,6 +1536,62 @@ public void testCompactionInfoHashCode() { Assert.assertEquals("The hash codes must be equal", compactionInfo.hashCode(), compactionInfo1.hashCode()); } + @Test + public void testDisableCompactionDuringReplLoad() throws Exception { +String tblName = "discomp"; +String database = "discomp_db"; +executeStatementOnDriver("drop database if exists " + database + " cascade", driver); +executeStatementOnDriver("create database " + database, driver); +executeStatementOnDriver("CREATE TABLE " + database + "." + tblName + "(a INT, b STRING) " + +" PARTITIONED BY(ds string)" + +" CLUSTERED BY(a) INTO 2 BUCKETS" + //currently ACID requires table to be bucketed +" STORED AS ORC TBLPROPERTIES ('transactional'='true')", driver); +executeStatementOnDriver("insert into " + database + "." + tblName + " partition (ds) values (1, 'fred', " + +"'today'), (2, 'wilma', 'yesterday')", driver); + +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'true')", driver); +List compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'true')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + +" SET TBLPROPERTIES ( 'hive.repl.first.inc.pending' = 'false')", driver); +compacts = getCompactionList(); +Assert.assertEquals(0, compacts.size()); + +executeStatementOnDriver("alter database " + database + +" set dbproperties ('hive.repl.first.inc.pending' = 'false')", driver); +executeStatementOnDriver("ALTER TABLE " + database + "." + tblName + Review comment: table level is taken care of now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622190 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +299,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r25960 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,16 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259622183 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java ## @@ -370,6 +370,9 @@ private int executeIncrementalLoad(DriverContext driverContext) { // If incremental events are already applied, then check and perform if need to bootstrap any tables. if (!builder.hasMoreWork() && !work.getPathsToCopyIterator().hasNext()) { +// No need to set incremental load pending flag for external tables as the files will be copied to the same path Review comment: todo not required as table level load is taken care now This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259621882 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Review comment: the code is removed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259621844 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/incremental/IncrementalLoadTasksBuilder.java ## @@ -289,12 +296,21 @@ private boolean shouldReplayEvent(FileStatus dir, DumpType dumpType, String dbNa return updateReplIdTask; } - private Task dbUpdateReplStateTask(String dbName, String replState, + private Task dbUpdateReplStateTask(String dbName, String replState, String incLoadPendFlag, Task preCursor) { HashMap mapProp = new HashMap<>(); -mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); -AlterDatabaseDesc alterDbDesc = new AlterDatabaseDesc(dbName, mapProp, new ReplicationSpec(replState, replState)); +// if the update is for incLoadPendFlag, then send replicationSpec as null to avoid replacement check. +ReplicationSpec replicationSpec = null; +if (incLoadPendFlag == null) { + mapProp.put(ReplicationSpec.KEY.CURR_STATE_ID.toString(), replState); + replicationSpec = new ReplicationSpec(replState, replState); +} else { + assert replState == null; + mapProp.put(ReplUtils.REPL_FIRST_INC_PENDING_FLAG, incLoadPendFlag); Review comment: done. Dump will fail if the inc pending flag is set to true This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259618838 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259618842 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259617900 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,21 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + if (dstFs.exists(filePath)) { Review comment: the i/o exception retry case is handled specifically at 2 places only. there are many other i/o failure scenarios which are not handled. I think its not required here. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606409 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +192,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); +return compFlag == null || compFlag.isEmpty() || "false".equalsIgnoreCase(compFlag); Review comment: i think better to have this check as for this case null and empty has special meaning This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606337 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationEx.java ## @@ -0,0 +1,213 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigrationEx - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationEx { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationEx.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { Review comment: no This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259606331 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ReplicationSpec.java ## @@ -426,4 +427,14 @@ public static void copyLastReplId(Map srcParameter, Map
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259310998 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + + /** + * For serialization only. + */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259314021 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -133,8 +161,11 @@ protected int execute(DriverContext driverContext) { return 6; } long writeId = Long.parseLong(writeIdString); - toPath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(work.getDeleteDestIfExist(), writeId, writeId, - driverContext.getCtx().getHiveTxnManager().getStmtIdAndIncrement())); + // Set stmt id 0 for bootstrap load as the directory needs to be searched during incremental load to avoid any + // duplicate copy from the source. Check HIVE-21197 for more detail. + int stmtId = (writeId == ReplUtils.REPL_BOOTSTRAP_MIGRATION_BASE_WRITE_ID) ? 0 : Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259312676 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/bootstrap/load/LoadDatabase.java ## @@ -158,6 +159,15 @@ private boolean isDbEmpty(String dbName) throws HiveException { // Add the checkpoint key to the Database binding it to current dump directory. // So, if retry using same dump, we shall skip Database object update. parameters.put(ReplUtils.REPL_CHECKPOINT_KEY, dumpDirectory); + +if (needSetIncFlag) { Review comment: in case of alter case need not set the flag .. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259312460 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/MetaStoreCompactorThread.java ## @@ -71,6 +74,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { +try { + Database database = rs.getDatabase(getDefaultCatalog(conf), dbName); + if (database != null) { +return !ReplUtils.isFirstIncDone(database.getParameters()); + } + LOG.info("Unable to find database " + dbName); Review comment: same as other method in compactor thread This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259310998 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplSetFirstIncLoadFlagDesc.java ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.plan; +import org.apache.hadoop.hive.ql.plan.Explain.Level; + +import java.io.Serializable; + +/** + * ReplSetFirstIncLoadFlagDesc. + * + */ +@Explain(displayName = "Set First Incr Load Flag", explainLevels = { Level.USER, Level.DEFAULT, Level.EXTENDED }) +public class ReplSetFirstIncLoadFlagDesc extends DDLDesc implements Serializable { + + private static final long serialVersionUID = 1L; + String databaseName; + String tableName; + + /** + * For serialization only. + */ + public ReplSetFirstIncLoadFlagDesc() { + } + + public ReplSetFirstIncLoadFlagDesc(String databaseName, String tableName) { Review comment: as its just setting it to false always ..changed the name of the method This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259310718 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1147,6 +1145,12 @@ private static void createReplImportTasks( if (!waitOnPrecursor){ throw new SemanticException(ErrorMsg.DATABASE_NOT_EXISTS.getMsg(tblDesc.getDatabaseName())); } + // For warehouse level replication, if the database itself is getting created in this load, then no need to + // check for duplicate copy. Check HIVE-21197 for more detail. + firstIncDone = true; Review comment: if check was already thread and ternary will not give any perf advantage This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259310336 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java ## @@ -1136,6 +1133,7 @@ private static void createReplImportTasks( Task dropTblTask = null; WriteEntity.WriteType lockType = WriteEntity.WriteType.DDL_NO_LOCK; +Boolean firstIncDone; Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259310189 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -187,4 +188,12 @@ public static PathFilter getEventsDirectoryFilter(final FileSystem fs) { } }; } + + public static boolean isFirstIncDone(Map parameter) { +if (parameter == null) { + return true; +} +String compFlag = parameter.get(ReplUtils.REPL_FIRST_INC_PENDING_FLAG); Review comment: its from map and not from conf . Not done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259309835 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -271,12 +302,13 @@ public String getName() { LOG.debug("ReplCopyTask:getLoadCopyTask: {}=>{}", srcPath, dstPath); if ((replicationSpec != null) && replicationSpec.isInReplicationScope()){ ReplCopyWork rcwork = new ReplCopyWork(srcPath, dstPath, false); - if (replicationSpec.isReplace() && conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION)) { + if (replicationSpec.isReplace() && (conf.getBoolVar(REPL_ENABLE_MOVE_OPTIMIZATION) || copyToMigratedTxnTable)) { rcwork.setDeleteDestIfExist(true); rcwork.setAutoPurge(isAutoPurge); rcwork.setNeedRecycle(needRecycle); } rcwork.setCopyToMigratedTxnTable(copyToMigratedTxnTable); + rcwork.setCheckDuplicateCopy(replicationSpec.needDupCopyCheck()); Review comment: not done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259309699 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -61,6 +62,25 @@ public ReplCopyTask(){ super(); } + // If file is already present in base directory, then remove it from the list. + // Check HIVE-21197 for more detail + private void updateSrcFileListForDupCopy(FileSystem dstFs, Path toPath, List srcFiles, + long writeId, int stmtId) throws IOException { +ListIterator iter = srcFiles.listIterator(); +Path basePath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(true, writeId, writeId, stmtId)); +while (iter.hasNext()) { + Path filePath = new Path(basePath, iter.next().getSourcePath().getName()); + try { +if (dstFs.getFileStatus(filePath) != null) { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259309300 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationMisc { + @Rule + public final TestName testName = new TestName(); + + protected static final Logger LOG = LoggerFactory.getLogger(TestReplicationWithTableMigrationMisc.class); + private static WarehouseInstance primary, replica; + private String primaryDbName, replicatedDbName; + + @BeforeClass + public static void classLevelSetup() throws Exception { +HashMap overrideProperties = new HashMap<>(); +internalBeforeClassSetup(overrideProperties); + } + + static void internalBeforeClassSetup(Map overrideConfigs) throws Exception { +HiveConf conf = new HiveConf(TestReplicationWithTableMigrationMisc.class); +conf.set("dfs.client.use.datanode.hostname", "true"); +conf.set("hadoop.proxyuser." + Utils.getUGI().getShortUserName() + ".hosts", "*"); +MiniDFSCluster miniDFSCluster = +new MiniDFSCluster.Builder(conf).numDataNodes(1).format(true).build(); +final DistributedFileSystem fs = miniDFSCluster.getFileSystem(); +HashMap hiveConfigs = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.support.concurrency", "true"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager"); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.strict.managed.tables", "true"); + put("hive.metastore.transactional.event.listeners", ""); +}}; +replica = new WarehouseInstance(LOG, miniDFSCluster, hiveConfigs); + +HashMap configsForPrimary = new HashMap() {{ + put("fs.defaultFS", fs.getUri().toString()); + put("hive.metastore.client.capability.check", "false"); + put("hive.repl.bootstrap.dump.open.txn.timeout", "1s"); + put("hive.exec.dynamic.partition.mode", "nonstrict"); + put("hive.strict.checks.bucketing", "false"); + put("hive.mapred.mode", "nonstrict"); + put("mapred.input.dir.recursive", "true"); + put("hive.metastore.disallow.incompatible.col.type.changes", "false"); + put("hive.support.concurrency", "false"); + put("hive.txn.manager", "org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager"); + put("hive.strict.managed.tables", "false"); +}}; +configsForPrimary.putAll(overrideConfigs); +primary = new
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259308757 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationMisc { Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259308804 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; Review comment: not done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259308377 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosAcrossInstances.java ## @@ -610,6 +610,11 @@ public void testIncrementalDumpOfWarehouse() throws Throwable { .run("show tables") .verifyResults(new String[] { "t1" }); +//default db is already created at replica before load. So the incr flag should not be set. + assertTrue(ReplUtils.isFirstIncDone(replica.getDatabase("default").getParameters())); + assertTrue(!ReplUtils.isFirstIncDone(replica.getDatabase(primaryDbName).getParameters())); Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259283514 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -60,6 +60,7 @@ public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id"; public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key"; + public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending"; Review comment: add a TODO for external bootstrap during inc load This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259283120 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -60,6 +60,7 @@ public static final String LAST_REPL_ID_KEY = "hive.repl.last.repl.id"; public static final String REPL_CHECKPOINT_KEY = "hive.repl.ckpt.key"; + public static final String REPL_FIRST_INC_PENDING_FLAG = "hive.repl.first.inc.pending"; Review comment: add a comment to explain This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259281934 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/RemoteCompactorThread.java ## @@ -53,6 +56,20 @@ public void init(AtomicBoolean stop, AtomicBoolean looped) throws Exception { } } + @Override boolean replIsCompactionDisabledForDatabase(String dbName) throws TException { Review comment: all methods are duplicated so kept it same This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259280119 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ReplCopyTask.java ## @@ -133,8 +161,11 @@ protected int execute(DriverContext driverContext) { return 6; } long writeId = Long.parseLong(writeIdString); - toPath = new Path(toPath, AcidUtils.baseOrDeltaSubdir(work.getDeleteDestIfExist(), writeId, writeId, - driverContext.getCtx().getHiveTxnManager().getStmtIdAndIncrement())); + // Set stmt id 0 for bootstrap load as the directory needs to be searched during incremental load to avoid any + // duplicate copy from the source. Check HIVE-21197 for more detail. + int stmtId = (writeId == ReplUtils.REPL_BOOTSTRAP_MIGRATION_BASE_WRITE_ID) ? 0 : Review comment: dont use magic number 0 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259278076 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java ## @@ -661,6 +663,10 @@ public int execute(DriverContext driverContext) { if (work.getAlterMaterializedViewDesc() != null) { return alterMaterializedView(db, work.getAlterMaterializedViewDesc()); } + + if (work.getReplSetFirstIncLoadFlagDesc() != null) { Review comment: in repl inc load task ..get the database list and alter the property ..remove table level property This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled
maheshk114 commented on a change in pull request #541: HIVE-21197 : Hive Replication can add duplicate data during migration to a target with hive.strict.managed.tables enabled URL: https://github.com/apache/hive/pull/541#discussion_r259274648 ## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationWithTableMigrationMisc.java ## @@ -0,0 +1,233 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.hadoop.hive.ql.parse; + +import org.apache.hadoop.hdfs.DistributedFileSystem; +import org.apache.hadoop.hdfs.MiniDFSCluster; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore; +import org.apache.hadoop.hive.metastore.InjectableBehaviourObjectStore.BehaviourInjection; +import org.apache.hadoop.hive.metastore.api.CurrentNotificationEventId; +import org.apache.hadoop.hive.ql.exec.repl.util.ReplUtils; +import org.apache.hadoop.hive.shims.Utils; +import org.junit.*; +import org.junit.rules.TestName; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import javax.annotation.Nullable; +import java.io.IOException; +import java.util.*; + +import static org.apache.hadoop.hive.metastore.ReplChangeManager.SOURCE_OF_REPLICATION; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isFullAcidTable; +import static org.apache.hadoop.hive.ql.io.AcidUtils.isTransactionalTable; +import static org.junit.Assert.*; + +/** + * TestReplicationWithTableMigration - test replication for Hive2 to Hive3 (Strict managed tables) + */ +public class TestReplicationWithTableMigrationMisc { Review comment: change it to Ex This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services