[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814876

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -279,6 +292,72 @@ a database ( directory )
     return 0;
   }

+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);
+    }
+
+    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
+    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+          + bootstrapDirectory);
+    }
+
+    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
+      throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
+          + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+    }
+
+    for (FileStatus dbDir : fileStatuses) {
+      Path dbLevelPath = dbDir.getPath();
+      String dbNameInDump = dbLevelPath.getName();
+
+      List<String> tableNames = new ArrayList<>();
+      RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+      while (filesIterator.hasNext()) {
+        Path nextFile = filesIterator.next().getPath();
+        String filePath = nextFile.toString();
+        if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+          // Remove dbLevelPath from the current path to check if this _metadata file is under DB or
+          // table level directory.
+          String replacedString = filePath.replace(dbLevelPath.toString(), "");
+          if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+            tableNames.add(nextFile.getParent().getName());

Review comment:
Done. As discussed, DatabaseEventsIterator will be used to fetch table names. This is inefficient, but it shouldn't be a problem as this flow is very rare. Btw, DatabaseEventsIterator will continue to assume that any directory under the DB-level dump dir that has a _metadata file is a table dump. So, this patch won't address that and continues to assume it won't change in future.
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814921

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -298,25 +298,26 @@ a database ( directory )
    * @throws IOException File operations failure.
    * @throws InvalidInputException Invalid input dump directory.
    */
-  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
-    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
    FileSystem fs = bootstrapDirectory.getFileSystem(conf);

    if (!fs.exists(bootstrapDirectory)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is invalid: "

Review comment:
Done
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262815016

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -298,25 +298,26 @@ a database ( directory )
    * @throws IOException File operations failure.
    * @throws InvalidInputException Invalid input dump directory.
    */
-  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
-    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
    FileSystem fs = bootstrapDirectory.getFileSystem(conf);

    if (!fs.exists(bootstrapDirectory)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is invalid: "
          + bootstrapDirectory);
    }

    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is empty: "
          + bootstrapDirectory);
    }

    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
-     throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
-         + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables has multiple"
+         + " DB dirs in the dump: " + bootstrapDirectory
+         + " which is not allowed on single target DB: " + work.dbNameToLoadIn);

Review comment:
Done
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814876

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -279,6 +292,72 @@ a database ( directory )
     return 0;
   }

+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);
+    }
+
+    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
+    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+          + bootstrapDirectory);
+    }
+
+    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
+      throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
+          + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+    }
+
+    for (FileStatus dbDir : fileStatuses) {
+      Path dbLevelPath = dbDir.getPath();
+      String dbNameInDump = dbLevelPath.getName();
+
+      List<String> tableNames = new ArrayList<>();
+      RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+      while (filesIterator.hasNext()) {
+        Path nextFile = filesIterator.next().getPath();
+        String filePath = nextFile.toString();
+        if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+          // Remove dbLevelPath from the current path to check if this _metadata file is under DB or
+          // table level directory.
+          String replacedString = filePath.replace(dbLevelPath.toString(), "");
+          if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+            tableNames.add(nextFile.getParent().getName());

Review comment:
Done
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814955

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java

## @@ -573,23 +577,32 @@ public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
         .run("create table t5 as select * from t4")
         .dump(primaryDbName, tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);

-   // Verify if bootstrapping with same dump is idempotent and return same result
-   for (int i = 0; i < 2; i++) {
-     replica.load(replicatedDbName, tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
-         .status(replicatedDbName)
-         .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
-         .run("use " + replicatedDbName)
-         .run("show tables like 't1'")
-         .verifyFailure(new String[]{"t1"})
-         .run("select place from t2 where country = 'us'")
-         .verifyResult("austin")
-         .run("select id from t4")
-         .verifyResult("10")
-         .run("select id from t5")
-         .verifyResult("10");
+   // Fail setting ckpt property for table t4 but success for t2.
+   BehaviourInjection<CallerArguments, Boolean> callerVerifier
+       = new BehaviourInjection<CallerArguments, Boolean>() {
+     @Nullable
+     @Override
+     public Boolean apply(@Nullable CallerArguments args) {
+       if (args.tblName.equalsIgnoreCase("t4") && args.dbName.equalsIgnoreCase(replicatedDbName)) {
+         injectionPathCalled = true;
+         LOG.warn("Verifier - DB : " + args.dbName + " TABLE : " + args.tblName);
+         return false;
+       }
+       return true;
+     }
+   };
+
+   // Fail repl load before the ckpt property is set for t4 and after it is set for t2.
+   // In the retry, these half baked tables should be dropped and bootstrap should ve successful.

Review comment:
Done
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262804445

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -279,6 +292,72 @@ a database ( directory )
     return 0;
   }

+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);
+    }
+
+    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
+    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+          + bootstrapDirectory);
+    }
+
+    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
+      throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
+          + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+    }
+
+    for (FileStatus dbDir : fileStatuses) {
+      Path dbLevelPath = dbDir.getPath();
+      String dbNameInDump = dbLevelPath.getName();
+
+      List<String> tableNames = new ArrayList<>();
+      RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+      while (filesIterator.hasNext()) {
+        Path nextFile = filesIterator.next().getPath();
+        String filePath = nextFile.toString();
+        if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+          // Remove dbLevelPath from the current path to check if this _metadata file is under DB or
+          // table level directory.
+          String replacedString = filePath.replace(dbLevelPath.toString(), "");
+          if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+            tableNames.add(nextFile.getParent().getName());

Review comment:
We don't need a new iterator. DatabaseEventsIterator is already doing this job for us. It also returns different file system events based on the kind of metadata file it encounters. So we can use the type of events to separate the table ones from non-table ones. Can we use it?
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801299

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -298,25 +298,26 @@ a database ( directory )
    * @throws IOException File operations failure.
    * @throws InvalidInputException Invalid input dump directory.
    */
-  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
-    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
    FileSystem fs = bootstrapDirectory.getFileSystem(conf);

    if (!fs.exists(bootstrapDirectory)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is invalid: "

Review comment:
Grammar suggestion: "Bootstrap dump directory specified to clean tables from is invalid"
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801475

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -298,25 +298,26 @@ a database ( directory )
    * @throws IOException File operations failure.
    * @throws InvalidInputException Invalid input dump directory.
    */
-  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
-    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
    FileSystem fs = bootstrapDirectory.getFileSystem(conf);

    if (!fs.exists(bootstrapDirectory)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is invalid: "
          + bootstrapDirectory);
    }

    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is empty: "
          + bootstrapDirectory);
    }

    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
-     throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
-         + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables has multiple"
+         + " DB dirs in the dump: " + bootstrapDirectory
+         + " which is not allowed on single target DB: " + work.dbNameToLoadIn);

Review comment:
Similar grammar suggestion.
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262803005

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -279,6 +292,72 @@ a database ( directory )
     return 0;
   }

+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);
+    }
+
+    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
+    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+          + bootstrapDirectory);
+    }
+
+    if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 1)) {
+      throw new InvalidInputException("Multiple DB dirs in the dump: " + bootstrapDirectory
+          + " is not allowed to load to single target DB: " + work.dbNameToLoadIn);
+    }
+
+    for (FileStatus dbDir : fileStatuses) {
+      Path dbLevelPath = dbDir.getPath();
+      String dbNameInDump = dbLevelPath.getName();
+
+      List<String> tableNames = new ArrayList<>();
+      RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+      while (filesIterator.hasNext()) {
+        Path nextFile = filesIterator.next().getPath();
+        String filePath = nextFile.toString();
+        if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+          // Remove dbLevelPath from the current path to check if this _metadata file is under DB or
+          // table level directory.
+          String replacedString = filePath.replace(dbLevelPath.toString(), "");
+          if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+            tableNames.add(nextFile.getParent().getName());

Review comment:
Well, if this code is the only duplicate copy, it makes sense to do this refactoring while we are adding the duplicate code. That avoids duplicating the code in the first place. We might not change the format of the dump directory, but we might add a _metadata file to other object-specific directories. That would mean the code above mistakes objects other than tables for tables and tries to drop them.
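To make the concern concrete, here is a small self-contained sketch of how the suffix heuristic could misfire (the paths and the "_functions" layout are hypothetical, and the string handling is simplified relative to the patch):

```
public class MetadataHeuristicDemo {
  public static void main(String[] args) {
    String dbLevelPath = "/dump/db1";
    String[] metadataFiles = {
        "/dump/db1/_metadata",               // DB-level metadata: correctly skipped
        "/dump/db1/t1/_metadata",            // table-level: correctly treated as table t1
        "/dump/db1/_functions/f1/_metadata"  // hypothetical non-table object
    };
    for (String filePath : metadataFiles) {
      // Mirrors the reviewed check: strip the DB path and test whether the
      // _metadata file sits directly under the DB directory.
      String rest = filePath.replace(dbLevelPath, "");
      boolean treatedAsTable = !rest.equalsIgnoreCase("/_metadata");
      System.out.println(filePath + " -> treated as table? " + treatedAsTable);
    }
  }
}
```

The last path prints "treated as table? true", i.e. "f1" would be mistaken for a table name and dropped, which is exactly the failure mode described above.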
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262800301

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java

## @@ -573,23 +577,32 @@ public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
         .run("create table t5 as select * from t4")
         .dump(primaryDbName, tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);

-   // Verify if bootstrapping with same dump is idempotent and return same result
-   for (int i = 0; i < 2; i++) {
-     replica.load(replicatedDbName, tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
-         .status(replicatedDbName)
-         .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
-         .run("use " + replicatedDbName)
-         .run("show tables like 't1'")
-         .verifyFailure(new String[]{"t1"})
-         .run("select place from t2 where country = 'us'")
-         .verifyResult("austin")
-         .run("select id from t4")
-         .verifyResult("10")
-         .run("select id from t5")
-         .verifyResult("10");
+   // Fail setting ckpt property for table t4 but success for t2.
+   BehaviourInjection<CallerArguments, Boolean> callerVerifier
+       = new BehaviourInjection<CallerArguments, Boolean>() {
+     @Nullable
+     @Override
+     public Boolean apply(@Nullable CallerArguments args) {
+       if (args.tblName.equalsIgnoreCase("t4") && args.dbName.equalsIgnoreCase(replicatedDbName)) {
+         injectionPathCalled = true;
+         LOG.warn("Verifier - DB : " + args.dbName + " TABLE : " + args.tblName);
+         return false;
+       }
+       return true;
+     }
+   };
+
+   // Fail repl load before the ckpt property is set for t4 and after it is set for t2.
+   // In the retry, these half baked tables should be dropped and bootstrap should ve successful.

Review comment:
"should ve" - typo, should be "should be"
[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801366

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -298,25 +298,26 @@ a database ( directory )
    * @throws IOException File operations failure.
    * @throws InvalidInputException Invalid input dump directory.
    */
-  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
-    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
    FileSystem fs = bootstrapDirectory.getFileSystem(conf);

    if (!fs.exists(bootstrapDirectory)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is invalid: "
          + bootstrapDirectory);
    }

    FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, EximUtil.getDirectoryFilter(fs));
    if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-     throw new InvalidInputException("Input bootstrap dump directory to rollback is empty: "
+     throw new InvalidInputException("Input bootstrap dump directory to clean tables is empty: "

Review comment:
Grammar suggestion similar to above.
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800104

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java

## @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
         perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, "Calcite: Window fixing rule");
       }

+      perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);

Review comment:
Yes, that is what I meant. Maybe someone will pick it up :)
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800893

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java

## @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+    this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+    super(operand);
+  }
+
+  //~ Methods
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+    final HiveSortLimit sortLimit = call.rel(0);
+
+    return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());

Review comment:
The rules should be portable between planners, i.e., a rule should not be unwrapping a HepRelVertex. If you want to remain generic, a workaround is matching RelNode. However, I would try to make the rule matchers as restrictive as possible, since the more information we expose to the planner, the smarter decisions it can make, e.g., avoiding unnecessary triggering of rules. You do not need recursion. When the rule is executed, we should not have multiple consecutive Project operators in the plan. If there were, we would get rid of them using ProjectMerge and ProjectRemove rules.
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800104

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java

## @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
         perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, "Calcite: Window fixing rule");
       }

+      perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);

Review comment:
Yes, that is what I meant.
[GitHub] [hive] vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262792425

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java

## @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
         perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, "Calcite: Window fixing rule");
       }

+      perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);

Review comment:
@jcamachor I am not sure if I understand the point fully. In your example, ORDER BY is already being removed. Are you suggesting we open a JIRA to remove the extra aggregate?
[GitHub] [hive] vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262789953

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java

## @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+    this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+    super(operand);
+  }
+
+  //~ Methods
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+    final HiveSortLimit sortLimit = call.rel(0);
+
+    return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());

Review comment:
@jcamachor The rule uses a recursive method which could traverse the whole tree, so restricting this rule to a specific pattern does not really make sense (besides matching HiveSortLimit).
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262699879

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java

## @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+

Review comment:
Create static final INSTANCE(s) for the rule to avoid multiple instantiations.
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262715669

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java

## @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, RelOptSchema relOptSchema, SchemaPlu
         perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, "Calcite: Window fixing rule");
       }

+      perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);

Review comment:
I believe we may still compute aggregations that are not used in the plan output. Could include the following test to confirm:
```
explain cbo
SELECT COUNT(*) FROM t1 ORDER BY SUM(col), COUNT(*);
```
I believe this is not critical, but we can add the test and a note to the q file, and create a placeholder JIRA.
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262700363

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java

## @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+    this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+    super(operand);
+  }
+
+  //~ Methods
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+    final HiveSortLimit sortLimit = call.rel(0);
+
+    return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());

Review comment:
Instead of unwrapping HepRelVertex nodes using this method, the rule should have two variants, similar to other rules: one that matches SortLimit-Project-Aggregate and one that matches SortLimit-Aggregate.
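A minimal sketch of the two-variant shape described here, folding in the earlier suggestion to expose static final instances (the shared matches()/onMatch() bodies and the group-count check are illustrative assumptions, not code from the patch):

```
package org.apache.hadoop.hive.ql.optimizer.calcite.rules;

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.plan.RelOptRuleOperand;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveProject;
import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;

public class HiveSortLimitRemoveRule extends RelOptRule {

  // Variant 1: SortLimit <- Project <- Aggregate.
  public static final HiveSortLimitRemoveRule WITH_PROJECT =
      new HiveSortLimitRemoveRule(
          operand(HiveSortLimit.class,
              operand(HiveProject.class,
                  operand(HiveAggregate.class, any()))));

  // Variant 2: SortLimit directly on top of Aggregate.
  public static final HiveSortLimitRemoveRule WITHOUT_PROJECT =
      new HiveSortLimitRemoveRule(
          operand(HiveSortLimit.class,
              operand(HiveAggregate.class, any())));

  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
    super(operand);
  }

  @Override
  public boolean matches(RelOptRuleCall call) {
    // The deepest matched operand is the Aggregate; it emits a single row
    // when it has no GROUP BY keys (a grand-total aggregate).
    final HiveAggregate aggregate = call.rel(getOperands().size() - 1);
    return aggregate.getGroupCount() == 0;
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final HiveSortLimit sortLimit = call.rel(0);
    // With at most one input row, ordering (and any LIMIT >= 1) is a no-op,
    // so the SortLimit can be replaced by its input.
    call.transformTo(sortLimit.getInput());
  }
}
```

Because each variant pins the full operand tree, the planner never hands the rule a HepRelVertex to unwrap, and no recursive tree walk is needed.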
[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262705631

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java

## @@ -1047,4 +1052,37 @@ public static String toJsonString(final RelNode rel) {
     return planWriter.asString();
   }
+
+  /**
+   * Utility method to answer if given a rel plan it will produce at most
+   * one row.
+   */
+  public static boolean produceAtmostOneRow(RelNode rel) {
+    if (rel instanceof HepRelVertex) {
+      rel = ((HepRelVertex) rel).getCurrentRel();
+    }
+    if (rel instanceof HiveProject) {
+      if (((HiveProject) rel).hasWindowingExpr()) {

Review comment:
This check does not seem to be needed since the window expression will not alter the number of rows output by the Project.
[GitHub] [hive] vineetgarg02 opened a new pull request #557: HIVE-21338 Remove order by and limit for aggregates
vineetgarg02 opened a new pull request #557: HIVE-21338 Remove order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557
[jira] [Created] (HIVE-21395) Refactor HiveSemiJoinRule
Vineet Garg created HIVE-21395:
----------------------------------

             Summary: Refactor HiveSemiJoinRule
                 Key: HIVE-21395
                 URL: https://issues.apache.org/jira/browse/HIVE-21395
             Project: Hive
          Issue Type: Improvement
          Components: Query Planning
    Affects Versions: 4.0.0
            Reporter: Vineet Garg
            Assignee: Vineet Garg

The following refactoring needs to be done:
* Update the rule matching pattern to avoid using HepVertex
* HIVE-21338 adds logic to determine if a rel plan will produce at most one row. Use this in HiveSemiJoinRule
[jira] [Created] (HIVE-21394) Do not use PTF in case of select distinct having all the non aggregate column in over
Miklos Gergely created HIVE-21394:
-------------------------------------

             Summary: Do not use PTF in case of select distinct having all the non aggregate column in over
                 Key: HIVE-21394
                 URL: https://issues.apache.org/jira/browse/HIVE-21394
             Project: Hive
          Issue Type: Improvement
            Reporter: Miklos Gergely

Having a SELECT DISTINCT query where all the non-aggregate columns are enumerated in the OVER clause makes the OVER clause unnecessary. Yet the CalcitePlanner leaves it in the query after optimization. So, for example, in the query

SELECT DISTINCT count(x) OVER () FROM t

the OVER clause is meaningless. Still, it remains there after Calcite optimization.
[jira] [Created] (HIVE-21393) Refactor exception catching in HiveAccumuloTableInputFormat
Laszlo Bodor created HIVE-21393:
-----------------------------------

             Summary: Refactor exception catching in HiveAccumuloTableInputFormat
                 Key: HIVE-21393
                 URL: https://issues.apache.org/jira/browse/HIVE-21393
             Project: Hive
          Issue Type: Bug
            Reporter: Laszlo Bodor
[GitHub] [hive] dlavati-hw removed a comment on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
dlavati-hw removed a comment on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625477

@MichaelChirico for a more verbose explanation you can refer to https://cwiki.apache.org/confluence/display/Hive/HowToContribute
[GitHub] [hive] dlavati commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
dlavati commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625691

@MichaelChirico for a more verbose explanation you can refer to https://cwiki.apache.org/confluence/display/Hive/HowToContribute
[GitHub] [hive] dlavati-hw commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
dlavati-hw commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625477

@MichaelChirico for a more verbose explanation you can refer to https://cwiki.apache.org/confluence/display/Hive/HowToContribute
[GitHub] [hive] coder-chenzhi opened a new pull request #556: HIVE-21392 Fix misconfigurations of DataNucleus log in log4j.properties
coder-chenzhi opened a new pull request #556: HIVE-21392 Fix misconfigurations of DataNucleus log in log4j.properties
URL: https://github.com/apache/hive/pull/556

HIVE-21392 Misconfigurations of DataNucleus log in log4j.properties

In the patch of [HIVE-12020](https://issues.apache.org/jira/browse/HIVE-12020), we replaced the nine fine-grained DataNucleus-related loggers with three coarse-grained loggers (DataNucleus, Datastore and JPOX). As Prasanth Jayachandran [explains](https://issues.apache.org/jira/browse/HIVE-12020?focusedCommentId=15025612&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15025612), these three loggers were taken to be the top-level loggers in DataNucleus, so we don't need to specify other loggers for DataNucleus. However, according to the [documents](http://www.datanucleus.org/products/accessplatform/logging.html) and [source code](https://github.com/datanucleus/datanucleus-core/blob/master/src/main/java/org/datanucleus/util/NucleusLogger.java#L108) of DataNucleus, the top-level logger in DataNucleus is `DataNucleus`. Therefore, we just need to keep the right one.
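For reference, a sketch of what the cleanup could look like in Hive's log4j2-style properties file (illustrative logger names taken from the description; the exact file and levels vary by installation):

```
# Before: three coarse-grained loggers were configured, but only
# "DataNucleus" is a real top-level DataNucleus logger name; the
# "Datastore" and "JPOX" definitions match no DataNucleus category.
logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = ERROR
logger.Datastore.name = Datastore
logger.Datastore.level = ERROR
logger.JPOX.name = JPOX
logger.JPOX.level = ERROR

# After: keep only the real top-level logger, which already covers all
# DataNucleus.* sub-categories.
logger.DataNucleus.name = DataNucleus
logger.DataNucleus.level = ERROR
```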
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262415167

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java

## @@ -534,6 +536,90 @@ public void bootstrapExternalTablesDuringIncrementalPhase() throws Throwable {
         .verifyResults(Arrays.asList("10", "20"));
   }

+  @Test
+  public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
+    List<String> loadWithClause = new ArrayList<>();
+    loadWithClause.addAll(externalTableBasePathWithClause());
+
+    List<String> dumpWithClause = Collections.singletonList(
+        "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='false'"
+    );
+
+    WarehouseInstance.Tuple tupleBootstrapWithoutExternal = primary
+        .run("use " + primaryDbName)
+        .run("create external table t1 (id int)")
+        .run("insert into table t1 values (1)")
+        .run("create external table t2 (place string) partitioned by (country string)")
+        .run("insert into table t2 partition(country='india') values ('bangalore')")
+        .run("insert into table t2 partition(country='us') values ('austin')")
+        .run("create table t3 as select * from t1")
+        .dump(primaryDbName, null, dumpWithClause);
+
+    replica.load(replicatedDbName, tupleBootstrapWithoutExternal.dumpLocation, loadWithClause)
+        .status(replicatedDbName)
+        .verifyResult(tupleBootstrapWithoutExternal.lastReplicationId)
+        .run("use " + replicatedDbName)
+        .run("show tables")
+        .verifyResult("t3")
+        .run("select id from t3")
+        .verifyResult("1");
+
+    dumpWithClause = Arrays.asList("'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
+        "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='true'");
+    WarehouseInstance.Tuple tupleIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("drop table t1")
+        .run("create external table t4 (id int)")
+        .run("insert into table t4 values (10)")
+        .run("create table t5 as select * from t4")
+        .dump(primaryDbName, tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
+
+    // Verify if bootstrapping with same dump is idempotent and return same result
+    for (int i = 0; i < 2; i++) {
+      replica.load(replicatedDbName, tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
+          .status(replicatedDbName)
+          .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
+          .run("use " + replicatedDbName)
+          .run("show tables like 't1'")
+          .verifyFailure(new String[]{"t1"})
+          .run("select place from t2 where country = 'us'")
+          .verifyResult("austin")
+          .run("select id from t4")
+          .verifyResult("10")
+          .run("select id from t5")
+          .verifyResult("10");
+    }
+
+    // Drop an external table, add another managed table with same name, insert into existing external table
+    // and dump another bootstrap dump for external tables.
+    WarehouseInstance.Tuple tupleNewIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("insert into table t2 partition(country='india') values ('chennai')")
+        .run("drop table t2")
+        .run("create table t2 as select * from t4")
+        .run("insert into table t4 values (20)")
+        .dump(primaryDbName, tupleIncWithExternalBootstrap.lastReplicationId, dumpWithClause);
+
+    // Set previous dump as bootstrap to be rolled-back. Now, new bootstrap should overwrite the old one.
+    loadWithClause.add("'" + REPL_ROLLBACK_BOOTSTRAP_LOAD_CONFIG + "'='"
+        + tupleIncWithExternalBootstrap.dumpLocation + "'");

Review comment:
Done
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262415069

## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java

## @@ -279,6 +292,72 @@ a database ( directory )
     return 0;
   }

+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);

Review comment:
Done.
[jira] [Created] (HIVE-21392) Misconfigurations of DataNucleus log in log4j.properties
Chen Zhi created HIVE-21392:
-------------------------------

             Summary: Misconfigurations of DataNucleus log in log4j.properties
                 Key: HIVE-21392
                 URL: https://issues.apache.org/jira/browse/HIVE-21392
             Project: Hive
          Issue Type: Improvement
          Components: Logging
    Affects Versions: 2.0.0
            Reporter: Chen Zhi

In the patch of [HIVE-12020|https://issues.apache.org/jira/browse/HIVE-12020], we replaced the nine fine-grained DataNucleus-related loggers with three coarse-grained loggers (DataNucleus, Datastore and JPOX). As Prasanth Jayachandran [explains|https://issues.apache.org/jira/browse/HIVE-12020?focusedCommentId=15025612&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15025612], these three loggers were taken to be the top-level loggers in DataNucleus, so we don't need to specify other loggers for DataNucleus. However, according to the [documents|http://www.datanucleus.org/products/accessplatform/logging.html] and [source code|https://github.com/datanucleus/datanucleus-core/blob/master/src/main/java/org/datanucleus/util/NucleusLogger.java#L108] of DataNucleus, the top-level logger in DataNucleus is `DataNucleus`. Therefore, we just need to keep the right one.
[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262403171

## File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java

## @@ -534,6 +536,90 @@ public void bootstrapExternalTablesDuringIncrementalPhase() throws Throwable {
         .verifyResults(Arrays.asList("10", "20"));
   }

+  @Test
+  public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
+    List<String> loadWithClause = new ArrayList<>();
+    loadWithClause.addAll(externalTableBasePathWithClause());
+
+    List<String> dumpWithClause = Collections.singletonList(
+        "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='false'"
+    );
+
+    WarehouseInstance.Tuple tupleBootstrapWithoutExternal = primary
+        .run("use " + primaryDbName)
+        .run("create external table t1 (id int)")
+        .run("insert into table t1 values (1)")
+        .run("create external table t2 (place string) partitioned by (country string)")
+        .run("insert into table t2 partition(country='india') values ('bangalore')")
+        .run("insert into table t2 partition(country='us') values ('austin')")
+        .run("create table t3 as select * from t1")
+        .dump(primaryDbName, null, dumpWithClause);
+
+    replica.load(replicatedDbName, tupleBootstrapWithoutExternal.dumpLocation, loadWithClause)
+        .status(replicatedDbName)
+        .verifyResult(tupleBootstrapWithoutExternal.lastReplicationId)
+        .run("use " + replicatedDbName)
+        .run("show tables")
+        .verifyResult("t3")
+        .run("select id from t3")
+        .verifyResult("1");
+
+    dumpWithClause = Arrays.asList("'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
+        "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='true'");
+    WarehouseInstance.Tuple tupleIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("drop table t1")
+        .run("create external table t4 (id int)")
+        .run("insert into table t4 values (10)")
+        .run("create table t5 as select * from t4")
+        .dump(primaryDbName, tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
+
+    // Verify if bootstrapping with same dump is idempotent and return same result
+    for (int i = 0; i < 2; i++) {
+      replica.load(replicatedDbName, tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
+          .status(replicatedDbName)
+          .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
+          .run("use " + replicatedDbName)
+          .run("show tables like 't1'")
+          .verifyFailure(new String[]{"t1"})
+          .run("select place from t2 where country = 'us'")
+          .verifyResult("austin")
+          .run("select id from t4")
+          .verifyResult("10")
+          .run("select id from t5")
+          .verifyResult("10");
+    }
+
+    // Drop an external table, add another managed table with same name, insert into existing external table
+    // and dump another bootstrap dump for external tables.
+    WarehouseInstance.Tuple tupleNewIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("insert into table t2 partition(country='india') values ('chennai')")
+        .run("drop table t2")
+        .run("create table t2 as select * from t4")
+        .run("insert into table t4 values (20)")
+        .dump(primaryDbName, tupleIncWithExternalBootstrap.lastReplicationId, dumpWithClause);
+
+    // Set previous dump as bootstrap to be rolled-back. Now, new bootstrap should overwrite the old one.
+    loadWithClause.add("'" + REPL_ROLLBACK_BOOTSTRAP_LOAD_CONFIG + "'='"
+        + tupleIncWithExternalBootstrap.dumpLocation + "'");

Review comment:
Yes... will modify the current test case to fail the bootstrap first, before the retry.
[jira] [Created] (HIVE-21391) LLAP: Pool of column vector buffers can cause memory pressure
Prasanth Jayachandran created HIVE-21391:
--------------------------------------------

             Summary: LLAP: Pool of column vector buffers can cause memory pressure
                 Key: HIVE-21391
                 URL: https://issues.apache.org/jira/browse/HIVE-21391
             Project: Hive
          Issue Type: Bug
          Components: llap
    Affects Versions: 4.0.0, 3.2.0
            Reporter: Prasanth Jayachandran

When there are too many columns (on the order of 100s) with decimal or string types, the pool of column vector buffers created here [https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/decode/EncodedDataConsumer.java#L59] can cause memory pressure.

Example: 128 (poolSize) * 300 (numCols) * 1024 (batchSize) * 80 (decimalSize) ~= 3GB

The pool size keeps increasing when there is a slow consumer but fast LLAP IO (SSDs), leading to GC pressure when all LLAP IO threads read splits from the same table.
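The quoted estimate checks out arithmetically (a plain verification sketch; the four factors are the illustrative numbers from the description, not constants read from LLAP code):

```
public class PoolMemoryEstimate {
  public static void main(String[] args) {
    long poolSize = 128;     // buffered column-vector batches in the pool
    long numCols = 300;      // columns in the scanned table
    long batchSize = 1024;   // rows per vectorized batch
    long decimalSize = 80;   // approximate bytes per decimal value
    long bytes = poolSize * numCols * batchSize * decimalSize;
    // Prints 3,145,728,000 bytes = 2.93 GiB, i.e. roughly the quoted ~3GB.
    System.out.printf("%,d bytes = %.2f GiB%n", bytes, bytes / Math.pow(1024, 3));
  }
}
```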
[jira] [Created] (HIVE-21390) BI split strategy does not work for blob stores
Prasanth Jayachandran created HIVE-21390:
--------------------------------------------

             Summary: BI split strategy does not work for blob stores
                 Key: HIVE-21390
                 URL: https://issues.apache.org/jira/browse/HIVE-21390
             Project: Hive
          Issue Type: Bug
    Affects Versions: 4.0.0, 3.2.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran

The BI split strategy cuts splits at block boundaries; however, there are no block boundaries in blob storage, so we end up with a single split for the BI split strategy.
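As an illustration of why boundary-based splitting degenerates on blob stores (plain arithmetic, not the actual ORC split-generation code; the block sizes are hypothetical):

```
public class SplitCountDemo {
  // One split per block boundary, as a boundary-based strategy would produce.
  static long countSplits(long fileLen, long blockSize) {
    return (fileLen + blockSize - 1) / blockSize;
  }

  public static void main(String[] args) {
    long oneGiB = 1L << 30;
    // HDFS-like file with 256MB blocks: four splits, so four parallel readers.
    System.out.println(countSplits(oneGiB, 256L << 20)); // 4
    // Blob store reporting the whole object as one "block": a single split.
    System.out.println(countSplits(oneGiB, oneGiB));     // 1
  }
}
```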