[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814876
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -279,6 +292,72 @@ a database ( directory )
 return 0;
   }
 
+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by 
input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+.addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+if (!fs.exists(bootstrapDirectory)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  + bootstrapDirectory);
+}
+
+FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
+if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  + bootstrapDirectory);
+}
+
+if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
+  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
+  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+}
+
+for (FileStatus dbDir : fileStatuses) {
+  Path dbLevelPath = dbDir.getPath();
+  String dbNameInDump = dbLevelPath.getName();
+
+  List<String> tableNames = new ArrayList<>();
+  RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+  while (filesIterator.hasNext()) {
+Path nextFile = filesIterator.next().getPath();
+String filePath = nextFile.toString();
+if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+  // Remove dbLevelPath from the current path to check if this 
_metadata file is under DB or
+  // table level directory.
+  String replacedString = filePath.replace(dbLevelPath.toString(), "");
+  if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+tableNames.add(nextFile.getParent().getName());
 
 Review comment:
   Done. As discussed, DatabaseEventsIterator will be used to fetch table names. 
This is inefficient but shouldn't be a problem as this flow is very rare.
   Btw, DatabaseEventsIterator will continue to assume that any directory under 
the DB-level dump dir containing a _metadata file is a table dump. This patch 
won't address that and assumes the layout won't change in future.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814921
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -298,25 +298,26 @@ a database ( directory )
* @throws IOException File operations failure.
* @throws InvalidInputException Invalid input dump directory.
*/
-  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
-Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
 .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
 FileSystem fs = bootstrapDirectory.getFileSystem(conf);
 
 if (!fs.exists(bootstrapDirectory)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is invalid: "
 
 Review comment:
   Done




[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262815016
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -298,25 +298,26 @@ a database ( directory )
* @throws IOException File operations failure.
* @throws InvalidInputException Invalid input dump directory.
*/
-  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
-Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
 .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
 FileSystem fs = bootstrapDirectory.getFileSystem(conf);
 
 if (!fs.exists(bootstrapDirectory)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is invalid: "
   + bootstrapDirectory);
 }
 
 FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
 if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is empty: "
   + bootstrapDirectory);
 }
 
 if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
-  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
-  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables has multiple"
+  + " DB dirs in the dump: " + bootstrapDirectory
+  + " which is not allowed on single target DB: " + 
work.dbNameToLoadIn);
 
 Review comment:
   Done




[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814876
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -279,6 +292,72 @@ a database ( directory )
 return 0;
   }
 
+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by 
input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+.addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+if (!fs.exists(bootstrapDirectory)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  + bootstrapDirectory);
+}
+
+FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
+if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  + bootstrapDirectory);
+}
+
+if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
+  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
+  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+}
+
+for (FileStatus dbDir : fileStatuses) {
+  Path dbLevelPath = dbDir.getPath();
+  String dbNameInDump = dbLevelPath.getName();
+
+  List<String> tableNames = new ArrayList<>();
+  RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+  while (filesIterator.hasNext()) {
+Path nextFile = filesIterator.next().getPath();
+String filePath = nextFile.toString();
+if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+  // Remove dbLevelPath from the current path to check if this 
_metadata file is under DB or
+  // table level directory.
+  String replacedString = filePath.replace(dbLevelPath.toString(), "");
+  if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+tableNames.add(nextFile.getParent().getName());
 
 Review comment:
   Done




[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262814955
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##
 @@ -573,23 +577,32 @@ public void 
retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
 .run("create table t5 as select * from t4")
 .dump(primaryDbName, 
tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
 
-// Verify if bootstrapping with same dump is idempotent and return same 
result
-for (int i = 0; i < 2; i++) {
-  replica.load(replicatedDbName, 
tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
-  .status(replicatedDbName)
-  .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
-  .run("use " + replicatedDbName)
-  .run("show tables like 't1'")
-  .verifyFailure(new String[]{"t1"})
-  .run("select place from t2 where country = 'us'")
-  .verifyResult("austin")
-  .run("select id from t4")
-  .verifyResult("10")
-  .run("select id from t5")
-  .verifyResult("10");
+// Fail setting ckpt property for table t4 but success for t2.
+BehaviourInjection<CallerArguments, Boolean> callerVerifier
+= new BehaviourInjection<CallerArguments, Boolean>() {
+  @Nullable
+  @Override
+  public Boolean apply(@Nullable CallerArguments args) {
+if (args.tblName.equalsIgnoreCase("t4") && 
args.dbName.equalsIgnoreCase(replicatedDbName)) {
+  injectionPathCalled = true;
+  LOG.warn("Verifier - DB : " + args.dbName + " TABLE : " + 
args.tblName);
+  return false;
+}
+return true;
+  }
+};
+
+// Fail repl load before the ckpt property is set for t4 and after it is 
set for t2.
+// In the retry, these half baked tables should be dropped and bootstrap 
should ve successful.
 
 Review comment:
   Done
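   For context, a minimal sketch of how such a verifier is typically wired into 
the failing load, using names from Hive's replication test utilities (treat the 
exact InjectableBehaviourObjectStore and loadFailure signatures as assumptions, 
not as this patch's code):
   ```
   // Register the injected behaviour, run the load expected to fail, then
   // always reset the injection and assert that it actually fired.
   InjectableBehaviourObjectStore.setCallerVerifier(callerVerifier);
   try {
     replica.loadFailure(replicatedDbName,
         tupleIncWithExternalBootstrap.dumpLocation, loadWithClause);
   } finally {
     InjectableBehaviourObjectStore.resetCallerVerifier();
   }
   callerVerifier.assertInjectionsPerformed(true, false);
   ```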




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262804445
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -279,6 +292,72 @@ a database ( directory )
 return 0;
   }
 
+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by 
input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+.addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+if (!fs.exists(bootstrapDirectory)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  + bootstrapDirectory);
+}
+
+FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
+if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  + bootstrapDirectory);
+}
+
+if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
+  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
+  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+}
+
+for (FileStatus dbDir : fileStatuses) {
+  Path dbLevelPath = dbDir.getPath();
+  String dbNameInDump = dbLevelPath.getName();
+
+  List<String> tableNames = new ArrayList<>();
+  RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+  while (filesIterator.hasNext()) {
+Path nextFile = filesIterator.next().getPath();
+String filePath = nextFile.toString();
+if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+  // Remove dbLevelPath from the current path to check if this 
_metadata file is under DB or
+  // table level directory.
+  String replacedString = filePath.replace(dbLevelPath.toString(), "");
+  if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+tableNames.add(nextFile.getParent().getName());
 
 Review comment:
   We don't need a new iterator. DatabaseEventsIterator is already doing this 
job for us. It also returns different file system events based on the kind of 
metadata file it encounters. So we can use the type of events to separate the 
table ones from non-table ones. Can we use it?
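   A rough sketch of that idea, assuming DatabaseEventsIterator iterates 
BootstrapEvent objects exposing eventType(), that its constructor takes the 
DB-level path and the conf, and using a hypothetical tableNameOf() helper to 
pull the table name out of a table-level event:
   ```
   // Collect only table-level bootstrap events; other event kinds
   // (database, function, ...) are skipped rather than mistaken for tables.
   List<String> tableNames = new ArrayList<>();
   DatabaseEventsIterator dbEvents = new DatabaseEventsIterator(dbLevelPath, conf);
   while (dbEvents.hasNext()) {
     BootstrapEvent event = dbEvents.next();
     if (event.eventType() == BootstrapEvent.EventType.Table) {
       tableNames.add(tableNameOf(event)); // hypothetical helper
     }
   }
   ```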




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801299
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -298,25 +298,26 @@ a database ( directory )
* @throws IOException File operations failure.
* @throws InvalidInputException Invalid input dump directory.
*/
-  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
-Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
 .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
 FileSystem fs = bootstrapDirectory.getFileSystem(conf);
 
 if (!fs.exists(bootstrapDirectory)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is invalid: "
 
 Review comment:
   Grammar suggestion "Bootstrap dump directory specified to clean tables from 
is invalid"




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801475
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -298,25 +298,26 @@ a database ( directory )
* @throws IOException File operations failure.
* @throws InvalidInputException Invalid input dump directory.
*/
-  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
-Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
 .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
 FileSystem fs = bootstrapDirectory.getFileSystem(conf);
 
 if (!fs.exists(bootstrapDirectory)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is invalid: "
   + bootstrapDirectory);
 }
 
 FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
 if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is empty: "
   + bootstrapDirectory);
 }
 
 if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
-  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
-  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables has multiple"
+  + " DB dirs in the dump: " + bootstrapDirectory
+  + " which is not allowed on single target DB: " + 
work.dbNameToLoadIn);
 
 Review comment:
   Similar grammar suggestion.




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262803005
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -279,6 +292,72 @@ a database ( directory )
 return 0;
   }
 
+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by 
input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+.addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+if (!fs.exists(bootstrapDirectory)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  + bootstrapDirectory);
+}
+
+FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
+if ((fileStatuses == null) || (fileStatuses.length == 0)) {
+  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  + bootstrapDirectory);
+}
+
+if (StringUtils.isNotBlank(work.dbNameToLoadIn) && (fileStatuses.length > 
1)) {
+  throw new InvalidInputException("Multiple DB dirs in the dump: " + 
bootstrapDirectory
+  + " is not allowed to load to single target DB: " + 
work.dbNameToLoadIn);
+}
+
+for (FileStatus dbDir : fileStatuses) {
+  Path dbLevelPath = dbDir.getPath();
+  String dbNameInDump = dbLevelPath.getName();
+
+  List<String> tableNames = new ArrayList<>();
+  RemoteIterator<LocatedFileStatus> filesIterator = fs.listFiles(dbLevelPath, true);
+  while (filesIterator.hasNext()) {
+Path nextFile = filesIterator.next().getPath();
+String filePath = nextFile.toString();
+if (filePath.endsWith(EximUtil.METADATA_NAME)) {
+  // Remove dbLevelPath from the current path to check if this 
_metadata file is under DB or
+  // table level directory.
+  String replacedString = filePath.replace(dbLevelPath.toString(), "");
+  if (!replacedString.equalsIgnoreCase(EximUtil.METADATA_NAME)) {
+tableNames.add(nextFile.getParent().getName());
 
 Review comment:
   Well, if this code is the only duplicate copy, it makes sense to do this 
refactoring while we are adding the duplicate code. That avoids the duplication 
in the first place.
   
   We might not change the format of the dump directory, but we might add a 
_metadata file to other object-specific directories. That would mean the code 
above mistakes objects other than tables for tables and tries to drop them.
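   A hedged sketch of a stricter check along those lines, reusing the variables 
from the hunk above: treat a _metadata file as a table dump only when its 
directory sits directly under the DB-level directory, so _metadata files added 
later to other object-specific subtrees would not be picked up.
   ```
   // Table dumps live exactly one level below dbLevelPath: <db>/<table>/_metadata.
   Path parentDir = nextFile.getParent();
   if (!parentDir.equals(dbLevelPath) && dbLevelPath.equals(parentDir.getParent())) {
     tableNames.add(parentDir.getName());
   }
   ```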




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262800301
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##
 @@ -573,23 +577,32 @@ public void 
retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
 .run("create table t5 as select * from t4")
 .dump(primaryDbName, 
tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
 
-// Verify if bootstrapping with same dump is idempotent and return same 
result
-for (int i = 0; i < 2; i++) {
-  replica.load(replicatedDbName, 
tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
-  .status(replicatedDbName)
-  .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
-  .run("use " + replicatedDbName)
-  .run("show tables like 't1'")
-  .verifyFailure(new String[]{"t1"})
-  .run("select place from t2 where country = 'us'")
-  .verifyResult("austin")
-  .run("select id from t4")
-  .verifyResult("10")
-  .run("select id from t5")
-  .verifyResult("10");
+// Fail setting ckpt property for table t4 but success for t2.
+BehaviourInjection<CallerArguments, Boolean> callerVerifier
+= new BehaviourInjection<CallerArguments, Boolean>() {
+  @Nullable
+  @Override
+  public Boolean apply(@Nullable CallerArguments args) {
+if (args.tblName.equalsIgnoreCase("t4") && 
args.dbName.equalsIgnoreCase(replicatedDbName)) {
+  injectionPathCalled = true;
+  LOG.warn("Verifier - DB : " + args.dbName + " TABLE : " + 
args.tblName);
+  return false;
+}
+return true;
+  }
+};
+
+// Fail repl load before the ckpt property is set for t4 and after it is 
set for t2.
+// In the retry, these half baked tables should be dropped and bootstrap 
should ve successful.
 
 Review comment:
   "should ve" - typo, should be "should be"




[GitHub] [hive] ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
ashutosh-bapat commented on a change in pull request #551: HIVE-21286: Hive 
should support clean-up of previously bootstrapped tables when retry from 
different dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262801366
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -298,25 +298,26 @@ a database ( directory )
* @throws IOException File operations failure.
* @throws InvalidInputException Invalid input dump directory.
*/
-  private void bootstrapRollbackTask() throws HiveException, IOException, 
InvalidInputException {
-Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+  private void cleanTablesFromBootstrap() throws HiveException, IOException, 
InvalidInputException {
+Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToCleanTables)
 .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
 FileSystem fs = bootstrapDirectory.getFileSystem(conf);
 
 if (!fs.exists(bootstrapDirectory)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback doesn't exist: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is invalid: "
   + bootstrapDirectory);
 }
 
 FileStatus[] fileStatuses = fs.listStatus(bootstrapDirectory, 
EximUtil.getDirectoryFilter(fs));
 if ((fileStatuses == null) || (fileStatuses.length == 0)) {
-  throw new InvalidInputException("Input bootstrap dump directory to 
rollback is empty: "
+  throw new InvalidInputException("Input bootstrap dump directory to clean 
tables is empty: "
 
 Review comment:
   Grammar suggestion similar to above.




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800104
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, 
RelOptSchema relOptSchema, SchemaPlu
 perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, 
"Calcite: Window fixing rule");
   }
 
+  perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);
 
 Review comment:
   Yes, that is what I meant. Maybe someone will pick it up :)




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800893
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java
 ##
 @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link 
org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not 
based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+super(operand);
+  }
+
+  //~ Methods 
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+final HiveSortLimit sortLimit = call.rel(0);
+
+return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());
 
 Review comment:
   The rules should be portable between planners, i.e., a rule should not be 
unwrapping a HepRelVertex. If you want to remain generic, a workaround is 
matching RelNode. However, I would try to make the rule matchers as restrictive 
as possible, since the more information we expose to the planner, the smarter 
decisions it can make, e.g., avoiding unnecessary triggering of rules.
   You do not need recursion. When the rule is executed, we should not have 
multiple consecutive Project operators in the plan. If there were, we would get 
rid of them using ProjectMerge and ProjectRemove rules.




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262800104
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, 
RelOptSchema relOptSchema, SchemaPlu
 perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, 
"Calcite: Window fixing rule");
   }
 
+  perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);
 
 Review comment:
   Yes, that is what I meant.




[GitHub] [hive] vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove 
order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262792425
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, 
RelOptSchema relOptSchema, SchemaPlu
 perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, 
"Calcite: Window fixing rule");
   }
 
+  perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);
 
 Review comment:
   @jcamachor I am not sure I fully understand the point. In your example the 
ORDER BY is already being removed. Are you suggesting opening a JIRA to remove 
the extra aggregate? 




[GitHub] [hive] vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
vineetgarg02 commented on a change in pull request #557: HIVE-21338 Remove 
order by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262789953
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java
 ##
 @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link 
org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not 
based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+super(operand);
+  }
+
+  //~ Methods 
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+final HiveSortLimit sortLimit = call.rel(0);
+
+return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());
 
 Review comment:
   @jcamachor The rule uses a recursive method which could traverse the whole 
tree, so restricting this rule to a specific pattern does not really make sense 
(besides matching HiveSortLimit). 




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262699879
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java
 ##
 @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link 
org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not 
based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
 
 Review comment:
   Create static final INSTANCE(s) for the rule to avoid multiple 
instantiations.
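   A minimal sketch of that pattern, assuming the constructor goes private once 
the instance is shared:
   ```
   // Single shared, stateless instance; call sites reference the constant
   // instead of constructing the rule repeatedly.
   public static final HiveSortLimitRemoveRule INSTANCE = new HiveSortLimitRemoveRule();
   ```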




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262715669
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java
 ##
 @@ -1925,6 +1926,11 @@ public RelNode apply(RelOptCluster cluster, 
RelOptSchema relOptSchema, SchemaPlu
 perfLogger.PerfLogEnd(this.getClass().getName(), PerfLogger.OPTIMIZER, 
"Calcite: Window fixing rule");
   }
 
+  perfLogger.PerfLogBegin(this.getClass().getName(), PerfLogger.OPTIMIZER);
 
 Review comment:
   I believe we may still compute aggregations that are not used in the plan 
output. Could include the following test to confirm:
   ```
   explain cbo
   SELECT COUNT(*) FROM t1 ORDER BY SUM(col), COUNT(*);
   ```
   I believe this is not critical, but we can add the test and a note to the q 
file, and create a placeholder JIRA.




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262700363
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSortLimitRemoveRule.java
 ##
 @@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.ql.optimizer.calcite.rules;
+
+import org.apache.calcite.plan.RelOptRule;
+import org.apache.calcite.plan.RelOptRuleCall;
+import org.apache.calcite.plan.RelOptRuleOperand;
+import org.apache.hadoop.hive.ql.optimizer.calcite.HiveRelOptUtil;
+import org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit;
+
+/**
+ * Planner rule that removes
+ * a {@link 
org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveSortLimit}.
+ * Note that this is different from HiveSortRemoveRule because this is not 
based on statistics
+ */
+public class HiveSortLimitRemoveRule extends RelOptRule {
+
+  //~ Constructors ---
+
+  public HiveSortLimitRemoveRule() {
+this(operand(HiveSortLimit.class, any()));
+  }
+
+  private HiveSortLimitRemoveRule(RelOptRuleOperand operand) {
+super(operand);
+  }
+
+  //~ Methods 
+
+  @Override
+  public boolean matches(RelOptRuleCall call) {
+final HiveSortLimit sortLimit = call.rel(0);
+
+return HiveRelOptUtil.produceAtmostOneRow(sortLimit.getInput());
 
 Review comment:
   Instead of unwrapping HepRelVertex nodes using this method, the rule should 
have two variants, similar to other rules: one that matches 
SortLimit-Project-Aggregate and one that matches SortLimit-Aggregate.
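   A sketch of the two variants using the operand builders from the rule above 
(the HiveProject/HiveAggregate shapes are illustrative, not the final patch):
   ```
   // One instance per matched shape keeps the matcher restrictive and avoids
   // unwrapping HepRelVertex inside the rule.
   public static final HiveSortLimitRemoveRule SORT_PROJECT_AGGREGATE =
       new HiveSortLimitRemoveRule(
           operand(HiveSortLimit.class,
               operand(HiveProject.class,
                   operand(HiveAggregate.class, any()))));

   public static final HiveSortLimitRemoveRule SORT_AGGREGATE =
       new HiveSortLimitRemoveRule(
           operand(HiveSortLimit.class,
               operand(HiveAggregate.class, any())));
   ```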




[GitHub] [hive] jcamachor commented on a change in pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
jcamachor commented on a change in pull request #557: HIVE-21338 Remove order 
by and limit for aggregates
URL: https://github.com/apache/hive/pull/557#discussion_r262705631
 
 

 ##
 File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOptUtil.java
 ##
 @@ -1047,4 +1052,37 @@ public static String toJsonString(final RelNode rel) {
 return planWriter.asString();
   }
 
+
+  /**
+   * Utility method to answer if given a rel plan it will produce at most
+   *  one row.
+   */
+  public static boolean produceAtmostOneRow(RelNode rel) {
+if(rel instanceof HepRelVertex) {
+  rel = ((HepRelVertex)rel).getCurrentRel();
+}
+if(rel instanceof HiveProject) {
+  if(((HiveProject)rel).hasWindowingExpr()) {
 
 Review comment:
   This check does not seem to be needed since the window expression will not 
alter the number of rows output by the Project.
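   In other words, the Project branch could collapse to a plain recursion 
(sketch only):
   ```
   // A Project, windowing expressions included, never changes its input's
   // row count, so no hasWindowingExpr() special case is needed.
   if (rel instanceof HiveProject) {
     return produceAtmostOneRow(((HiveProject) rel).getInput());
   }
   ```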




[GitHub] [hive] vineetgarg02 opened a new pull request #557: HIVE-21338 Remove order by and limit for aggregates

2019-03-05 Thread GitBox
vineetgarg02 opened a new pull request #557: HIVE-21338 Remove order by and 
limit for aggregates
URL: https://github.com/apache/hive/pull/557
 
 
   




[jira] [Created] (HIVE-21395) Refactor HiveSemiJoinRule

2019-03-05 Thread Vineet Garg (JIRA)
Vineet Garg created HIVE-21395:
--

 Summary: Refactor HiveSemiJoinRule
 Key: HIVE-21395
 URL: https://issues.apache.org/jira/browse/HIVE-21395
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 4.0.0
Reporter: Vineet Garg
Assignee: Vineet Garg


The following refactoring needs to be done:
* Update the rule matching pattern to avoid using HepRelVertex
* HIVE-21338 adds logic to determine if a rel plan will produce at most one row. 
Use this in HiveSemiJoinRule



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21394) Do not use PTF in case of select distinct having all the non aggregate column in over

2019-03-05 Thread Miklos Gergely (JIRA)
Miklos Gergely created HIVE-21394:
-

 Summary: Do not use PTF in case of select distinct having all the 
non aggregate column in over
 Key: HIVE-21394
 URL: https://issues.apache.org/jira/browse/HIVE-21394
 Project: Hive
  Issue Type: Improvement
Reporter: Miklos Gergely


Having a select distinct query where all the non-aggregate columns are 
enumerated in the over clause makes the over clause unnecessary. Yet the 
CalcitePlanner leaves it in the query after optimization.

So for example in the query

SELECT DISTINCT count(x) OVER () FROM t

the over clause is meaningless. Still, it remains there after Calcite 
optimization.

 





[jira] [Created] (HIVE-21393) Refactor exception catching in HiveAccumuloTableInputFormat

2019-03-05 Thread Laszlo Bodor (JIRA)
Laszlo Bodor created HIVE-21393:
---

 Summary: Refactor exception catching in 
HiveAccumuloTableInputFormat
 Key: HIVE-21393
 URL: https://issues.apache.org/jira/browse/HIVE-21393
 Project: Hive
  Issue Type: Bug
Reporter: Laszlo Bodor








[GitHub] [hive] dlavati-hw removed a comment on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex

2019-03-05 Thread GitBox
dlavati-hw removed a comment on issue #553: [HIVE-13482][UDF] Explicitly define 
str_to_map args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625477
 
 
   @MichaelChirico for a more verbose explanation you can refer to 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute




[GitHub] [hive] dlavati commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex

2019-03-05 Thread GitBox
dlavati commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map 
args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625691
 
 
   @MichaelChirico for a more verbose explanation you can refer to 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute




[GitHub] [hive] dlavati-hw commented on issue #553: [HIVE-13482][UDF] Explicitly define str_to_map args as regex

2019-03-05 Thread GitBox
dlavati-hw commented on issue #553: [HIVE-13482][UDF] Explicitly define 
str_to_map args as regex
URL: https://github.com/apache/hive/pull/553#issuecomment-469625477
 
 
   @MichaelChirico for a more verbose explanation you can refer to 
https://cwiki.apache.org/confluence/display/Hive/HowToContribute




[GitHub] [hive] coder-chenzhi opened a new pull request #556: HIVE-21392 Fix misconfigurations of DataNucleus log in log4j.properties

2019-03-05 Thread GitBox
coder-chenzhi opened a new pull request #556: HIVE-21392 Fix misconfigurations 
of DataNucleus log in log4j.properties
URL: https://github.com/apache/hive/pull/556
 
 
   HIVE-21392 Misconfigurations of DataNucleus log in log4j.properties
   
   In the patch for 
[HIVE-12020](https://issues.apache.org/jira/browse/HIVE-12020), we changed the 
DataNucleus-related logging configuration from nine fine-grained loggers to 
three coarse-grained loggers (DataNucleus, Datastore and JPOX). 
   
   As Prasanth Jayachandran 
[explains](https://issues.apache.org/jira/browse/HIVE-12020?focusedCommentId=15025612&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15025612),
 these three loggers are the top-level loggers in DataNucleus, so we don't 
need to specify other loggers for DataNucleus. 
   
   However, according to the 
[documentation](http://www.datanucleus.org/products/accessplatform/logging.html) 
and [source 
code](https://github.com/datanucleus/datanucleus-core/blob/master/src/main/java/org/datanucleus/util/NucleusLogger.java#L108)
 of DataNucleus, the only top-level logger in DataNucleus is `DataNucleus`. 
Therefore, we just need to keep that one.
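   A hedged sketch of the resulting entries (the exact logger prefix and 
appender name in Hive's log4j.properties are assumptions):
   ```
   # Keep only the real top-level DataNucleus logger.
   log4j.logger.DataNucleus=ERROR,DRFA
   # Dropped: Datastore and JPOX are not top-level DataNucleus loggers.
   # log4j.logger.Datastore=ERROR,DRFA
   # log4j.logger.JPOX=ERROR,DRFA
   ```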




[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262415167
 
 

 ##
 File path: 
itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##
 @@ -534,6 +536,90 @@ public void 
bootstrapExternalTablesDuringIncrementalPhase() throws Throwable {
 .verifyResults(Arrays.asList("10", "20"));
   }
 
+  @Test
+  public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable 
{
+List<String> loadWithClause = new ArrayList<>();
+loadWithClause.addAll(externalTableBasePathWithClause());
+
+List<String> dumpWithClause = Collections.singletonList(
+"'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + 
"'='false'"
+);
+
+WarehouseInstance.Tuple tupleBootstrapWithoutExternal = primary
+.run("use " + primaryDbName)
+.run("create external table t1 (id int)")
+.run("insert into table t1 values (1)")
+.run("create external table t2 (place string) partitioned by 
(country string)")
+.run("insert into table t2 partition(country='india') values 
('bangalore')")
+.run("insert into table t2 partition(country='us') values 
('austin')")
+.run("create table t3 as select * from t1")
+.dump(primaryDbName, null, dumpWithClause);
+
+replica.load(replicatedDbName, tupleBootstrapWithoutExternal.dumpLocation, 
loadWithClause)
+.status(replicatedDbName)
+.verifyResult(tupleBootstrapWithoutExternal.lastReplicationId)
+.run("use " + replicatedDbName)
+.run("show tables")
+.verifyResult("t3")
+.run("select id from t3")
+.verifyResult("1");
+
+dumpWithClause = Arrays.asList("'" + 
HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
+"'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + 
"'='true'");
+WarehouseInstance.Tuple tupleIncWithExternalBootstrap = primary.run("use " 
+ primaryDbName)
+.run("drop table t1")
+.run("create external table t4 (id int)")
+.run("insert into table t4 values (10)")
+.run("create table t5 as select * from t4")
+.dump(primaryDbName, 
tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
+
+// Verify if bootstrapping with same dump is idempotent and return same 
result
+for (int i = 0; i < 2; i++) {
+  replica.load(replicatedDbName, 
tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
+  .status(replicatedDbName)
+  .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
+  .run("use " + replicatedDbName)
+  .run("show tables like 't1'")
+  .verifyFailure(new String[]{"t1"})
+  .run("select place from t2 where country = 'us'")
+  .verifyResult("austin")
+  .run("select id from t4")
+  .verifyResult("10")
+  .run("select id from t5")
+  .verifyResult("10");
+}
+
+// Drop an external table, add another managed table with same name, 
insert into existing external table
+// and dump another bootstrap dump for external tables.
+WarehouseInstance.Tuple tupleNewIncWithExternalBootstrap = 
primary.run("use " + primaryDbName)
+.run("insert into table t2 partition(country='india') values 
('chennai')")
+.run("drop table t2")
+.run("create table t2 as select * from t4")
+.run("insert into table t4 values (20)")
+.dump(primaryDbName, 
tupleIncWithExternalBootstrap.lastReplicationId, dumpWithClause);
+
+// Set previous dump as bootstrap to be rolled-back. Now, new bootstrap 
should overwrite the old one.
+loadWithClause.add("'" + REPL_ROLLBACK_BOOTSTRAP_LOAD_CONFIG + "'='"
++ tupleIncWithExternalBootstrap.dumpLocation + "'");
 
 Review comment:
   Done




[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262415069
 
 

 ##
 File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/ReplLoadTask.java
 ##
 @@ -279,6 +292,72 @@ a database ( directory )
 return 0;
   }
 
+  /**
+   * Cleanup/drop tables from the given database which are bootstrapped by input dump dir.
+   * @throws HiveException Failed to drop the tables.
+   * @throws IOException File operations failure.
+   * @throws InvalidInputException Invalid input dump directory.
+   */
+  private void bootstrapRollbackTask() throws HiveException, IOException, InvalidInputException {
+    Path bootstrapDirectory = new PathBuilder(work.bootstrapDumpToRollback)
+        .addDescendant(ReplUtils.INC_BOOTSTRAP_ROOT_DIR_NAME).build();
+    FileSystem fs = bootstrapDirectory.getFileSystem(conf);
+
+    if (!fs.exists(bootstrapDirectory)) {
+      throw new InvalidInputException("Input bootstrap dump directory to rollback doesn't exist: "
+          + bootstrapDirectory);
 
 Review comment:
  Done.




[jira] [Created] (HIVE-21392) Misconfigurations of DataNucleus log in log4j.properties

2019-03-05 Thread Chen Zhi (JIRA)
Chen Zhi created HIVE-21392:
---

 Summary: Misconfigurations of DataNucleus log in log4j.properties
 Key: HIVE-21392
 URL: https://issues.apache.org/jira/browse/HIVE-21392
 Project: Hive
  Issue Type: Improvement
  Components: Logging
Affects Versions: 2.0.0
Reporter: Chen Zhi


In the patch for [HIVE-12020|https://issues.apache.org/jira/browse/HIVE-12020], 
we changed the DataNucleus-related logging configuration, replacing nine 
fine-grained loggers with three coarse-grained ones (DataNucleus, Datastore and 
JPOX). As Prasanth Jayachandran 
[explains|https://issues.apache.org/jira/browse/HIVE-12020?focusedCommentId=15025612&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15025612],
 these three were taken to be the top-level loggers in DataNucleus, so that no 
other DataNucleus loggers would need to be specified. However, according to the 
[documentation|http://www.datanucleus.org/products/accessplatform/logging.html] and 
[source 
code|https://github.com/datanucleus/datanucleus-core/blob/master/src/main/java/org/datanucleus/util/NucleusLogger.java#L108]
 of DataNucleus, the only top-level logger is `DataNucleus`; Datastore and JPOX 
are not. Therefore, we just need to keep the `DataNucleus` entry.





[GitHub] [hive] sankarh commented on a change in pull request #551: HIVE-21286: Hive should support clean-up of previously bootstrapped tables when retry from different dump.

2019-03-05 Thread GitBox
sankarh commented on a change in pull request #551: HIVE-21286: Hive should 
support clean-up of previously bootstrapped tables when retry from different 
dump.
URL: https://github.com/apache/hive/pull/551#discussion_r262403171
 
 

 ##
 File path: itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/parse/TestReplicationScenariosExternalTables.java
 ##
 @@ -534,6 +536,90 @@ public void bootstrapExternalTablesDuringIncrementalPhase() throws Throwable {
         .verifyResults(Arrays.asList("10", "20"));
   }
 
+  @Test
+  public void retryBootstrapExternalTablesFromDifferentDump() throws Throwable {
+    List<String> loadWithClause = new ArrayList<>();
+    loadWithClause.addAll(externalTableBasePathWithClause());
+
+    List<String> dumpWithClause = Collections.singletonList(
+        "'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='false'"
+    );
+
+    WarehouseInstance.Tuple tupleBootstrapWithoutExternal = primary
+        .run("use " + primaryDbName)
+        .run("create external table t1 (id int)")
+        .run("insert into table t1 values (1)")
+        .run("create external table t2 (place string) partitioned by (country string)")
+        .run("insert into table t2 partition(country='india') values ('bangalore')")
+        .run("insert into table t2 partition(country='us') values ('austin')")
+        .run("create table t3 as select * from t1")
+        .dump(primaryDbName, null, dumpWithClause);
+
+    replica.load(replicatedDbName, tupleBootstrapWithoutExternal.dumpLocation, loadWithClause)
+        .status(replicatedDbName)
+        .verifyResult(tupleBootstrapWithoutExternal.lastReplicationId)
+        .run("use " + replicatedDbName)
+        .run("show tables")
+        .verifyResult("t3")
+        .run("select id from t3")
+        .verifyResult("1");
+
+    dumpWithClause = Arrays.asList("'" + HiveConf.ConfVars.REPL_INCLUDE_EXTERNAL_TABLES.varname + "'='true'",
+        "'" + HiveConf.ConfVars.REPL_BOOTSTRAP_EXTERNAL_TABLES.varname + "'='true'");
+    WarehouseInstance.Tuple tupleIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("drop table t1")
+        .run("create external table t4 (id int)")
+        .run("insert into table t4 values (10)")
+        .run("create table t5 as select * from t4")
+        .dump(primaryDbName, tupleBootstrapWithoutExternal.lastReplicationId, dumpWithClause);
+
+    // Verify that bootstrapping with the same dump is idempotent and returns the same result
+    for (int i = 0; i < 2; i++) {
+      replica.load(replicatedDbName, tupleIncWithExternalBootstrap.dumpLocation, loadWithClause)
+          .status(replicatedDbName)
+          .verifyResult(tupleIncWithExternalBootstrap.lastReplicationId)
+          .run("use " + replicatedDbName)
+          .run("show tables like 't1'")
+          .verifyFailure(new String[]{"t1"})
+          .run("select place from t2 where country = 'us'")
+          .verifyResult("austin")
+          .run("select id from t4")
+          .verifyResult("10")
+          .run("select id from t5")
+          .verifyResult("10");
+    }
+
+    // Drop an external table, add another managed table with the same name, insert into the
+    // existing external table and dump another bootstrap dump for external tables.
+    WarehouseInstance.Tuple tupleNewIncWithExternalBootstrap = primary.run("use " + primaryDbName)
+        .run("insert into table t2 partition(country='india') values ('chennai')")
+        .run("drop table t2")
+        .run("create table t2 as select * from t4")
+        .run("insert into table t4 values (20)")
+        .dump(primaryDbName, tupleIncWithExternalBootstrap.lastReplicationId, dumpWithClause);
+
+    // Set previous dump as bootstrap to be rolled back. Now, the new bootstrap should overwrite the old one.
+    loadWithClause.add("'" + REPL_ROLLBACK_BOOTSTRAP_LOAD_CONFIG + "'='"
+        + tupleIncWithExternalBootstrap.dumpLocation + "'");
 
 Review comment:
   Yes... will modify the current test case to fail the bootstrap in the first place before retrying.
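   For context, the retry leg of such a test might look roughly like the sketch below. It reuses only the fluent test API already visible in this diff; the step that forces the first load to fail is deliberately omitted, and `retryClause` is an illustrative name, not part of the patch:

       // After a (simulated) failed load of the old bootstrap dump, point the
       // rollback config at that dump and load the newer bootstrap dump instead.
       List<String> retryClause = new ArrayList<>(loadWithClause);
       retryClause.add("'" + REPL_ROLLBACK_BOOTSTRAP_LOAD_CONFIG + "'='"
           + tupleIncWithExternalBootstrap.dumpLocation + "'");
       replica.load(replicatedDbName, tupleNewIncWithExternalBootstrap.dumpLocation, retryClause)
           .run("use " + replicatedDbName)
           .run("select id from t4")
           .verifyResults(Arrays.asList("10", "20"));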




[jira] [Created] (HIVE-21391) LLAP: Pool of column vector buffers can cause memory pressure

2019-03-05 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-21391:


 Summary: LLAP: Pool of column vector buffers can cause memory 
pressure
 Key: HIVE-21391
 URL: https://issues.apache.org/jira/browse/HIVE-21391
 Project: Hive
  Issue Type: Bug
  Components: llap
Affects Versions: 4.0.0, 3.2.0
Reporter: Prasanth Jayachandran


When there are too many columns (in the order of 100s) with decimal or string 
types, the pool of column vector buffers created here 
[https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/io/decode/EncodedDataConsumer.java#L59]
 can cause memory pressure. 

Example:

128 (poolSize) * 300 (numCols) * 1024 (batchSize) * 80 (decimalSize) ~= 3GB

The pool size keeps increasing when there is a slow consumer but fast LLAP IO 
(SSDs), leading to GC pressure when all LLAP IO threads read splits from the 
same table. 
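As a sanity check on that arithmetic, here is a tiny standalone Java snippet 
(all four numbers are taken verbatim from the description above; the 80-byte 
decimal size is the description's approximation, not a measured value):

    public class PoolPressureEstimate {
      public static void main(String[] args) {
        long poolSize = 128;     // pooled column vector batches
        long numCols = 300;      // wide table: order of 100s of columns
        long batchSize = 1024;   // rows per batch
        long decimalBytes = 80;  // approximate bytes per decimal value
        long total = poolSize * numCols * batchSize * decimalBytes;
        // Prints ~2.9 GiB, i.e. roughly the 3GB quoted above.
        System.out.printf("~%.1f GiB%n", total / (1024.0 * 1024 * 1024));
      }
    }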





[jira] [Created] (HIVE-21390) BI split strategy does not work for blob stores

2019-03-05 Thread Prasanth Jayachandran (JIRA)
Prasanth Jayachandran created HIVE-21390:


 Summary: BI split strategy does not work for blob stores
 Key: HIVE-21390
 URL: https://issues.apache.org/jira/browse/HIVE-21390
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0, 3.2.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran


The BI split strategy cuts splits at block boundaries; however, there are no 
block boundaries in blob storage, so we end up with a single split under the BI 
split strategy. 
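For illustration, one way a split strategy can avoid depending on block 
boundaries is to synthesize fixed-size offsets from the file length alone. This 
is a generic sketch of the idea, not the actual Hive/ORC fix; `targetSplitSize` 
is an assumed tuning value:

    import java.util.ArrayList;
    import java.util.List;

    class SyntheticSplits {
      // Compute synthetic split start offsets for a file on a blob store,
      // where the filesystem reports no meaningful block boundaries.
      static List<Long> startOffsets(long fileLength, long targetSplitSize) {
        List<Long> starts = new ArrayList<>();
        for (long off = 0; off < fileLength; off += targetSplitSize) {
          starts.add(off);
        }
        return starts;
      }
    }

For a 1 GB file and a 256 MB target size, this yields four splits instead of 
the single split produced when block boundaries are absent.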


