[GitHub] drill pull request #1002: DRILL-5888: Remove dependency of SSLConfig on hado...

2017-10-18 Thread parthchandra
Github user parthchandra closed the pull request at:

https://github.com/apache/drill/pull/1002


---


[GitHub] drill pull request #1002: DRILL-5888: Remove dependency of SSLConfig on hado...

2017-10-18 Thread parthchandra
GitHub user parthchandra reopened a pull request:

https://github.com/apache/drill/pull/1002

DRILL-5888: Remove dependency of SSLConfig on hadoop.security package…

…. This allows jdbc-all-jar to be built without hadoop dependencies.

The original code used Hadoop's SSLFactory.Mode enum. The same enum is now 
replicated in the SSLConfig class. 
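
For context, a minimal sketch of what the replicated enum might look like 
(assumed to mirror Hadoop's `SSLFactory.Mode`, not copied from the PR):

```java
// Hypothetical sketch: a local copy of Hadoop's SSLFactory.Mode values,
// so SSLConfig no longer needs to import hadoop.security classes.
public enum Mode { CLIENT, SERVER }
```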

@arina-ielchiieva  please review 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/parthchandra/drill DRILL-5888

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1002.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1002


commit cded98dafa6fc548d140b15c36ecda1b560fb75c
Author: Parth Chandra 
Date:   2017-10-18T00:14:20Z

DRILL-5888: Remove dependency of SSLConfig on hadoop.security package. This 
allows jdbc-all-jar to be built without hadoop dependencies




---


[GitHub] drill issue #1002: DRILL-5888: Remove dependency of SSLConfig on hadoop.secu...

2017-10-18 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1002
  
Yes. The property is consumed by SSLConfig, which uses it to construct a 
key name that is looked up in the Hadoop config.
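
A rough, hedged sketch of that kind of lookup (class, helper, and property 
names are illustrative, not the actual SSLConfig code):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: build a mode-qualified key and resolve it
// against the Hadoop Configuration.
public class SslPropertyLookup {
  public enum Mode { CLIENT, SERVER }

  public static String lookup(Configuration conf, Mode mode, String suffix, String fallback) {
    // e.g. mode = SERVER, suffix = "keystore.location" -> "ssl.server.keystore.location"
    String key = "ssl." + mode.name().toLowerCase() + "." + suffix;
    return conf.get(key, fallback);
  }
}
```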


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread prasadns14
Github user prasadns14 commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145584186
  
--- Diff: exec/java-exec/src/main/resources/rest/index.ftl ---
@@ -112,6 +112,14 @@
 Admin User Groups
 ${model.getAdminUserGroups()}
   
+  
+Process User
+${model.getProcessUser()}
--- End diff --

yes freemarker handles this


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread prasadns14
Github user prasadns14 commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145584181
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRoot.java 
---
@@ -85,21 +86,33 @@ public ClusterInfo getClusterInfoJSON() {
 // For all other cases the user info need-not or should-not be 
displayed
 OptionManager optionManager = work.getContext().getOptionManager();
 final boolean isUserLoggedIn = AuthDynamicFeature.isUserLoggedIn(sc);
+final String processUser = isUserLoggedIn ?
+ImpersonationUtil.getProcessUserName() : null;
+final String processUserGroups = isUserLoggedIn ?
+
Joiner.on(",").join(ImpersonationUtil.getProcessUserGroupNames()) : null;
--- End diff --

This list is obtained from hadoop.security.UserGroupInformation and is not 
the user option, hence we need not worry about the spaces.


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread prasadns14
Github user prasadns14 commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145584184
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRoot.java 
---
@@ -85,21 +86,33 @@ public ClusterInfo getClusterInfoJSON() {
 // For all other cases the user info need-not or should-not be 
displayed
 OptionManager optionManager = work.getContext().getOptionManager();
 final boolean isUserLoggedIn = AuthDynamicFeature.isUserLoggedIn(sc);
+final String processUser = isUserLoggedIn ?
+ImpersonationUtil.getProcessUserName() : null;
+final String processUserGroups = isUserLoggedIn ?
+
Joiner.on(",").join(ImpersonationUtil.getProcessUserGroupNames()) : null;
 String adminUsers = isUserLoggedIn ?
 
ExecConstants.ADMIN_USERS_VALIDATOR.getAdminUsers(optionManager) : null;
 String adminUserGroups = isUserLoggedIn ?
 
ExecConstants.ADMIN_USER_GROUPS_VALIDATOR.getAdminUserGroups(optionManager) : 
null;
 
 // separate groups by comma + space
 if (adminUsers != null) {
-  String[] groups = adminUsers.split(",");
-  adminUsers = Joiner.on(", ").join(groups);
+  if (adminUsers.length() == 0) {
+adminUsers = "";
--- End diff --

yes freemarker handles this


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread prasadns14
Github user prasadns14 commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145584179
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRoot.java 
---
@@ -85,21 +86,33 @@ public ClusterInfo getClusterInfoJSON() {
 // For all other cases the user info need-not or should-not be 
displayed
 OptionManager optionManager = work.getContext().getOptionManager();
 final boolean isUserLoggedIn = AuthDynamicFeature.isUserLoggedIn(sc);
+final String processUser = isUserLoggedIn ?
+ImpersonationUtil.getProcessUserName() : null;
--- End diff --

From a security perspective it would be good to display the process user only 
if an admin user is logged in. Removed the check here since the condition is 
handled in the HTML.


---


Re: Excessive review comments

2017-10-18 Thread Paul Rogers
With all due respect, I did start a review. Is something broken?

- Paul

> On Oct 18, 2017, at 3:36 PM, julianhyde  wrote:
> 
> Github user julianhyde commented on a diff in the pull request:
> 
>https://github.com/apache/drill/pull/984#discussion_r145561518
> 
>--- Diff: 
> exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ---
>@@ -584,11 +492,14 @@ public static void defineWorkspace(Drillbit 
> drillbit, String pluginName,
>   public static final String EXPLAIN_PLAN_TEXT = "text";
>   public static final String EXPLAIN_PLAN_JSON = "json";
> 
>-  public static FixtureBuilder builder() {
>-FixtureBuilder builder = new FixtureBuilder()
>+  public static FixtureBuilder builder(DirTestWatcher dirTestWatcher) {
>--- End diff --
> 
>Jeez Paul, please start a review rather than making single review 
> comments. I just got 39 emails from you, and so did everyone else on 
> dev@drill.
> 
> 
> ---



[jira] [Created] (DRILL-5892) Distinguish between states for query statistics exposed via JMX

2017-10-18 Thread Kunal Khatua (JIRA)
Kunal Khatua created DRILL-5892:
---

 Summary: Distinguish between states for query statistics exposed 
via JMX
 Key: DRILL-5892
 URL: https://issues.apache.org/jira/browse/DRILL-5892
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Kunal Khatua
 Fix For: Future


Currently, the JMX metrics exposed are:
{code:java}
metrics:name=drill.queries.completed
metrics:name=drill.queries.running
metrics:name=drill.queries.enqueued
{code}

The completed-queries count, however, does not distinguish between the 
outcomes of those queries.

The proposal is to also report successful, failed, cancelled, and timeout 
(yet to be implemented) states:
{code:json}
  "metrics:name=drill.queries" : {
"completed" : {
  "successful": INTEGER,
  "failed": INTEGER,
  "cancelled": INTEGER,
  "timeout": INTEGER
},
"running" : INTEGER,
"enqueued" : {
  "small" : INTEGER,
  "large" : INTEGER
  }
}
{code}
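
A hedged sketch of how per-outcome counters could be wired up with the 
Dropwizard metrics API behind these JMX beans (class and method names are 
illustrative, not a committed design):

{code:java}
import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

// Illustrative only: one counter per terminal query state instead of a
// single undifferentiated "completed" count.
public class QueryOutcomeMetrics {
  private final Counter successful;
  private final Counter failed;
  private final Counter cancelled;

  public QueryOutcomeMetrics(MetricRegistry registry) {
    successful = registry.counter("drill.queries.completed.successful");
    failed = registry.counter("drill.queries.completed.failed");
    cancelled = registry.counter("drill.queries.completed.cancelled");
  }

  public void onQueryTerminated(boolean wasCancelled, boolean wasFailed) {
    if (wasCancelled) {
      cancelled.inc();
    } else if (wasFailed) {
      failed.inc();
    } else {
      successful.inc();
    }
  }
}
{code}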







--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145575103
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

Drill likes spaces around operators: `out.end - out.start`


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577107
  
--- Diff: exec/vector/src/main/codegen/templates/ValueHolders.java ---
@@ -33,34 +33,40 @@
  * This class is generated using freemarker and the ${.template_name} 
template.
  */
 public final class ${className} implements ValueHolder{
-  
+
   public static final MajorType TYPE = 
Types.${mode.name?lower_case}(MinorType.${minor.class?upper_case});
+  public MajorType getType() {return TYPE;}
 
 <#if mode.name == "Repeated">
-
+
 /** The first index (inclusive) into the Vector. **/
 public int start;
-
+
 /** The last index (exclusive) into the Vector. **/
 public int end;
-
+
 /** The Vector holding the actual values. **/
 public ${minor.class}Vector vector;
-
+
 <#else>
 public static final int WIDTH = ${type.width};
-
+
 <#if mode.name == "Optional">public int isSet;
 <#assign fields = minor.fields!type.fields />
 <#list fields as field>
 public ${field.type} ${field.name};
 
-
+
+<#if minor.class == "VarChar">
+// -1: unknown, 0: not ascii, 1: is ascii
+public int asciiMode = -1;
--- End diff --

Drill is complex, nothing is as it seems. Drill has a very elaborate 
mechanism to rewrite byte codes to do scalar replacement. That is, holders are 
added to the Java source during code generation, but then are removed during 
byte code transforms. (Yes, that is silly given that Java 8 does a fine job on 
its own; but it is how Drill works today...)

Will adding this variable change that behavior? Will the scalar replacement 
logic know not to replace this object? If so, will that hurt performance, or 
will the JVM go ahead and figure it out on its own? Is all the code that sets 
the variable inlined via code generation so that scalar replacement is possible?

Yes, indeed, nothing in Drill is as simple as it seems...
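
For readers unfamiliar with the transform, a simplified sketch of the effect 
scalar replacement is meant to have (illustrative only; Drill's actual 
byte-code rewriting is far more involved):

```java
// Illustrative sketch of scalar replacement, not Drill's actual transform.
class ScalarReplacementSketch {
  static final class VarCharHolderish { int start; int end; } // simplified holder

  // Before: generated code allocates a holder object per value.
  int beforeReplacement(int inStart, int inEnd) {
    VarCharHolderish out = new VarCharHolderish();
    out.start = inStart;
    out.end = inEnd;
    return out.end - out.start;
  }

  // After: the holder is rewritten away and its fields become plain locals,
  // removing the per-value allocation.
  int afterReplacement(int inStart, int inEnd) {
    int outStart = inStart;
    int outEnd = inEnd;
    return outEnd - outStart;
  }
}
```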


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145578320
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow the Java HotSpot compiler to vectorize
+// the generated code.
+if (patternLength == 1) {
--- End diff --

Pattern length is fixed per expression; need not be checked per row. See 
more below.


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145576718
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow the Java HotSpot compiler to vectorize
+// the generated code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
--- End diff --

See note about UTF-8. If we don't care about the match position (that is, 
we don't need `strpos()`) and all we care about is whether it matches or not, 
then we can do the work on the undecoded UTF-8 bytes, saving a large amount of 
complexity.


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577808
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow the Java HotSpot compiler to vectorize
+// the generated code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length();
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters (i.e. % and _) and escape characters removed,
+// so we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  }
+  return 1;
+}
+return 0;
+  }
+
+  private final int match_2() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() - 1;
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters (i.e. % and _) and escape characters removed,
+// so we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  } else {
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
--- End diff --

We want speed. Instead of getting the second character multiple times, is 
it better to get it once up front? I suppose that depends on the hit rate. 
The average hit rate may be 2% (1/~64). So if our input is shorter than 64 
characters, we'll have, on average, one hit, so we pay the second-character 
cost once. At 128 characters or above, we'll pay the cost two or more times. 
But maybe the JVM can optimize away the second and subsequent accesses?

Actually, let's take a step back. The pattern is fixed. We parsed the 
pattern to decide to use this particular class. Should we instead create 
1-char, 2-char and n-char matcher classes, so we get the second character (for 
the 2-char case) only once and eliminate the extra per-value if-check?
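
A hedged sketch of that alternative (class name hypothetical): the specialized 
matcher is chosen once at pattern-parse time, so the second pattern character 
is fetched exactly once and the hot loop carries no per-row length dispatch.

```java
// Hypothetical sketch: a dedicated 2-char "contains" matcher selected
// when the pattern is parsed, not per row.
final class TwoCharContainsMatcher {
  private final char first;
  private final char second; // read once, up front

  TwoCharContainsMatcher(String pattern) {
    assert pattern.length() == 2;
    first = pattern.charAt(0);
    second = pattern.charAt(1);
  }

  int match(CharSequence text) {
    final int end = text.length() - 1;
    for (int i = 0; i < end; i++) {
      if (text.charAt(i) == first && text.charAt(i + 1) == second) {
        return 1; // pattern found
      }
    }
    return 0; // no match
  }
}
```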


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145574785
  
--- Diff: 
exec/java-exec/src/main/codegen/templates/CastFunctionsSrcVarLenTargetVarLen.java
 ---
@@ -73,6 +73,9 @@ public void eval() {
 out.start =  in.start;
 if (charCount <= length.value || length.value == 0 ) {
   out.end = in.end;
+  if (charCount == (out.end-out.start)) {
+out.asciiMode = 1; // we can conclude this string is ASCII
--- End diff --

The values -1, 0, and 1 are magic numbers here. Because we want speed, we 
don't want to use an enum. But can we define symbols for our values: 
`IS_ASCII`, `NOT_ASCII`, `ASCII_UNKNOWN`? (You decide on the names...)
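
For example, a hedged sketch of such constants (names illustrative only):

```java
// Hypothetical names for the asciiMode states; plain ints keep the
// generated hot path free of enum indirection.
public final class AsciiMode {
  public static final int ASCII_UNKNOWN = -1;
  public static final int NOT_ASCII = 0;
  public static final int IS_ASCII = 1;

  private AsciiMode() {}
}
```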


---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1001#discussion_r145577894
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -17,36 +17,133 @@
  */
 package org.apache.drill.exec.expr.fn.impl;
 
-public class SqlPatternContainsMatcher implements SqlPatternMatcher {
+public final class SqlPatternContainsMatcher implements SqlPatternMatcher {
   final String patternString;
   CharSequence charSequenceWrapper;
   final int patternLength;
 
   public SqlPatternContainsMatcher(String patternString, CharSequence 
charSequenceWrapper) {
-this.patternString = patternString;
+this.patternString   = patternString;
 this.charSequenceWrapper = charSequenceWrapper;
-patternLength = patternString.length();
+patternLength= patternString.length();
   }
 
   @Override
-  public int match() {
-final int txtLength = charSequenceWrapper.length();
-int patternIndex = 0;
-int txtIndex = 0;
+  public final int match() {
+// The idea is to write loops with simple condition checks to allow the Java HotSpot compiler to vectorize
+// the generated code.
+if (patternLength == 1) {
+  return match_1();
+} else if (patternLength == 2) {
+  return match_2();
+} else if (patternLength == 3) {
+  return match_3();
+} else {
+  return match_N();
+}
+  }
+
+  private final int match_1() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length();
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters (i.e. % and _) and escape characters removed,
+// so we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  }
+  return 1;
+}
+return 0;
+  }
+
+  private final int match_2() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() - 1;
+final char first_patt_char = patternString.charAt(0);
+
+// simplePattern string has meta characters (i.e. % and _) and escape characters removed,
+// so we can just directly compare.
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
+  } else {
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
+
+if (ch2_1 == ch2_2) {
+  return 1;
+}
+  }
+}
+return 0;
+  }
+
+  private final int match_3() {
+final CharSequence sequenceWrapper = charSequenceWrapper;
+final int lengthToProcess  = sequenceWrapper.length() -2;
+final char first_patt_char = patternString.charAt(0);
 
  // simplePattern string has meta characters (i.e. % and _) and escape characters removed,
  // so we can just directly compare.
-while (patternIndex < patternLength && txtIndex < txtLength) {
-  if (patternString.charAt(patternIndex) != 
charSequenceWrapper.charAt(txtIndex)) {
-// Go back if there is no match
-txtIndex = txtIndex - patternIndex;
-patternIndex = 0;
+for (int idx = 0; idx < lengthToProcess; idx++) {
+  char input_char = sequenceWrapper.charAt(idx);
+
+  if (first_patt_char != input_char) {
+continue;
   } else {
-patternIndex++;
+char ch2_1 = sequenceWrapper.charAt(idx+1);
+char ch2_2 = patternString.charAt(1);
+char ch3_1 = sequenceWrapper.charAt(idx+2);
+char ch3_2 = patternString.charAt(2);
+
+if (ch2_1 == ch2_2 && ch3_1 == ch3_2) {
+  return 1;
+}
   }
-  txtIndex++;
+}
+return 0;
+  }
+
+  private final int match_N() {
+
+if (patternLength == 0) {
--- End diff --

Can't this be optimized away in the pattern parser stage? We should not be 
calling this function if we have a zero-length pattern.


---


[GitHub] drill issue #997: DRILL-5882: C++ Client: [Threat Modeling] Drillbit may be ...

2017-10-18 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/997
  
+1. 


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145575975
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestAggNullable.java ---
@@ -21,29 +21,25 @@
 
 import org.apache.drill.categories.OperatorTest;
 import org.apache.drill.common.util.TestTools;
+import org.apache.drill.test.BaseTestQuery;
 import org.junit.Test;
 import org.junit.experimental.categories.Category;
 
 @Category(OperatorTest.class)
-public class TestAggNullable extends BaseTestQuery{
+public class TestAggNullable extends BaseTestQuery {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestAggNullable.class);
 
-  static final String WORKING_PATH = TestTools.getWorkingPath();
-  static final String TEST_RES_PATH = WORKING_PATH + "/src/test/resources";
-
   private static void enableAggr(boolean ha, boolean sa) throws Exception {
 
-test(String.format("alter session set `planner.enable_hashagg` = %s", 
ha ? "true":"false"));
-test(String.format("alter session set `planner.enable_streamagg` = 
%s", sa ? "true":"false"));
+test("alter session set `planner.enable_hashagg` = %s", ha);
+test("alter session set `planner.enable_streamagg` = %s", sa);
--- End diff --

We can update these after the other PR goes in then.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145575781
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -247,7 +248,7 @@ drill.exec: {
 }
   },
   sort: {
-purge.threshold : 10,
+purge.threshold : 1000,
--- End diff --

Oh my. Thanks for catching that!


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
Paul,

- I think you misunderstood the proposal
- Let me use an example
- select .. c1 like '%pattern1%' OR c1 like '%pattern2'..
- Assume c1 has 3 values [v1, v2, v3]
- The generated code will create a new VarCharHolder instance for each 
value iteration
- For v1: VarCharHolder vc1 is created, ascii-mode is computed for pattern1, 
and the ascii-mode computation is reused for pattern2, pattern3, etc., since 
we're evaluating the same value
- For v2: VarCharHolder vc1 is created, ascii-mode is computed for pattern1, 
and the ascii-mode computation is reused for pattern2, pattern3, etc., since 
we're evaluating the same value
- Ditto for v3

Note that the test suite has test cases similar to the ones you are proposing; 
they all passed.  


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145575588
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java
 ---
@@ -116,6 +116,14 @@ public void clear() {
 }
   }
 
+  public static int getBatchIndex(int sv4Index) {
+    return (sv4Index >> 16) & 0xFFFF;
+  }
+
+  public static int getRecordIndex(int sv4Index) {
+    return (sv4Index) & 0xFFFF;
+  }
--- End diff --

Didn't do any performance testing. I mainly intended to use these methods 
for testing purposes. For example, I use them in 
org.apache.drill.test.BatchUtils.
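
For context, a small hedged sketch of the packing these accessors invert 
(assuming the usual 16-bit batch / 16-bit record SV4 layout; the helper name 
is hypothetical):

```java
// Illustrative only: pack a (batch, record) pair the way
// getBatchIndex/getRecordIndex unpack it.
final class Sv4IndexSketch {
  static int packSv4Index(int batchIndex, int recordIndex) {
    return (batchIndex << 16) | (recordIndex & 0xFFFF);
  }
  // packSv4Index(3, 42) == 0x0003002A
  // getBatchIndex(0x0003002A) == 3, getRecordIndex(0x0003002A) == 42
}
```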


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145575310
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/TopNBatch.java
 ---
@@ -335,20 +336,42 @@ private void purge() throws SchemaChangeException {
 logger.debug("Took {} us to purge", 
watch.elapsed(TimeUnit.MICROSECONDS));
   }
 
-  public PriorityQueue createNewPriorityQueue(FragmentContext context, List<Ordering> orderings,
-      VectorAccessible batch, MappingSet mainMapping, MappingSet leftMapping, MappingSet rightMapping)
-      throws ClassTransformationException, IOException, SchemaChangeException {
-    CodeGenerator<PriorityQueue> cg = CodeGenerator.get(PriorityQueue.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
+  private PriorityQueue createNewPriorityQueue(VectorAccessible batch, int 
limit)
+throws SchemaChangeException, ClassTransformationException, 
IOException {
+return createNewPriorityQueue(
+  mainMapping, leftMapping, rightMapping, context.getOptionSet(), 
context.getFunctionRegistry(), context.getDrillbitContext().getCompiler(),
+  config.getOrderings(), batch, unionTypeEnabled, codegenDump, limit, 
oContext.getAllocator(), schema.getSelectionVectorMode());
+  }
+
+  public static MappingSet createMainMappingSet() {
+return new MappingSet((String) null, null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createLeftMappingSet() {
+return new MappingSet("leftIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createRightMappingSet() {
+return new MappingSet("rightIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static PriorityQueue createNewPriorityQueue(
+MappingSet mainMapping, MappingSet leftMapping, MappingSet 
rightMapping,
--- End diff --

Agreed. Pritesh has asked me to look at Codegen improvements in the long 
term, so I will work on this down the line.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145574915
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/TopNBatch.java
 ---
@@ -335,20 +336,42 @@ private void purge() throws SchemaChangeException {
 logger.debug("Took {} us to purge", 
watch.elapsed(TimeUnit.MICROSECONDS));
   }
 
-  public PriorityQueue createNewPriorityQueue(FragmentContext context, List<Ordering> orderings,
-      VectorAccessible batch, MappingSet mainMapping, MappingSet leftMapping, MappingSet rightMapping)
-      throws ClassTransformationException, IOException, SchemaChangeException {
-    CodeGenerator<PriorityQueue> cg = CodeGenerator.get(PriorityQueue.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
+  private PriorityQueue createNewPriorityQueue(VectorAccessible batch, int 
limit)
+throws SchemaChangeException, ClassTransformationException, 
IOException {
+return createNewPriorityQueue(
+  mainMapping, leftMapping, rightMapping, context.getOptionSet(), 
context.getFunctionRegistry(), context.getDrillbitContext().getCompiler(),
+  config.getOrderings(), batch, unionTypeEnabled, codegenDump, limit, 
oContext.getAllocator(), schema.getSelectionVectorMode());
+  }
+
+  public static MappingSet createMainMappingSet() {
+return new MappingSet((String) null, null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createLeftMappingSet() {
+return new MappingSet("leftIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createRightMappingSet() {
+return new MappingSet("rightIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static PriorityQueue createNewPriorityQueue(
+MappingSet mainMapping, MappingSet leftMapping, MappingSet 
rightMapping,
--- End diff --

We cannot declare the MappingSets as constants because they contain some 
internal state (the mappingIndex field). If they were constants the state could 
become corrupt in the case where multiple queries run and create PriorityQueues 
on the same node simultaneously, so we have to create new MappingSets each time 
we create a new priority queue.


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1001
  
- It was 50% for 1) and 50% for 2)
- Note that this breakdown depends on:
   o The number of Contains patterns for the same value (impacts 1))
   o The pattern length (impacts both 1) and 2))  


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1001
  
@sachouche, thanks for the first PR to Drill! Thanks for the detailed 
explanation!

Before reviewing the code, a comment on the design:

> Added a new integer variable "asciiMode" ... this value will be set ... 
during the first LIKE evaluation and will be reused across other LIKE 
evaluations

The problem with this design is that there is no guarantee that the first 
value is representative of the other values. Maybe my list looks like this:

```
Hello
你好
```

The first value is ASCII. The second is not. So, we must treat each value 
as independent of the others.

On the other hand, we *can* exploit the nature of UTF-8. The encoding is 
such that no valid UTF-8 character is a prefix of any other valid character. 
Thus, if a character is 0xXX 0xYY 0xZZ, then there can *never* be a valid 
character which is 0xXX 0xYY. As a result, starts-with, ends-with, equals and 
contains can be done without either converting to UTF-16 or even caring whether 
the data is ASCII or not.

What does this mean? It means that, for the simple operations:

1. Convert the Java UTF-16 string to UTF-8.
2. Do the classic byte comparison methods for starts with, ends with or 
contains. No special processing is needed for multi-byte characters.

Unlike other multi-byte encodings, UTF-8 was designed to make this possible.

If we go this route, we would not need the ASCII mode flag.

Note: all of this applies only to the "basic four" operations: if we do a 
real regex, then we must decode the Varchar into a Java UTF-16 string.
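
A minimal sketch of the byte-level `contains` this property enables (assuming 
both pattern and input are valid UTF-8; class and method names hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: because UTF-8 is self-synchronizing, an exact
// substring match on the raw bytes can never match "half" a character.
final class Utf8Contains {
  static boolean contains(byte[] haystack, String pattern) {
    final byte[] needle = pattern.getBytes(StandardCharsets.UTF_8);
    outer:
    for (int i = 0; i + needle.length <= haystack.length; i++) {
      for (int j = 0; j < needle.length; j++) {
        if (haystack[i + j] != needle[j]) {
          continue outer;
        }
      }
      return true; // all needle bytes matched at offset i
    }
    return false;
  }
}
```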


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145574026
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/PriorityQueue.java
 ---
@@ -20,22 +20,58 @@
 import org.apache.drill.exec.compile.TemplateClassDefinition;
 import org.apache.drill.exec.exception.SchemaChangeException;
 import org.apache.drill.exec.memory.BufferAllocator;
-import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
 import org.apache.drill.exec.record.VectorContainer;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
 public interface PriorityQueue {
-  public void add(FragmentContext context, RecordBatchData batch) throws 
SchemaChangeException;
-  public void init(int limit, FragmentContext context, BufferAllocator 
allocator, boolean hasSv2) throws SchemaChangeException;
-  public void generate() throws SchemaChangeException;
-  public VectorContainer getHyperBatch();
-  public SelectionVector4 getHeapSv4();
-  public SelectionVector4 getFinalSv4();
-  public boolean validate();
-  public void resetQueue(VectorContainer container, SelectionVector4 
vector4) throws SchemaChangeException;
-  public void cleanup();
-
-  public static TemplateClassDefinition<PriorityQueue> TEMPLATE_DEFINITION = new TemplateClassDefinition<>(PriorityQueue.class, PriorityQueueTemplate.class);
+  /**
+   * The elements in the given batch are added to the priority queue. Note 
that the priority queue
+   * only retains the top elements that fit within the size specified by 
the {@link #init(int, BufferAllocator, boolean)}
+   * method.
+   * @param batch The batch containing elements we want to add.
+   * @throws SchemaChangeException
+   */
+  void add(RecordBatchData batch) throws SchemaChangeException;
 
+  /**
+   * Initializes the priority queue. This method must be called before any 
other methods on the priority
+   * queue are called.
+   * @param limit The size of the priority queue.
+   * @param allocator The {@link BufferAllocator} to use when creating the 
priority queue.
+   * @param hasSv2 True when incoming batches have v2 selection vectors. 
False otherwise.
+   * @throws SchemaChangeException
+   */
+  void init(int limit, BufferAllocator allocator, boolean hasSv2) throws 
SchemaChangeException;
+
+  /**
+   * This method must be called before fetching the final heap hyper batch 
and final Sv4 vector.
+   * @throws SchemaChangeException
+   */
+  void generate() throws SchemaChangeException;
+
+  /**
+   * Retrieves the final priority queue HyperBatch containing the results. 
Note: this should be called
+   * after {@link #generate()}.
+   * @return The final priority queue HyperBatch containing the results.
+   */
+  VectorContainer getHyperBatch();
+
+  SelectionVector4 getSv4();
+
+  /**
+   * Retrieves the selection vector used to select the elements in the 
priority queue from the hyper batch
+   * provided by the {@link #getHyperBatch()} method. Note: this 
should be called after {@link #generate()}.
+   * @return The selection vector used to select the elements in the 
priority queue.
+   */
+  SelectionVector4 getFinalSv4();
--- End diff --

The code already works that way. There is a config option 
drill.exec.sort.purge.threshold which controls the maximum number of 
batches allowed in the hyper batch. Once the threshold is exceeded, the top N 
values are consolidated into a single batch and the process is repeated. There 
is an issue in the case where the limit is large, e.g. 100,000,000. In that 
case the operator will keep all the records in memory and die. There is a JIRA 
to address this issue: 
[https://issues.apache.org/jira/browse/DRILL-5823]
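
For reference, the threshold can be tuned through the standard Drill config, 
e.g. in `drill-override.conf` (value illustrative):

```
drill.exec.sort.purge.threshold: 1000
```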


---


[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements

2017-10-18 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1001
  
@sachouche Do you have a breakdown of how much gain we got with 1 vs 2? 
Since the changes for 2 are not straightforward or easy to maintain, I am 
weighing performance gain vs. maintainability of the code. 


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145565896
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRestServer.java
 ---
@@ -204,9 +217,15 @@ public WebUserConnection provide() {
 
config.getLong(ExecConstants.HTTP_SESSION_MEMORY_RESERVATION),
 config.getLong(ExecConstants.HTTP_SESSION_MEMORY_MAXIMUM));
 
+// Create a dummy close future which is needed by Foreman only. Foreman uses this future to add a close
+// listener to know about the channel close event from the underlying layer. We use this future to notify Foreman
+// listeners when the Web connection between Web Client and WebServer is closed. This will help Foreman to cancel
+// all the running queries for this Web Client.
--- End diff --

The query is killed when the HttpSession is invalidated, either by logout or 
by timeout, in the case of an authenticated user session. But for an 
unauthenticated or anonymous user we don't kill the query after the HttpSession 
timeout, since we don't maintain any session-related information. The query 
will keep running unless it is explicitly cancelled from the /profile page.


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145565968
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRestServer.java
 ---
@@ -260,8 +282,14 @@ public WebUserConnection provide() {
 logger.trace("Failed to get the remote address of the http session 
request", ex);
   }
 
-  final WebSessionResources webSessionResources = new 
WebSessionResources(sessionAllocator,
-  remoteAddress, drillUserSession);
+  // Create a dummy close future which is needed by Foreman only. Foreman uses this future to add a close
+  // listener to know about the channel close event from the underlying layer. We use this future to notify Foreman
+  // listeners when the Web connection between Web Client and WebServer is closed. This will help Foreman to cancel
--- End diff --

Will fix the comment


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner-mapr
Github user ilooner-mapr commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145572262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/PriorityQueue.java
 ---
@@ -20,22 +20,58 @@
 import org.apache.drill.exec.compile.TemplateClassDefinition;
 import org.apache.drill.exec.exception.SchemaChangeException;
 import org.apache.drill.exec.memory.BufferAllocator;
-import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
 import org.apache.drill.exec.record.VectorContainer;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
 public interface PriorityQueue {
-  public void add(FragmentContext context, RecordBatchData batch) throws 
SchemaChangeException;
-  public void init(int limit, FragmentContext context, BufferAllocator 
allocator, boolean hasSv2) throws SchemaChangeException;
-  public void generate() throws SchemaChangeException;
-  public VectorContainer getHyperBatch();
-  public SelectionVector4 getHeapSv4();
-  public SelectionVector4 getFinalSv4();
-  public boolean validate();
-  public void resetQueue(VectorContainer container, SelectionVector4 
vector4) throws SchemaChangeException;
-  public void cleanup();
-
-  public static TemplateClassDefinition<PriorityQueue> TEMPLATE_DEFINITION = new TemplateClassDefinition<>(PriorityQueue.class, PriorityQueueTemplate.class);
+  /**
--- End diff --

I think the priority queue is generic and not specific to TopN. I agree we 
should make an effort to consolidate the code generated data structures.


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145568772
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java
 ---
@@ -124,6 +124,13 @@ public void sendData(RpcOutcomeListener listener, 
QueryWritableBatch result
 }
   }
 
+  /**
+   * Returns a DefaultChannelPromise which doesn't have a reference to any actual channel but has an EventExecutor
+   * associated with it. In this case we use an EventExecutor out of the BitServer EventLoopGroup. Since there is no
+   * actual connection established using this class, the close event will never be fired by the underlying layer, and
+   * the close future is set only when the WebSessionResources are closed.
--- End diff --

The intent of putting this comment here was just as an FYI on how the 
underlying CloseFuture returned by this function is created, stressing the 
fact that it doesn't have any actual physical connection associated with it.

WebUserConnection acts like an adapter between the HttpSession and the 
connection which Foreman expects with each submitted request. For an 
authenticated user this adapter is used to get the UserSession object, which 
stores all the Drill session options changed over time. 

For an anonymous user this adapter is created fresh with each request. But as 
discussed in person, I will make a change so that we maintain a session for 
anonymous users too and get behavior similar to authenticated users. And when 
the HttpSession for an unauthenticated user times out, it will clean up all 
the web resources related to that session.


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145565944
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRestServer.java
 ---
@@ -204,9 +217,15 @@ public WebUserConnection provide() {
 
config.getLong(ExecConstants.HTTP_SESSION_MEMORY_RESERVATION),
 config.getLong(ExecConstants.HTTP_SESSION_MEMORY_MAXIMUM));
 
+// Create a dummy close future which is needed by Foreman only. Foreman uses this future to add a close
+// listener to know about the channel close event from the underlying layer. We use this future to notify Foreman
+// listeners when the Web connection between Web Client and WebServer is closed. This will help Foreman to cancel
+// all the running queries for this Web Client.
+final ChannelPromise closeFuture = new DefaultChannelPromise(null, 
executor);
+
 // Create a WebSessionResource instance which owns the lifecycle 
of all the session resources.
-// Set this instance as an attribute of HttpSession, since it will 
be used until session is destroyed.
-webSessionResources = new WebSessionResources(sessionAllocator, 
remoteAddress, drillUserSession);
+// Set this instance as an attribute of HttpSession, since it will 
be used until session is destroyed
+webSessionResources = new WebSessionResources(sessionAllocator, 
remoteAddress, drillUserSession, closeFuture);
--- End diff --

Will fix the comment. Sorry for confusion.


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145571583
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/server/rest/WebSessionResourcesTest.java
 ---
@@ -0,0 +1,146 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.server.rest;
+
+import io.netty.channel.ChannelPromise;
+import io.netty.channel.DefaultChannelPromise;
+import io.netty.util.concurrent.EventExecutor;
+import io.netty.util.concurrent.Future;
+import io.netty.util.concurrent.GenericFutureListener;
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.rpc.TransportCheck;
+import org.apache.drill.exec.rpc.user.UserSession;
+import org.junit.Test;
+
+import java.net.SocketAddress;
+import java.util.concurrent.CountDownLatch;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import static org.mockito.Matchers.any;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.verify;
+
+public class WebSessionResourcesTest {
+  //private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(WebSessionResourcesTest.class);
+
+  private WebSessionResources webSessionResources;
+
+  private boolean listenerComplete;
+
+  private CountDownLatch latch;
+
+  private EventExecutor executor;
+
+
--- End diff --

Sure, will add a comment here. The purpose of this test was to validate that 
`WebSessionResources:close` works as expected when the closeFuture used by 
WebSessionResources has a null channel.


---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145566366
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebSessionResources.java
 ---
@@ -68,16 +68,19 @@ public SocketAddress getRemoteAddress() {
 
   @Override
   public void close() {
-
 try {
   AutoCloseables.close(webUserSession, allocator);
 } catch (Exception ex) {
   logger.error("Failure while closing the session resources", ex);
 }
 
-// Set the close future associated with this session.
+// Notify all the listeners of this closeFuture for failure events so 
that listeners can do cleanup related to this
+// WebSession. This will be called after every query execution by 
AnonymousWebUserConnection::cleanupSession and
--- End diff --

From my understanding the session is tied to a cookie, not a connection. In 
the case of an anonymous session, we call close after completion of each 
query. But for an authenticated session, close will be called when the 
HttpSession is invalidated, which happens on logout or session timeout.

---


[GitHub] drill pull request #993: DRILL-5874: NPE in AnonWebUserConnection.cleanupSes...

2017-10-18 Thread sohami
Github user sohami commented on a diff in the pull request:

https://github.com/apache/drill/pull/993#discussion_r145495729
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRestServer.java
 ---
@@ -204,9 +217,15 @@ public WebUserConnection provide() {
 
config.getLong(ExecConstants.HTTP_SESSION_MEMORY_RESERVATION),
 config.getLong(ExecConstants.HTTP_SESSION_MEMORY_MAXIMUM));
 
+// Create a dummy close future which is needed by Foreman only. Foreman uses this future to add a close
+// listener to know about the channel close event from the underlying layer. We use this future to notify Foreman
+// listeners when the Web connection between Web Client and WebServer is closed. This will help Foreman to cancel
+// all the running queries for this Web Client.
--- End diff --

We are using the [web 
session](https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebServer.java#L193)
 concept for Drill's WebServer.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner-mapr
Github user ilooner-mapr commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145571158
  
--- Diff: common/src/test/java/org/apache/drill/test/DirTestWatcher.java ---
@@ -32,23 +32,50 @@
 public class DirTestWatcher extends TestWatcher {
   private String dirPath;
   private File dir;
+  private boolean deleteDirAtEnd = true;
--- End diff --

Done


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145569370
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRoot.java 
---
@@ -85,21 +86,33 @@ public ClusterInfo getClusterInfoJSON() {
 // For all other cases the user info need-not or should-not be 
displayed
 OptionManager optionManager = work.getContext().getOptionManager();
 final boolean isUserLoggedIn = AuthDynamicFeature.isUserLoggedIn(sc);
+final String processUser = isUserLoggedIn ?
+ImpersonationUtil.getProcessUserName() : null;
+final String processUserGroups = isUserLoggedIn ?
+
Joiner.on(",").join(ImpersonationUtil.getProcessUserGroupNames()) : null;
--- End diff --

Are we guaranteed that the list contains no spaces around the commas? (When 
reading from ZK, the string is user entered, and the string might be "fred   ,  
 bob". Do we have to worry about that here?) If the string might contain extra 
spaces, we might want to use code that Karthik provided to split, strip spaces, 
and combine the names.


---


[GitHub] drill pull request #998: DRILL-5887: Display process user/groups info in Dri...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/998#discussion_r145569034
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/DrillRoot.java 
---
@@ -85,21 +86,33 @@ public ClusterInfo getClusterInfoJSON() {
 // For all other cases the user info need-not or should-not be 
displayed
 OptionManager optionManager = work.getContext().getOptionManager();
 final boolean isUserLoggedIn = AuthDynamicFeature.isUserLoggedIn(sc);
+final String processUser = isUserLoggedIn ?
+ImpersonationUtil.getProcessUserName() : null;
--- End diff --

Do we want to display the process user only if the user is logged in? Here, 
are we assuming logged in means authenticated? Why not display the process user 
otherwise?

Rather than null, should we just use a tag such as ""?


---


[GitHub] drill pull request #997: DRILL-5882: C++ Client: [Threat Modeling] Drillbit ...

2017-10-18 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/997#discussion_r145568130
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -595,6 +611,12 @@ connectionStatus_t 
DrillClientImpl::validateHandshake(DrillUserProperties* prope
 
 switch(this->m_handshakeStatus) {
 case exec::user::SUCCESS:
+// Check if client needs auth/encryption and server is not 
requiring it
--- End diff --

Yes. The control flow goes through the AUTH_REQUIRED case when the server 
requires auth.


---


[GitHub] drill pull request #997: DRILL-5882: C++ Client: [Threat Modeling] Drillbit ...

2017-10-18 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/997#discussion_r145567205
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -595,6 +611,12 @@ connectionStatus_t 
DrillClientImpl::validateHandshake(DrillUserProperties* prope
 
 switch(this->m_handshakeStatus) {
 case exec::user::SUCCESS:
+// Check if client needs auth/encryption and server is not 
requiring it
+if(clientNeedsAuthentication(properties) || 
clientNeedsEncryption(properties)) {
--- End diff --

- Externalized the messages to errmsgs.cpp 
- Changed the error message to "Client needs a secure connection but server 
does not support... " to account for the case where auth and/or enc is required 
by the client but missing on the server


---


[GitHub] drill pull request #996: DRILL-5878: TableNotFound exception is being report...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/996#discussion_r145565621
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SchemaUtilites.java
 ---
@@ -149,6 +168,16 @@ public static void throwSchemaNotFoundException(final 
SchemaPlus defaultSchema,
 .build(logger);
   }
 
+  /** Utility method to throw {@link UserException} with context 
information */
+  public static void throwSchemaNotFoundException(final String 
defaultSchema, final String givenSchemaPath) {
+throw UserException.validationError()
+.message("Schema [%s] is not valid with respect to either root 
schema or current default schema.",
--- End diff --

Because names can contain dots, we must format the schema more carefully. The 
schema should consist of a list (array) of name parts. For each name:

* If the name contains a dot, wrap it in back ticks: `` `a.b` ``
* Otherwise, use the name as is: `c`
* Concatenate the name parts with dots: `` `a.b`.c ``

Code to do this might already exist. @vvysotskyi may know.
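
A hedged sketch of that formatting rule (helper name hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: quote any name part containing a dot in back ticks,
// then join the parts with dots.
final class SchemaPathFormatter {
  static String format(List<String> nameParts) {
    return nameParts.stream()
        .map(part -> part.contains(".") ? "`" + part + "`" : part)
        .collect(Collectors.joining("."));
  }
}
// SchemaPathFormatter.format(Arrays.asList("a.b", "c")) -> "`a.b`.c"
```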


---


[GitHub] drill pull request #996: DRILL-5878: TableNotFound exception is being report...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/996#discussion_r145564952
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SchemaUtilites.java
 ---
@@ -77,6 +77,22 @@ public static SchemaPlus findSchema(final SchemaPlus 
defaultSchema, final String
 return findSchema(defaultSchema, schemaPathAsList);
   }
 
+  /**
+   * Utility function to get the commonPrefix schema between two supplied 
schemas.
+   * @param defaultSchema default schema
+   * @param schemaPath current schema path
+   * @param isCaseSensitive true if case-sensitive comparison is required.
+   * @return common prefix schemaPath
+   */
+  public static String getPrefixSchemaPath(final String defaultSchema, 
final String schemaPath, final boolean isCaseSensitive) {
+if (!isCaseSensitive) {
+  return Strings.commonPrefix(defaultSchema.toLowerCase(), 
schemaPath.toLowerCase());
--- End diff --

Is the schema a single name? Probably not since we are talking about common 
prefixes.

Drill supports names with dots. This is done by using an object per name 
part. (A `NamePart` in a `SchemaPath`.) Here are we concatenating the names? If 
so, won't e have false matches?

```
['a', 'b.c'] --> "a.b.c"
['a.b', 'c'] --> "a.b.c"
```
The above are two distinct paths, but they will (wrongly) compare equal 
when concatenated.

Instead, do we want a list of name parts, and want to compare the names 
part by part? Does Calcite represent the path as a collection of objects? (If 
not, we've got larger problems...)


---


[GitHub] drill pull request #996: DRILL-5878: TableNotFound exception is being report...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/996#discussion_r145565023
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SchemaUtilites.java
 ---
@@ -121,6 +137,9 @@ public static String getSchemaPath(List 
schemaPath) {
 return SCHEMA_PATH_JOINER.join(schemaPath);
   }
 
+  /** Utility method to get the schema path from one or more strings. */
+  public static String getSchemaPath(String s1, String s2, String... rest) 
{ return SCHEMA_PATH_JOINER.join(s1,s2,rest); }
--- End diff --

Can't join the names for reasons cited above.


---


[GitHub] drill pull request #996: DRILL-5878: TableNotFound exception is being report...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/996#discussion_r145564522
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/SchemaUtilites.java
 ---
@@ -77,6 +77,22 @@ public static SchemaPlus findSchema(final SchemaPlus 
defaultSchema, final String
 return findSchema(defaultSchema, schemaPathAsList);
   }
 
+  /**
+   * Utility function to get the commonPrefix schema between two supplied 
schemas.
+   * @param defaultSchema default schema
+   * @param schemaPath current schema path
+   * @param isCaseSensitive true if caseSensitive comparision is required.
+   * @return common prefix schemaPath
+   */
+  public static String getPrefixSchemaPath(final String defaultSchema, 
final String schemaPath, final boolean isCaseSensitive) {
--- End diff --

Isn't Drill always case insensitive since it follows SQL naming rules? 
We've often discussed how to handle case insensitive data sources, but we have 
no uniform, workable design. Is there a pattern we are following here?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread julianhyde
Github user julianhyde commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145561518
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ---
@@ -584,11 +492,14 @@ public static void defineWorkspace(Drillbit drillbit, 
String pluginName,
   public static final String EXPLAIN_PLAN_TEXT = "text";
   public static final String EXPLAIN_PLAN_JSON = "json";
 
-  public static FixtureBuilder builder() {
-FixtureBuilder builder = new FixtureBuilder()
+  public static FixtureBuilder builder(DirTestWatcher dirTestWatcher) {
--- End diff --

Jeez Paul, please start a review rather than making single review comments. 
I just got 39 emails from you, and so did everyone else on dev@drill.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread ilooner-mapr
Github user ilooner-mapr commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145561373
  
--- Diff: common/src/main/java/org/apache/drill/common/util/TestTools.java 
---
@@ -17,15 +17,28 @@
  */
 package org.apache.drill.common.util;
 
+import java.io.File;
+import java.io.IOException;
 import java.nio.file.Paths;
 
+import org.apache.commons.io.FileUtils;
 import org.junit.rules.TestName;
 import org.junit.rules.TestRule;
 import org.junit.rules.Timeout;
 
+import static org.apache.drill.common.util.TestTools.DataType.PROJECT;
+import static org.apache.drill.common.util.TestTools.DataType.RESOURCE;
+
 public class TestTools {
   // private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestTools.class);
 
+  public enum DataType {
--- End diff --

Done. Changed to `FileSource`.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145522320
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/ExampleTest.java ---
@@ -59,6 +59,9 @@
 @Ignore
 public class ExampleTest {
 
+  @Rule
+  public final DirTestWatcher dirTestWatcher = new DirTestWatcher();
--- End diff --

This is meant as an example, so perhaps include some Javadoc to explain 
what this is and how to use it. Maybe explain how to use the temp directories, 
and include a new example test below that illustrates the main points. People 
are busy; they often don't have time to read documentation, but they will 
copy & paste a good example...
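
For instance (a sketch only; `DirTestWatcher` and its `getDirPath()` method are 
assumed from their use elsewhere in this patch):

```
import java.io.File;

import org.apache.drill.test.DirTestWatcher;
import org.junit.Rule;
import org.junit.Test;

public class ExampleDirTest {
  /**
   * Creates a fresh directory for each test and cleans it up
   * when the test finishes.
   */
  @Rule
  public final DirTestWatcher dirTestWatcher = new DirTestWatcher();

  @Test
  public void exampleTempDirTest() {
    // Write generated code, logs, or data files under the per-test dir.
    File testDir = new File(dirTestWatcher.getDirPath());
    File tableDir = new File(testDir, "myTable");
    tableDir.mkdirs();
    // ... run a query against tableDir, inspect outputs ...
  }
}
```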


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145516665
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BatchUtils.java ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.vector.ValueVector;
+import org.junit.Assert;
+
+import java.io.UnsupportedEncodingException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+
+public class BatchUtils {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(BatchUtils.class);
+
+  public static Map 
containerToObjects(VectorContainer vectorContainer) {
--- End diff --

Do we really want to create yet another way to muck with data? Any reason 
we can't use the `RowSet` concepts here? I fear that if we have multiple ways 
to do the same thing, we won't achieve critical mass on any of them and we'll 
have a bunch of half-baked solutions. I'd rather have one fully-baked solution 
we use over and over. Allows new tests to get done easily because the required 
tools already exist. This, in turn, encourages testing.

What are we doing here that `RowSet` can't (yet) do?
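
For comparison, a rough sketch of the same kind of check using the `RowSet` 
tools (assuming the `RowSetBuilder` and `RowSetComparison` classes from the 
existing test framework, plus an `allocator` and `batchSchema` from the test):

```
import org.apache.drill.test.rowSet.RowSet;
import org.apache.drill.test.rowSet.RowSetBuilder;
import org.apache.drill.test.rowSet.RowSetComparison;

// Expected and actual rows built in a type-aware, schema-driven way.
RowSet expected = new RowSetBuilder(allocator, batchSchema)
    .add(110, 10)
    .add(109, 9)
    .build();
RowSet actual = new RowSetBuilder(allocator, batchSchema)
    .add(110, 10)
    .add(109, 9)
    .build();

// Verifies row-by-row equality and frees both row sets.
new RowSetComparison(expected).verifyAndClearAll(actual);
```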


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145522737
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/TableFileBuilder.java ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+
+import java.io.BufferedOutputStream;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.util.List;
+
+public class TableFileBuilder
+{
+  private final List columnNames;
+  private final List columnFormatters;
+  private final List rows = Lists.newArrayList();
+  private final String rowString;
+
+  public TableFileBuilder(List columnNames, List 
columnFormatters) {
--- End diff --

See prior notes. We now have three ways to build a collection of rows...


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145505524
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestCorruptParquetDateCorrection.java
 ---
@@ -76,43 +78,50 @@
   //- detecting corrupt values must be deferred to actual data 
page reading
   //- one from 1.4, where there is a proper created-by, but the 
corruption is present
   private static final String MIXED_CORRUPTED_AND_CORRECT_DATES_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/mixed_drill_versions";
+  "/parquet/4203_corrupt_dates/mixed_drill_versions";
   // partitioned with 1.2.0, no certain metadata that these were written 
with Drill
   // the value will be checked to see that they look corrupt and they will 
be corrected
   // by default. Users can use the format plugin option 
autoCorrectCorruptDates to disable
   // this behavior if they have foreign parquet files with valid rare date 
values that are
   // in the similar range as Drill's corrupt values
+  private static final String PARTITIONED_1_2_FOLDER = 
"partitioned_with_corruption_4203_1_2";
   private static final String CORRUPTED_PARTITIONED_DATES_1_2_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/partitioned_with_corruption_4203_1_2";
+  Paths.get("/parquet/4203_corrupt_dates", 
PARTITIONED_1_2_FOLDER).toString();
--- End diff --

Again, do we want an absolute-looking path when we actually have a relative 
path?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145515947
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BaseDirTestWatcher.java ---
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.base.Charsets;
+import org.apache.commons.io.FileUtils;
+import org.apache.drill.common.util.TestTools;
+import org.apache.drill.exec.util.TestUtilities;
+import org.junit.runner.Description;
+
+import java.io.File;
+import java.io.IOException;
+
+public class BaseDirTestWatcher extends DirTestWatcher {
+  public enum DirType {
+ROOT,
+TEST_TMP;
+  }
+
+  private File tmpDir;
+  private File storeDir;
+  private File dfsTestTmpParentDir;
+  private File dfsTestTmpDir;
+  private File rootDir;
+
+  public BaseDirTestWatcher() {
+super();
+  }
+
+  @Override
+  protected void starting(Description description) {
+super.starting(description);
+
+rootDir = makeSubDir("root");
+tmpDir = new File(rootDir, "tmp");
+storeDir = new File(rootDir, "store");
+dfsTestTmpParentDir = new File(rootDir, "dfsTestTmp");
+
+tmpDir.mkdirs();
+storeDir.mkdirs();
+dfsTestTmpParentDir.mkdirs();
+  }
+
+  public File getTmpDir() {
+return tmpDir;
+  }
+
+  public File getStoreDir() {
+return storeDir;
+  }
+
+  public File getDfsTestTmpParentDir() {
+return dfsTestTmpParentDir;
+  }
+
+  public File getDfsTestTmpDir() {
+return dfsTestTmpDir;
+  }
+
+  public String getDfsTestTmpDirPath() {
+return dfsTestTmpDir.getAbsolutePath().replaceAll("/\\./", "/");
+  }
+
+  public File getRootDir() {
+return rootDir;
+  }
+
+  public String getRootDirPath() {
+return rootDir.getAbsolutePath().replaceAll("/\\./", "/");
+  }
+
+  public void newDfsTestTmpDir() {
+dfsTestTmpDir = 
TestUtilities.createTempDir(BaseTestQuery.dirTestWatcher.getDfsTestTmpParentDir());
+  }
+
+  private File getDir(DirType type) {
+switch (type) {
+  case ROOT:
+return rootDir;
+  case TEST_TMP:
+return dfsTestTmpDir;
+  default:
+throw new IllegalArgumentException(String.format("Unsupported type 
%s", type));
+}
+  }
+
+  public File makeRootSubDir(String relPath) {
--- End diff --

Here is where we might have a bit of a discussion. Should we represent 
directory paths as:

* Strings (as done here)
* Java NIO `Path` objects (which Java seems to want to be the new standard)
* Hadoop `Path` objects (since that is what the Drill file system uses)
* Good old-fashioned Java `File` objects (which, while old, work well in 
common cases)

I would suggest we use `File` for references to file system paths:

* The Java `Path` has the same name as the Hadoop `Path`, causing endless 
confusion.
* Hadoop `Path` is overkill for simple uses in tests.
* `String` encourages people to write their own code to join paths by 
concatenating.

The use of `File` keeps things simple and standard. Concatenation is simple.

Other solutions are, of course, possible. We could use `String` and 
replicate the methods on `File` or either of the `Path` classes, say. The point 
is, let's pick a standard and use it everywhere.
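
As a quick illustration of the `File` option (plain JDK, no assumptions beyond 
`java.io.File`):

```
import java.io.File;

File root = new File("/tmp/drill/test");
File tableDir = new File(root, "newColumns");   // join without string math
File dataFile = new File(tableDir, "a.json");   // nests naturally
```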


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145516134
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BaseDirTestWatcher.java ---
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.base.Charsets;
+import org.apache.commons.io.FileUtils;
+import org.apache.drill.common.util.TestTools;
+import org.apache.drill.exec.util.TestUtilities;
+import org.junit.runner.Description;
+
+import java.io.File;
+import java.io.IOException;
+
+public class BaseDirTestWatcher extends DirTestWatcher {
+  public enum DirType {
+ROOT,
+TEST_TMP;
+  }
+
+  private File tmpDir;
+  private File storeDir;
+  private File dfsTestTmpParentDir;
+  private File dfsTestTmpDir;
+  private File rootDir;
+
+  public BaseDirTestWatcher() {
+super();
+  }
+
+  @Override
+  protected void starting(Description description) {
+super.starting(description);
+
+rootDir = makeSubDir("root");
+tmpDir = new File(rootDir, "tmp");
+storeDir = new File(rootDir, "store");
+dfsTestTmpParentDir = new File(rootDir, "dfsTestTmp");
+
+tmpDir.mkdirs();
+storeDir.mkdirs();
+dfsTestTmpParentDir.mkdirs();
+  }
+
+  public File getTmpDir() {
+return tmpDir;
+  }
+
+  public File getStoreDir() {
+return storeDir;
+  }
+
+  public File getDfsTestTmpParentDir() {
+return dfsTestTmpParentDir;
+  }
+
+  public File getDfsTestTmpDir() {
+return dfsTestTmpDir;
+  }
+
+  public String getDfsTestTmpDirPath() {
+return dfsTestTmpDir.getAbsolutePath().replaceAll("/\\./", "/");
+  }
+
+  public File getRootDir() {
+return rootDir;
+  }
+
+  public String getRootDirPath() {
+return rootDir.getAbsolutePath().replaceAll("/\\./", "/");
+  }
+
+  public void newDfsTestTmpDir() {
+dfsTestTmpDir = 
TestUtilities.createTempDir(BaseTestQuery.dirTestWatcher.getDfsTestTmpParentDir());
+  }
+
+  private File getDir(DirType type) {
+switch (type) {
+  case ROOT:
+return rootDir;
+  case TEST_TMP:
+return dfsTestTmpDir;
+  default:
+throw new IllegalArgumentException(String.format("Unsupported type 
%s", type));
+}
+  }
+
+  public File makeRootSubDir(String relPath) {
+return makeSubDir(relPath, DirType.ROOT);
+  }
+
+  public File makeTestTmpSubDir(String relPath) {
+return makeSubDir(relPath, DirType.TEST_TMP);
+  }
+
+  private File makeSubDir(String relPath, DirType type) {
+File subDir = new File(getDir(type), relPath);
+subDir.mkdirs();
+return subDir;
+  }
+
+  /**
+   * This preserves the relative path of the directory in root
+   * @param relPath
+   * @return
+   */
+  public File copyResourceToRoot(String relPath) {
+return copyTo(relPath, relPath, TestTools.DataType.RESOURCE, 
DirType.ROOT);
+  }
+
+  public File copyFileToRoot(String relPath) {
+return copyTo(relPath, relPath, TestTools.DataType.PROJECT, 
DirType.ROOT);
+  }
+
+  public File copyResourceToRoot(String relPath, String destPath) {
+return copyTo(relPath, destPath, TestTools.DataType.RESOURCE, 
DirType.ROOT);
+  }
+
+  public File copyResourceToTestTmp(String relPath, String destPath) {
+return copyTo(relPath, destPath, TestTools.DataType.RESOURCE, 
DirType.TEST_TMP);
+  }
+
+  private File copyTo(String relPath, String destPath, TestTools.DataType 
dataType, DirType dirType) {
+File file = TestTools.getFile(relPath, dataType);
+
+if (file.isDirectory()) {
+  File subDir = makeSubDir(destPath, dirType);
+  TestTools.copyDirToDest(relPath, subDir, 

[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145523267
  
--- Diff: exec/java-exec/src/test/resources/topN/one_key_sort.json ---
@@ -12,11 +12,11 @@
 pop:"mock-scan",
url: "http://apache.org",
 entries:[
-{records: 1000, types: [
+{records: 10, types: [
--- End diff --

This change reduces the record count by two orders of magnitude. Do we know 
if this results in testing the same functionality as the original test?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145509549
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/TestSimpleExternalSort.java
 ---
@@ -147,7 +152,7 @@ public void sortOneKeyDescendingExternalSortLegacy() 
throws Throwable {
   }
 
   private void sortOneKeyDescendingExternalSort(boolean testLegacy) throws 
Throwable {
-FixtureBuilder builder = ClusterFixture.builder()
+FixtureBuilder builder = ClusterFixture.builder(dirTestWatcher)
--- End diff --

Nice, but I might move this to a method:

```
FixtureBuilder builder = ClusterFixture.builder()
.withTempDir(dirTestWatcher)
.configProperty(...) ...
```

Reason: if this has to go into the constructor, then every constructor use 
must change, and tests must create a temp directory even when not needed. But, 
if passed in using a builder method, then we can use it regardless of how we 
create the builder object, and the directory is optional.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145519174
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ---
@@ -213,7 +204,7 @@ private void configureZk(FixtureBuilder builder) {
 }
   }
 
-  private void createConfig(FixtureBuilder builder) throws Exception {
+  private void createConfig() throws Exception {
--- End diff --

There are actually three ZK cases:

* Does not use ZK at all.
* Uses ZK, but it is local (mini ZK started within the test JVM)
* Uses ZK, but ZK is remote. (Seldom used in tests, needed for an embedded 
Drillbit.)

Are these semantics preserved with the present changes?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145516858
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BatchUtils.java ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.vector.ValueVector;
+import org.junit.Assert;
+
+import java.io.UnsupportedEncodingException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+
+public class BatchUtils {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(BatchUtils.class);
+
+  public static Map 
containerToObjects(VectorContainer vectorContainer) {
+Map rows = Maps.newHashMap();
+int numCols = vectorContainer.getNumberOfColumns();
+int recordCount = vectorContainer.getRecordCount();
+
+for (int columnIndex = 0; columnIndex < numCols; columnIndex++) {
+  String columnName = 
vectorContainer.getSchema().getColumn(columnIndex).getName();
+  List data = Lists.newArrayList();
+
+  ValueVector.Accessor valueVectorAccessor = vectorContainer
+.getValueVector(columnIndex)
+.getValueVector()
+.getAccessor();
+
+  for (int recordIndex = 0; recordIndex < recordCount; recordIndex++) {
+data.add(valueVectorAccessor.getObject(recordIndex));
+  }
+
+  rows.put(columnName, data);
+}
+
+return rows;
+  }
+
+  public static Map 
hyperBatchAndSelectorToObjects(VectorContainer vectorContainer, 
SelectionVector4 selectionVector4) {
--- End diff --

See `HyperRowSet`


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145521268
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ---
@@ -584,11 +492,14 @@ public static void defineWorkspace(Drillbit drillbit, 
String pluginName,
   public static final String EXPLAIN_PLAN_TEXT = "text";
   public static final String EXPLAIN_PLAN_JSON = "json";
 
-  public static FixtureBuilder builder() {
-FixtureBuilder builder = new FixtureBuilder()
+  public static FixtureBuilder builder(DirTestWatcher dirTestWatcher) {
--- End diff --

See note above about passing in the test watcher via a method instead of 
the constructor.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145297931
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TopN/TopNBatchTest.java
 ---
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.TopN;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Properties;
+import java.util.Random;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.test.TestBuilder;
+import org.apache.drill.categories.OperatorTest;
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.common.expression.FieldReference;
+import org.apache.drill.common.logical.data.Order;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.Types;
+import org.apache.drill.exec.compile.ClassBuilder;
+import org.apache.drill.exec.compile.CodeCompiler;
+import org.apache.drill.exec.expr.fn.FunctionImplementationRegistry;
+import org.apache.drill.exec.memory.RootAllocator;
+import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
+import org.apache.drill.exec.pop.PopUnitTestBase;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.ExpandableHyperContainer;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.server.options.OptionSet;
+import org.apache.drill.test.ClientFixture;
+import org.apache.drill.test.ClusterFixture;
+import org.apache.drill.test.FixtureBuilder;
+import org.apache.drill.test.OperatorFixture;
+import org.apache.drill.test.BatchUtils;
+import org.apache.drill.test.DirTestWatcher;
+import org.apache.drill.test.rowSet.RowSetBuilder;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+@Category(OperatorTest.class)
+public class TopNBatchTest extends PopUnitTestBase {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TopNBatchTest.class);
+
+  // Allows you to look at generated code after tests execute
+  @Rule
+  public DirTestWatcher dirTestWatcher = new DirTestWatcher(false);
+
+  /**
+   * Priority queue unit test.
+   * @throws Exception
+   */
+  @Test
+  public void priorityQueueOrderingTest() throws Exception {
+Properties properties = new Properties();
+properties.setProperty(ClassBuilder.CODE_DIR_OPTION, 
dirTestWatcher.getDirPath());
+
+DrillConfig drillConfig = DrillConfig.create(properties);
+OptionSet optionSet = new OperatorFixture.TestOptionSet();
+
+FieldReference expr = FieldReference.getWithQuotedRef("colA");
+Order.Ordering ordering = new 
Order.Ordering(Order.Ordering.ORDER_DESC, expr, Order.Ordering.NULLS_FIRST);
+List orderings = Lists.newArrayList(ordering);
+
+MaterializedField colA = MaterializedField.create("colA", 
Types.required(TypeProtos.MinorType.INT));
+MaterializedField colB = MaterializedField.create("colB", 
Types.required(TypeProtos.MinorType.INT));
+
+List cols = Lists.newArrayList(colA, colB);
+BatchSchema batchSchema = new 
BatchSchema(BatchSchema.SelectionVectorMode.NONE, cols);
+
+try (RootAllocator allocator = new RootAllocator(100_000_000)) {
+  VectorContainer expectedVectors = new RowSetBuilder(allocator, 
batchSchema)
+.add(110, 10)
+.add(109, 9)
+.add(108, 8)
+.add(107, 7)
+.add(106, 6)
+.add(105, 5)
+.add(104, 4)
+.add(103, 3)
+.add(102, 2)
+.add(101, 1)
+.build()
+.container();
+
+  Map expectedTable = 
BatchUtils.containerToObjects(expectedVectors);
+  expectedVectors.clear();
+
+  PriorityQueue queue;
 

[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145520129
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/ClusterFixture.java ---
@@ -226,43 +217,16 @@ private void createConfig(FixtureBuilder builder) 
throws Exception {
 
   serviceSet = null;
   usesZk = true;
-  isLocal = false;
 } else {
   // Embedded Drillbit.
 
   serviceSet = RemoteServiceSet.getLocalServiceSet();
-  isLocal = true;
 }
   }
 
-  private void startDrillbits(FixtureBuilder builder) throws Exception {
-//// Ensure that Drill uses the log directory determined here rather 
than
-//// it's hard-coded defaults. WIP: seems to be needed some times but
-//// not others.
-//
-//String logDir = null;
-//if (builder.tempDir != null) {
-//  logDir = builder.tempDir.getAbsolutePath();
-//}
-//if (logDir == null) {
-//  logDir = config.getString(ExecConstants.DRILL_TMP_DIR);
-//  if (logDir != null) {
-//logDir += "/drill/log";
-//  }
-//}
-//if (logDir == null) {
-//  logDir = "/tmp/drill";
-//}
-//new File(logDir).mkdirs();
-//System.setProperty("drill.log-dir", logDir);
-
-dfsTestTempDir = makeTempDir("dfs-test");
-
-// Clean up any files that may have been left from the
-// last run.
-
-preserveLocalFiles = builder.preserveLocalFiles;
-removeLocalFiles();
+  private void startDrillbits() throws Exception {
+dfsTestTempDir = new File(getRootDir(), "dfs-test");
+dfsTestTempDir.mkdirs();
--- End diff --

Is the "preserve local files" behavior maintained? Let me explain.

In automated tests, files are created, used, and discarded.

But, we also use this framework for ad-hoc tests when debugging. In this 
case, we create files, end the test, and examine the files by hand. Examples: 
logs, output files, query profiles, etc.

So, for this test fixture (but not for `BaseTestQuery`), we want two modes:

* Normal mode: as you have implemented.
* "Preserve local files" mode: files are written to `/tmp/drill` (or some 
other fixed, well-known location, can be within the `target` folder) and 
preserved past test exit.

A method on the `FixtureBuilder` identifies which mode is desired. We turn 
on the "preserve" mode for ad-hoc tests.
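
Something along the lines of the (currently commented-out) call already present 
in `TestHashAggrSpill`, using the pre-change builder API:

```
// Ad-hoc debugging mode: keep profiles and local files past test exit.
// Method names are those already present on FixtureBuilder.
FixtureBuilder builder = ClusterFixture.builder()
    .saveProfiles()
    .keepLocalFiles();
```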

For more info on this framework, see [this 
description](https://github.com/paul-rogers/drill/wiki/Cluster-Fixture-Framework).


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145508960
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/xsort/TestExternalSort.java
 ---
@@ -138,34 +141,34 @@ public void testNewColumnsManaged() throws Exception {
 testNewColumns(false);
   }
 
-
   @Test
   public void testNewColumnsLegacy() throws Exception {
 testNewColumns(true);
   }
 
   private void testNewColumns(boolean testLegacy) throws Exception {
 final int record_count = 1;
-String dfs_temp = getDfsTestTmpSchemaLocation();
-System.out.println(dfs_temp);
-File table_dir = new File(dfs_temp, "newColumns");
-table_dir.mkdir();
-try (BufferedOutputStream os = new BufferedOutputStream(new 
FileOutputStream(new File(table_dir, "a.json" {
-  String format = "{ a : %d, b : %d }%n";
-  for (int i = 0; i <= record_count; i += 2) {
-os.write(String.format(format, i, i).getBytes());
-  }
+final String tableDirName = "newColumns";
+
+TableFileBuilder tableA = new TableFileBuilder(Lists.newArrayList("a", 
"b"), Lists.newArrayList("%d", "%d"));
--- End diff --

Clever -- but how do we handle types? The original code created JSON of the 
form:

```
{ a : 10, b : 20 }
```

Aside from the fact that the labels are not true JSON (not quoted), the 
mechanism does not work for strings, which need to be quoted. The mechanism 
here assumes numbers. But, of course, if we assume numbers, we don't need the 
second argument, the "%d".

If you look in the `ClusterFixture` code, you'll see a method 
`ClusterFixture.stringify()` that converts an Object to a SQL-compatible 
string. We could create a similar one for JSON.

But, if we take a step back, we are creating JSON. Perfectly fine JSON 
builder classes exist that can be used to build up type-aware JSON and render 
the result as a string. The one limitation is that these classes are for single 
documents; they can't handle the non-standard list-of-objects format that Drill 
uses. Still, we can use the per-object classes to build up the list of objects.

Yet another solution is to use the `RowSet` classes which are type aware. 
Yes, they build value vectors, but that is not important here. What is 
important is that they use the full schema power of Drill: including data type, 
repeated, nullable, etc. Then, we just create a RowSet-to-JSON converter. In 
fact, I may have something like that in the "Jig" project I did earlier; I'll 
rummage around and see if I can find it. 
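
As a stop-gap, even a small type-aware formatter would fix the quoting problem 
(a hypothetical `toJsonLiteral`, in the spirit of `ClusterFixture.stringify()`):

```
// Hypothetical helper: renders a Java value as a JSON-compatible literal.
public static String toJsonLiteral(Object value) {
  if (value == null) {
    return "null";
  }
  if (value instanceof String) {
    String s = (String) value;
    // Quote strings and escape embedded backslashes and quotes.
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }
  return value.toString(); // numbers and booleans pass through
}
```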


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145556819
  
--- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java ---
@@ -908,6 +908,15 @@ public void generateTestData(int count) {
 }
 
   <#else> <#-- type.width <= 8 -->
+@Override
+public void add(Object value) {
+  int index = accessor.getValueCount();
+  int valueCount = index + 1;
+  setValueCount(valueCount);
+
+  set(index, (${type.javaType}) value);
--- End diff --

I don't think this will work as you hope it will -- for obscure reasons 
that we've all learned the hard way...

* The accessor does not keep an ongoing count of values. Rather, the value 
count comes from the writer index which is set only at the end of a write.
* This method sets the value count, but doing so is, in general, expensive.
* Vectors in a batch must be kept in sync. This adds a value to one vector, 
but does not coordinate across a set of vectors.
* Not all values can be set by casting. Works for some numeric values, but 
does not handle, say, `add(10)` when the item is a byte or a short, because of 
the need for a two-step cast: `(byte)(int) value` (see the snippet below).
* Does not handle conversions for date or decimal.
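
To make the casting point concrete (plain Java, nothing Drill-specific):

```
Object value = 10;              // autoboxed as Integer
// byte bad = (byte) value;     // throws ClassCastException at runtime:
                                // Integer cannot be cast to Byte
byte ok = (byte)(int) value;    // unbox to int first, then narrow
```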

Yes, this is all a colossal pain in the neck. But, it is the reason that 
the `RowSet` classes were created.

We can talk in person about how to move this idea forward effectively.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145518566
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BatchUtils.java ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.vector.ValueVector;
+import org.junit.Assert;
+
+import java.io.UnsupportedEncodingException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+
+public class BatchUtils {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(BatchUtils.class);
+
+  public static Map 
containerToObjects(VectorContainer vectorContainer) {
+Map rows = Maps.newHashMap();
+int numCols = vectorContainer.getNumberOfColumns();
+int recordCount = vectorContainer.getRecordCount();
+
+for (int columnIndex = 0; columnIndex < numCols; columnIndex++) {
+  String columnName = 
vectorContainer.getSchema().getColumn(columnIndex).getName();
+  List data = Lists.newArrayList();
+
+  ValueVector.Accessor valueVectorAccessor = vectorContainer
+.getValueVector(columnIndex)
+.getValueVector()
+.getAccessor();
+
+  for (int recordIndex = 0; recordIndex < recordCount; recordIndex++) {
+data.add(valueVectorAccessor.getObject(recordIndex));
+  }
+
+  rows.put(columnName, data);
+}
+
+return rows;
+  }
+
+  public static Map 
hyperBatchAndSelectorToObjects(VectorContainer vectorContainer, 
SelectionVector4 selectionVector4) {
+Map rows = Maps.newHashMap();
+int numCols = vectorContainer.getNumberOfColumns();
+int numIndices = selectionVector4.getCount();
+
+for (int columnIndex = 0; columnIndex < numCols; columnIndex++) {
+  String columnName = 
vectorContainer.getSchema().getColumn(columnIndex).getName();
+  List data = Lists.newArrayList();
+
+  VectorWrapper vectorWrapper = 
vectorContainer.getValueVector(columnIndex);
+
+  for (int indexIndex = 0; indexIndex < numIndices; indexIndex++) {
+int sv4Index = selectionVector4.get(indexIndex);
+int batchIndex = SelectionVector4.getBatchIndex(sv4Index);
+int recordIndex = SelectionVector4.getRecordIndex(sv4Index);
+
+ValueVector valueVector = 
vectorWrapper.getValueVectors()[batchIndex];
+Object columnValue = 
valueVector.getAccessor().getObject(recordIndex);
+data.add(columnValue);
+  }
+
+  rows.put(columnName, data);
+}
+
+return rows;
+  }
+
+  public static String toString(Map table) {
+if (table.isEmpty()) {
+  return "[ empty table ]";
+}
+
+List columnNames = Lists.newArrayList(table.keySet());
+Collections.sort(columnNames);
+int numRecords = table.get(columnNames.get(0)).size();
+
+StringBuilder sb = new StringBuilder();
+
+{
+  sb.append("[ ");
+  String separator = "";
+
+  for (String columnName : columnNames) {
+sb.append(separator);
+separator = ", ";
+sb.append(columnName);
+  }
+
+  sb.append(" ]\n");
+}
+
+for (int recordIndex = 0; recordIndex < numRecords; recordIndex++) {
+  sb.append("{");
+  String separator = "";
+
+  for (String columnName : columnNames) {
+sb.append(separator);
+separator = ", ";

[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145505747
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestCorruptParquetDateCorrection.java
 ---
@@ -76,43 +78,50 @@
   //- detecting corrupt values must be deferred to actual data 
page reading
   //- one from 1.4, where there is a proper created-by, but the 
corruption is present
   private static final String MIXED_CORRUPTED_AND_CORRECT_DATES_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/mixed_drill_versions";
+  "/parquet/4203_corrupt_dates/mixed_drill_versions";
   // partitioned with 1.2.0, no certain metadata that these were written 
with Drill
   // the value will be checked to see that they look corrupt and they will 
be corrected
   // by default. Users can use the format plugin option 
autoCorrectCorruptDates to disable
   // this behavior if they have foreign parquet files with valid rare date 
values that are
   // in the similar range as Drill's corrupt values
+  private static final String PARTITIONED_1_2_FOLDER = 
"partitioned_with_corruption_4203_1_2";
   private static final String CORRUPTED_PARTITIONED_DATES_1_2_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/partitioned_with_corruption_4203_1_2";
+  Paths.get("/parquet/4203_corrupt_dates", 
PARTITIONED_1_2_FOLDER).toString();
   // partitioned with 1.4.0, no certain metadata regarding the date 
corruption status.
   // The same detection approach of the corrupt date values as for the 
files partitioned with 1.2.0
+  private static final String PARTITIONED_1_4_FOLDER = 
"partitioned_with_corruption_4203";
   private static final String CORRUPTED_PARTITIONED_DATES_1_4_0_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/partitioned_with_corruption_4203";
+  Paths.get("/parquet/4203_corrupt_dates", 
PARTITIONED_1_4_FOLDER).toString();
   private static final String PARQUET_DATE_FILE_WITH_NULL_FILLED_COLS =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/null_date_cols_with_corruption_4203.parquet";
+  
"/parquet/4203_corrupt_dates/null_date_cols_with_corruption_4203.parquet";
+  private static final String PARTITIONED_1_9_FOLDER = 
"1_9_0_partitioned_no_corruption";
   private static final String CORRECT_PARTITIONED_DATES_1_9_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/1_9_0_partitioned_no_corruption";
+  Paths.get("/parquet/4203_corrupt_dates", 
PARTITIONED_1_9_FOLDER).toString();
   private static final String VARCHAR_PARTITIONED =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/fewtypes_varcharpartition";
+  "/parquet/4203_corrupt_dates/fewtypes_varcharpartition";
   private static final String DATE_PARTITIONED =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/fewtypes_datepartition";
+  "/parquet/4203_corrupt_dates/fewtypes_datepartition";
   private static final String EXCEPTION_WHILE_PARSING_CREATED_BY_META =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/hive1dot2_fewtypes_null";
+  "/parquet/4203_corrupt_dates/hive1dot2_fewtypes_null";
   private static final String CORRECT_DATES_1_6_0_PATH =
-  
"[WORKING_PATH]/src/test/resources/parquet/4203_corrupt_dates/correct_dates_and_old_drill_parquet_writer.parquet";
-  private static final String PARTITIONED_1_2_FOLDER = 
"partitioned_with_corruption_4203_1_2";
+  
"/parquet/4203_corrupt_dates/correct_dates_and_old_drill_parquet_writer.parquet";
   private static final String 
MIXED_CORRUPTED_AND_CORRECT_PARTITIONED_FOLDER = "mixed_partitioned";
 
-
   @BeforeClass
   public static void initFs() throws Exception {
+dirTestWatcher.copyResourceToRoot("/parquet/4203_corrupt_dates");
--- End diff --

Each time I see this I have to remember that this is an absolute path 
relative to some unstated root. That is, although it looks absolute, it is 
actually relative...


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145298677
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggrSpill.java
 ---
@@ -43,146 +44,122 @@
 import static org.junit.Assert.assertTrue;
 
 /**
- *  Test spilling for the Hash Aggr operator (using the mock reader)
+ * Test spilling for the Hash Aggr operator (using the mock reader)
  */
 @Category({SlowTest.class, OperatorTest.class})
 public class TestHashAggrSpill {
 
-private void runAndDump(ClientFixture client, String sql, long 
expectedRows, long spillCycle, long fromSpilledPartitions, long 
toSpilledPartitions) throws Exception {
-String plan = client.queryBuilder().sql(sql).explainJson();
-
-QueryBuilder.QuerySummary summary = 
client.queryBuilder().sql(sql).run();
-if ( expectedRows > 0 ) {
-assertEquals(expectedRows, summary.recordCount());
-}
-// System.out.println(String.format(" \n Results: %,d 
records, %d batches, %,d ms\n ", summary.recordCount(), 
summary.batchCount(), summary.runTimeMs() ) );
-
-//System.out.println("Query ID: " + summary.queryIdString());
-ProfileParser profile = 
client.parseProfile(summary.queryIdString());
-//profile.print();
-List ops = 
profile.getOpsOfType(UserBitShared.CoreOperatorType.HASH_AGGREGATE_VALUE);
-
-assertTrue( ! ops.isEmpty() );
-// check for the first op only
-ProfileParser.OperatorProfile hag0 = ops.get(0);
-long opCycle = 
hag0.getMetric(HashAggTemplate.Metric.SPILL_CYCLE.ordinal());
-assertEquals(spillCycle, opCycle);
-long op_spilled_partitions = 
hag0.getMetric(HashAggTemplate.Metric.SPILLED_PARTITIONS.ordinal());
-assertTrue( op_spilled_partitions >= fromSpilledPartitions && 
op_spilled_partitions <= toSpilledPartitions );
-/* assertEquals(3, ops.size());
-for ( int i = 0; i < ops.size(); i++ ) {
-ProfileParser.OperatorProfile hag = ops.get(i);
-long cycle = 
hag.getMetric(HashAggTemplate.Metric.SPILL_CYCLE.ordinal());
-long num_partitions = 
hag.getMetric(HashAggTemplate.Metric.NUM_PARTITIONS.ordinal());
-long spilled_partitions = 
hag.getMetric(HashAggTemplate.Metric.SPILLED_PARTITIONS.ordinal());
-long mb_spilled = 
hag.getMetric(HashAggTemplate.Metric.SPILL_MB.ordinal());
-System.out.println(String.format("(%d) Spill cycle: %d, num 
partitions: %d, spilled partitions: %d, MB spilled: %d", i,cycle, 
num_partitions, spilled_partitions,
-mb_spilled));
-} */
-}
+  @Rule
+  public final DirTestWatcher dirTestWatcher = new DirTestWatcher();
 
-/**
- *  A template for Hash Aggr spilling tests
- *
- * @throws Exception
- */
-private void testSpill(long maxMem, long numPartitions, long 
minBatches, int maxParallel, boolean fallback ,boolean predict,
-   String sql, long expectedRows, int cycle, int 
fromPart, int toPart) throws Exception {
-LogFixture.LogFixtureBuilder logBuilder = LogFixture.builder()
-  .toConsole()
-  //.logger("org.apache.drill.exec.physical.impl.aggregate", 
Level.INFO)
-  .logger("org.apache.drill", Level.WARN)
-  ;
-
-FixtureBuilder builder = ClusterFixture.builder()
-  .sessionOption(ExecConstants.HASHAGG_MAX_MEMORY_KEY,maxMem)
-  
.sessionOption(ExecConstants.HASHAGG_NUM_PARTITIONS_KEY,numPartitions)
-  
.sessionOption(ExecConstants.HASHAGG_MIN_BATCHES_PER_PARTITION_KEY,minBatches)
-  
.configProperty(ExecConstants.SYS_STORE_PROVIDER_LOCAL_ENABLE_WRITE, false)
-  .sessionOption(PlannerSettings.FORCE_2PHASE_AGGR_KEY,true)
-  // .sessionOption(PlannerSettings.EXCHANGE.getOptionName(), true)
-  .sessionOption(ExecConstants.HASHAGG_FALLBACK_ENABLED_KEY, 
fallback)
-  
.sessionOption(ExecConstants.HASHAGG_USE_MEMORY_PREDICTION_KEY,predict)
-
-  .maxParallelization(maxParallel)
-  .saveProfiles()
-  //.keepLocalFiles()
-  ;
-String sqlStr = sql != null ? sql :  // if null then use this 
default query
-  "SELECT empid_s17, dept_i, branch_i, AVG(salary_i) FROM 
`mock`.`employee_1200K` GROUP BY empid_s17, dept_i, branch_i";
-
-try (LogFixture logs = logBuilder.build();
- ClusterFixture cluster = builder.build();
- ClientFixture client = cluster.clientFixture()) {
-runAndDump(client, sqlStr, expectedRows, cycle, 
fromPart,toPart);
 

[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145292655
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -247,7 +248,7 @@ drill.exec: {
 }
   },
   sort: {
-purge.threshold : 10,
+purge.threshold : 1000,
--- End diff --

Are you reverting your own change here?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144943191
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/CodeGenerator.java ---
@@ -90,12 +89,11 @@
   private String generatedCode;
   private String generifiedCode;
 
-  CodeGenerator(TemplateClassDefinition definition, 
FunctionImplementationRegistry funcRegistry, OptionSet optionManager) {
-this(ClassGenerator.getDefaultMapping(), definition, funcRegistry, 
optionManager);
+  CodeGenerator(TemplateClassDefinition definition, OptionSet 
optionManager) {
+this(ClassGenerator.getDefaultMapping(), definition, optionManager);
--- End diff --

Good analysis. It was a real pain to lug the function registry around. 
Presumably the registry is used elsewhere in code gen; perhaps in the 
expression materializer.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145292776
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestAggNullable.java ---
@@ -21,29 +21,25 @@
 
 import org.apache.drill.categories.OperatorTest;
 import org.apache.drill.common.util.TestTools;
+import org.apache.drill.test.BaseTestQuery;
 import org.junit.Test;
 import org.junit.experimental.categories.Category;
 
 @Category(OperatorTest.class)
-public class TestAggNullable extends BaseTestQuery{
+public class TestAggNullable extends BaseTestQuery {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestAggNullable.class);
 
-  static final String WORKING_PATH = TestTools.getWorkingPath();
-  static final String TEST_RES_PATH = WORKING_PATH + "/src/test/resources";
-
   private static void enableAggr(boolean ha, boolean sa) throws Exception {
 
-test(String.format("alter session set `planner.enable_hashagg` = %s", 
ha ? "true":"false"));
-test(String.format("alter session set `planner.enable_streamagg` = 
%s", sa ? "true":"false"));
+test("alter session set `planner.enable_hashagg` = %s", ha);
+test("alter session set `planner.enable_streamagg` = %s", sa);
--- End diff --

I was about to suggest using the functions created for this purpose. But, 
then I realized those changes are in a PR that has not yet been reviewed...


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145294641
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/TestBugFixes.java 
---
@@ -33,8 +35,11 @@
 @Category(UnlikelyTest.class)
 public class TestBugFixes extends BaseTestQuery {
   private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestBugFixes.class);
-  private static final String WORKING_PATH = TestTools.getWorkingPath();
-  private static final String TEST_RES_PATH = WORKING_PATH + 
"/src/test/resources";
+
+  @BeforeClass
+  public static void setupTestFiles() {
+dirTestWatcher.copyResourceToRoot("/bugs/DRILL-4192");
--- End diff --

Small nit. Presumably `/bugs/DRILL-4192` is a relative path. Should we 
express it in the normal Unix fashion as `bugs/DRILL-4192`? That way, in 
time-honored fashion, a leading slash tells us the path is absolute.


---


[jira] [Created] (DRILL-5891) When Drill runs out of memory for a HashAgg, it should tell the user how much memory to allocate

2017-10-18 Thread Robert Hou (JIRA)
Robert Hou created DRILL-5891:
-

 Summary: When Drill runs out of memory for a HashAgg, it should 
tell the user how much memory to allocate
 Key: DRILL-5891
 URL: https://issues.apache.org/jira/browse/DRILL-5891
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.11.0
Reporter: Robert Hou
Assignee: Pritesh Maker


Query is:
select count(*), max(`filename`) from dfs.`/drill/testdata/hash-agg/data1` 
group by no_nulls_col, nulls_col;

Error is:
Error: RESOURCE ERROR: Not enough memory for internal partitioning and fallback 
mechanism for HashAgg to use unbounded memory is disabled. Either enable 
fallback config drill.exec.hashagg.fallback.enabled using Alter session/system 
command or increase memory limit for Drillbit

From drillbit.log:
{noformat}
2017-10-18 13:30:17,135 [26184629-3f4c-856a-e99e-97cdf0d29321:frag:1:8] TRACE 
o.a.d.e.p.i.aggregate.HashAggregator - Incoming sizer: Actual batch schema & 
sizes {
  no_nulls_col(type: OPTIONAL VARCHAR, count: 1023, std size: 54, actual size: 
130, data size: 132892)
  nulls_col(type: OPTIONAL VARCHAR, count: 1023, std size: 54, actual size: 
112, data size: 113673)
  EXPR$0(type: REQUIRED BIGINT, count: 1023, std size: 8, actual size: 8, data 
size: 8184)
  EXPR$1(type: OPTIONAL VARCHAR, count: 1023, std size: 54, actual size: 18, 
data size: 18414)
  Records: 1023, Total size: 524288, Data size: 273163, Gross row width: 513, 
Net row width: 268, Density: 53%}
2017-10-18 13:30:17,135 [26184629-3f4c-856a-e99e-97cdf0d29321:frag:1:8] TRACE 
o.a.d.e.p.i.aggregate.HashAggregator - 2nd phase. Estimated internal row width: 
166 Values row width: 66 batch size: 12779520  memory limit: 63161283  max 
column width: 50
2017-10-18 13:30:17,139 [26184629-3f4c-856a-e99e-97cdf0d29321:frag:3:2] TRACE 
o.a.d.e.p.impl.common.HashTable - HT allocated 4784128 for varchar of max width 
50
2017-10-18 13:30:17,139 [26184629-3f4c-856a-e99e-97cdf0d29321:frag:1:15] INFO  
o.a.d.e.p.i.aggregate.HashAggregator - User Error Occurred: Not enough memory 
for internal partitioning and fallback mechanism for HashAgg to use unbounded 
memory is disabled. Either enable fallback config 
drill.exec.hashagg.fallback.enabled using Alter session/system command or 
increase memory limit for Drillbit
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: Not enough 
memory for internal partitioning and fallback mechanism for HashAgg to use 
unbounded memory is disabled. Either enable fallback config 
drill.exec.hashagg.fallback.enabled using Alter session/system command or 
increase memory limit for Drillbit
{noformat}

I would recommend that we add a log message that includes the "alter" command 
needed to increase the memory limit, and says how much memory to allocate. 
Otherwise, the user may not know what to do.
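
For example, the message could end with something along these lines (the 
fallback command is taken from the existing message text; the memory option 
shown is only illustrative):

{noformat}
Either increase the memory limit, e.g.
  ALTER SESSION SET `planner.memory.max_query_memory_per_node` = <new limit>
or, as a last resort, enable the unbounded-memory fallback:
  ALTER SYSTEM SET `drill.exec.hashagg.fallback.enabled` = true
{noformat}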

I would also not suggest enabling "drill.exec.hashagg.fallback.enabled" except 
as a last resort.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144943882
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/PriorityQueue.java
 ---
@@ -20,22 +20,58 @@
 import org.apache.drill.exec.compile.TemplateClassDefinition;
 import org.apache.drill.exec.exception.SchemaChangeException;
 import org.apache.drill.exec.memory.BufferAllocator;
-import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
 import org.apache.drill.exec.record.VectorContainer;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
 public interface PriorityQueue {
-  public void add(FragmentContext context, RecordBatchData batch) throws 
SchemaChangeException;
-  public void init(int limit, FragmentContext context, BufferAllocator 
allocator, boolean hasSv2) throws SchemaChangeException;
-  public void generate() throws SchemaChangeException;
-  public VectorContainer getHyperBatch();
-  public SelectionVector4 getHeapSv4();
-  public SelectionVector4 getFinalSv4();
-  public boolean validate();
-  public void resetQueue(VectorContainer container, SelectionVector4 
vector4) throws SchemaChangeException;
-  public void cleanup();
-
-  public static TemplateClassDefinition TEMPLATE_DEFINITION 
= new TemplateClassDefinition(PriorityQueue.class, 
PriorityQueueTemplate.class);
+  /**
+   * The elements in the given batch are added to the priority queue. Note 
that the priority queue
+   * only retains the top elements that fit within the size specified by 
the {@link #init(int, BufferAllocator, boolean)}
+   * method.
+   * @param batch The batch containing elements we want to add.
+   * @throws SchemaChangeException
+   */
+  void add(RecordBatchData batch) throws SchemaChangeException;
 
+  /**
+   * Initializes the priority queue. This method must be called before any 
other methods on the priority
+   * queue are called.
+   * @param limit The size of the priority queue.
+   * @param allocator The {@link BufferAllocator} to use when creating the 
priority queue.
+   * @param hasSv2 True when incoming batches have v2 selection vectors. 
False otherwise.
--- End diff --

Actually, the name "sv2" refers to a 2-byte selection vector. Welcome to 
Drill's cryptic naming...
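
For anyone new to the naming, a tiny standalone sketch of the idea (illustrative only, not Drill's actual SelectionVector2 API): an sv2 holds one unsigned 16-bit row offset per selected row of a single batch, while an sv4 packs a batch index and a record index into 32 bits.

```
// Illustrative only -- not Drill code.
public class Sv2Sketch {
  public static void main(String[] args) {
    int[] batch = new int[1024];      // stands in for a value vector
    for (int i = 0; i < batch.length; i++) {
      batch[i] = i * 10;
    }
    char[] sv2 = {3, 7, 42};          // char = unsigned 16-bit row offsets
    for (char row : sv2) {
      System.out.println(batch[row]); // visit only the selected rows
    }
  }
}
```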


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144946834
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/TopNBatch.java
 ---
@@ -335,20 +336,42 @@ private void purge() throws SchemaChangeException {
 logger.debug("Took {} us to purge", 
watch.elapsed(TimeUnit.MICROSECONDS));
   }
 
-  public PriorityQueue createNewPriorityQueue(FragmentContext context, 
List<Ordering> orderings,
- VectorAccessible 
batch, MappingSet mainMapping, MappingSet leftMapping, MappingSet rightMapping)
-  throws ClassTransformationException, IOException, 
SchemaChangeException{
-CodeGenerator<PriorityQueue> cg = 
CodeGenerator.get(PriorityQueue.TEMPLATE_DEFINITION, 
context.getFunctionRegistry(), context.getOptions());
+  private PriorityQueue createNewPriorityQueue(VectorAccessible batch, int 
limit)
+throws SchemaChangeException, ClassTransformationException, 
IOException {
+return createNewPriorityQueue(
+  mainMapping, leftMapping, rightMapping, context.getOptionSet(), 
context.getFunctionRegistry(), context.getDrillbitContext().getCompiler(),
+  config.getOrderings(), batch, unionTypeEnabled, codegenDump, limit, 
oContext.getAllocator(), schema.getSelectionVectorMode());
+  }
+
+  public static MappingSet createMainMappingSet() {
+return new MappingSet((String) null, null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createLeftMappingSet() {
+return new MappingSet("leftIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createRightMappingSet() {
+return new MappingSet("rightIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static PriorityQueue createNewPriorityQueue(
+MappingSet mainMapping, MappingSet leftMapping, MappingSet 
rightMapping,
--- End diff --

Do we ever use more than one mapping set? If not, can we just refer to the 
constants inside this function?

I see that a simpler form exists. Just curious why we also need the complex 
form. For better testing?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145517411
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/BatchUtils.java ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.vector.ValueVector;
+import org.junit.Assert;
+
+import java.io.UnsupportedEncodingException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Map;
+
+public class BatchUtils {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(BatchUtils.class);
+
+  public static Map<String, List<Object>> 
containerToObjects(VectorContainer vectorContainer) {
+Map<String, List<Object>> rows = Maps.newHashMap();
+int numCols = vectorContainer.getNumberOfColumns();
+int recordCount = vectorContainer.getRecordCount();
+
+for (int columnIndex = 0; columnIndex < numCols; columnIndex++) {
+  String columnName = 
vectorContainer.getSchema().getColumn(columnIndex).getName();
+  List<Object> data = Lists.newArrayList();
+
+  ValueVector.Accessor valueVectorAccessor = vectorContainer
+.getValueVector(columnIndex)
+.getValueVector()
+.getAccessor();
+
+  for (int recordIndex = 0; recordIndex < recordCount; recordIndex++) {
+data.add(valueVectorAccessor.getObject(recordIndex));
+  }
+
+  rows.put(columnName, data);
+}
+
+return rows;
+  }
+
+  public static Map<String, List<Object>> 
hyperBatchAndSelectorToObjects(VectorContainer vectorContainer, 
SelectionVector4 selectionVector4) {
+Map<String, List<Object>> rows = Maps.newHashMap();
+int numCols = vectorContainer.getNumberOfColumns();
+int numIndices = selectionVector4.getCount();
+
+for (int columnIndex = 0; columnIndex < numCols; columnIndex++) {
+  String columnName = 
vectorContainer.getSchema().getColumn(columnIndex).getName();
+  List<Object> data = Lists.newArrayList();
+
+  VectorWrapper<?> vectorWrapper = 
vectorContainer.getValueVector(columnIndex);
+
+  for (int indexIndex = 0; indexIndex < numIndices; indexIndex++) {
+int sv4Index = selectionVector4.get(indexIndex);
+int batchIndex = SelectionVector4.getBatchIndex(sv4Index);
+int recordIndex = SelectionVector4.getRecordIndex(sv4Index);
+
+ValueVector valueVector = 
vectorWrapper.getValueVectors()[batchIndex];
+Object columnValue = 
valueVector.getAccessor().getObject(recordIndex);
+data.add(columnValue);
+  }
+
+  rows.put(columnName, data);
+}
+
+return rows;
+  }
+
+  public static String toString(Map<String, List<Object>> table) {
+if (table.isEmpty()) {
+  return "[ empty table ]";
+}
+
+List columnNames = Lists.newArrayList(table.keySet());
+Collections.sort(columnNames);
+int numRecords = table.get(columnNames.get(0)).size();
+
+StringBuilder sb = new StringBuilder();
+
+{
+  sb.append("[ ");
+  String separator = "";
+
+  for (String columnName : columnNames) {
+sb.append(separator);
+separator = ", ";
+sb.append(columnName);
+  }
+
+  sb.append(" ]\n");
+}
+
+for (int recordIndex = 0; recordIndex < numRecords; recordIndex++) {
+  sb.append("{");
+  String separator = "";
+
+  for (String columnName : columnNames) {
+sb.append(separator);
+separator = ", ";

[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145292056
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/TopNBatch.java
 ---
@@ -335,20 +336,42 @@ private void purge() throws SchemaChangeException {
 logger.debug("Took {} us to purge", 
watch.elapsed(TimeUnit.MICROSECONDS));
   }
 
-  public PriorityQueue createNewPriorityQueue(FragmentContext context, 
List<Ordering> orderings,
- VectorAccessible 
batch, MappingSet mainMapping, MappingSet leftMapping, MappingSet rightMapping)
-  throws ClassTransformationException, IOException, 
SchemaChangeException{
-CodeGenerator<PriorityQueue> cg = 
CodeGenerator.get(PriorityQueue.TEMPLATE_DEFINITION, 
context.getFunctionRegistry(), context.getOptions());
+  private PriorityQueue createNewPriorityQueue(VectorAccessible batch, int 
limit)
+throws SchemaChangeException, ClassTransformationException, 
IOException {
+return createNewPriorityQueue(
+  mainMapping, leftMapping, rightMapping, context.getOptionSet(), 
context.getFunctionRegistry(), context.getDrillbitContext().getCompiler(),
+  config.getOrderings(), batch, unionTypeEnabled, codegenDump, limit, 
oContext.getAllocator(), schema.getSelectionVectorMode());
+  }
+
+  public static MappingSet createMainMappingSet() {
+return new MappingSet((String) null, null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createLeftMappingSet() {
+return new MappingSet("leftIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static MappingSet createRightMappingSet() {
+return new MappingSet("rightIndex", null, 
ClassGenerator.DEFAULT_SCALAR_MAP, ClassGenerator.DEFAULT_SCALAR_MAP);
+  }
+
+  public static PriorityQueue createNewPriorityQueue(
+MappingSet mainMapping, MappingSet leftMapping, MappingSet 
rightMapping,
--- End diff --

One thought worth sharing here. Today, our code cache uses source code as 
the key to look for an existing compiled version of a class. But, this is 
inefficient: we have to generate the source code, transform it, and store two 
copies to act as keys.

Better would be to have a "class definition" object that holds all 
parameters needed to generate the code. Store these in the cache. Likely, these 
will be much smaller than the generated code. Using these avoids the need to 
generate the code to see if we already have a copy of that code.

The class definition idea works as long as the definition is a closed set: 
everything needed to generate the code is in that one definition object.

Your function here is close: it passes in a closed set of items. The open 
bits are the expression and system/session options. For options, we can copy 
the few values we need into our definition. The expression might require a bit 
more thought to capture.

So, if we can evolve your idea further, we can get a potential large 
per-schema performance improvement by avoiding code generation when doing a 
repeat query.

Too much to fix here; but something to think about moving forward.
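
A minimal sketch of that "class definition" key (hypothetical names, not existing Drill code): an immutable value object whose equals/hashCode cover everything code generation reads, so the cache can be probed without producing any source.

```
import java.util.List;
import java.util.Objects;

// Hypothetical sketch: a closed-set cache key for generated classes.
public final class CodeGenKey {
  private final Class<?> template;          // e.g. PriorityQueueTemplate.class
  private final List<String> orderingExprs; // expressions in canonical form
  private final boolean unionTypeEnabled;   // copied option values, not the manager

  public CodeGenKey(Class<?> template, List<String> orderingExprs,
                    boolean unionTypeEnabled) {
    this.template = template;
    this.orderingExprs = orderingExprs;
    this.unionTypeEnabled = unionTypeEnabled;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof CodeGenKey)) {
      return false;
    }
    CodeGenKey that = (CodeGenKey) o;
    return unionTypeEnabled == that.unionTypeEnabled
        && template.equals(that.template)
        && orderingExprs.equals(that.orderingExprs);
  }

  @Override
  public int hashCode() {
    return Objects.hash(template, orderingExprs, unionTypeEnabled);
  }
}
```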


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145294064
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestAltSortQueries.java ---
@@ -19,24 +19,33 @@
 
 import org.apache.drill.categories.OperatorTest;
 import org.apache.drill.categories.SqlTest;
+import org.apache.drill.test.BaseTestQuery;
+import org.junit.BeforeClass;
 import org.junit.Test;
 import org.junit.experimental.categories.Category;
 
 @Category({SqlTest.class, OperatorTest.class})
-public class TestAltSortQueries extends BaseTestQuery{
+public class TestAltSortQueries extends BaseTestQuery {
   static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestAltSortQueries.class);
 
+  @BeforeClass
+  public static void setupTestFiles() {
+dirTestWatcher.copyFileToRoot("sample-data/region.parquet");
+dirTestWatcher.copyFileToRoot("sample-data/regionsSF");
+dirTestWatcher.copyFileToRoot("sample-data/nation.parquet");
+  }
+
   @Test
   public void testOrderBy() throws Exception{
 test("select R_REGIONKEY " +
- "from dfs_test.`[WORKING_PATH]/../../sample-data/region.parquet` 
" +
+ "from dfs_test.`/sample-data/region.parquet` " +
--- End diff --

I like this improvement. Here we created files just for this one test. You 
are using a workspace to point to those files. Doing so avoids the need to 
doctor up the strings.

Is `dfs_test` set up to point to the per-query root directory? Seems to be.

Since I have to ask these questions, would be good to explain in the 
`...drill.test` `package-info.java` file how to set up temporary files and the 
mapping of workspaces to these directories. That will allow others to readily 
use the mechanism. Otherwise, people will guess, and will go back to using what 
is familiar. See `ExampleTest.java` for an example of a "starter" file to help 
people get started.

In fact, maybe we would add a new example test to that file that shows how 
to create a test-specific data file and query that file. (That would sure help 
me...)
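
Something along these lines, perhaps (a sketch assembled from the diff above; it assumes BaseTestQuery maps dfs_test onto the watcher's root directory, as this PR appears to do):

```
import org.apache.drill.test.BaseTestQuery;
import org.junit.BeforeClass;
import org.junit.Test;

public class ExampleFileTest extends BaseTestQuery {
  @BeforeClass
  public static void setupTestFiles() {
    // Copy a resource into the per-test root that dfs_test points at.
    dirTestWatcher.copyFileToRoot("sample-data/region.parquet");
  }

  @Test
  public void queryCopiedFile() throws Exception {
    test("select R_REGIONKEY from dfs_test.`/sample-data/region.parquet`");
  }
}
```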


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144944018
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/PriorityQueue.java
 ---
@@ -20,22 +20,58 @@
 import org.apache.drill.exec.compile.TemplateClassDefinition;
 import org.apache.drill.exec.exception.SchemaChangeException;
 import org.apache.drill.exec.memory.BufferAllocator;
-import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
 import org.apache.drill.exec.record.VectorContainer;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
 public interface PriorityQueue {
-  public void add(FragmentContext context, RecordBatchData batch) throws 
SchemaChangeException;
-  public void init(int limit, FragmentContext context, BufferAllocator 
allocator, boolean hasSv2) throws SchemaChangeException;
-  public void generate() throws SchemaChangeException;
-  public VectorContainer getHyperBatch();
-  public SelectionVector4 getHeapSv4();
-  public SelectionVector4 getFinalSv4();
-  public boolean validate();
-  public void resetQueue(VectorContainer container, SelectionVector4 
vector4) throws SchemaChangeException;
-  public void cleanup();
-
-  public static TemplateClassDefinition<PriorityQueue> TEMPLATE_DEFINITION 
= new TemplateClassDefinition<PriorityQueue>(PriorityQueue.class, 
PriorityQueueTemplate.class);
+  /**
+   * The elements in the given batch are added to the priority queue. Note 
that the priority queue
+   * only retains the top elements that fit within the size specified by 
the {@link #init(int, BufferAllocator, boolean)}
+   * method.
+   * @param batch The batch containing elements we want to add.
+   * @throws SchemaChangeException
+   */
+  void add(RecordBatchData batch) throws SchemaChangeException;
 
+  /**
+   * Initializes the priority queue. This method must be called before any 
other methods on the priority
+   * queue are called.
+   * @param limit The size of the priority queue.
+   * @param allocator The {@link BufferAllocator} to use when creating the 
priority queue.
+   * @param hasSv2 True when incoming batches have v2 selection vectors. 
False otherwise.
+   * @throws SchemaChangeException
+   */
+  void init(int limit, BufferAllocator allocator, boolean hasSv2) throws 
SchemaChangeException;
+
+  /**
+   * This method must be called before fetching the final heap hyper batch 
and final Sv4 vector.
--- End diff --

Ah, I see. Thanks.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145293197
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestAggNullable.java ---
@@ -61,10 +57,8 @@ public void testHashAggNullableColumns() throws 
Exception {
 
   @Test  // StreamingAgg on nullable columns
   public void testStreamAggNullableColumns() throws Exception {
-String query1 = String.format("select t2.b2 from 
dfs_test.`%s/jsoninput/nullable2.json` t2 " +
-" group by t2.b2", TEST_RES_PATH);
-String query2 = String.format("select t2.a2, t2.b2 from 
dfs_test.`%s/jsoninput/nullable2.json` t2 " +
-" group by t2.a2, t2.b2", TEST_RES_PATH);
+String query1 = "select t2.b2 from cp.`/jsoninput/nullable2.json` t2 
group by t2.b2";
+String query2 = "select t2.a2, t2.b2 from 
cp.`/jsoninput/nullable2.json` t2 group by t2.a2, t2.b2";
--- End diff --

Very nice improvement; this has been bugging me for months. Why doctor up 
the query string when we have a perfectly fine workspace mechanism?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145292505
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/selection/SelectionVector4.java
 ---
@@ -116,6 +116,14 @@ public void clear() {
 }
   }
 
+  public static int getBatchIndex(int sv4Index) {
+return (sv4Index >> 16) & 0xFFFF;
+  }
+
+  public static int getRecordIndex(int sv4Index) {
+return (sv4Index) & 0xFFFF;
+  }
--- End diff --

I see you gave into the temptation to wrap these magic expressions in 
methods rather than repeating them over and over in generated code. The worry 
probably was that the overhead of a method call per value would be expensive. 
Did you do a bit of performance testing to demonstrate that Java inlines these 
methods so we get the same performance with this code as the original?
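
For reference, the round trip these helpers implement, as a standalone sketch (the masks mirror the code above):

```
// Standalone illustration: an sv4 index packs the batch index into the high
// 16 bits and the record index into the low 16 bits of a single int.
public class Sv4IndexSketch {
  static int pack(int batchIndex, int recordIndex) {
    return (batchIndex << 16) | (recordIndex & 0xFFFF);
  }

  static int getBatchIndex(int sv4Index) {
    return (sv4Index >> 16) & 0xFFFF;
  }

  static int getRecordIndex(int sv4Index) {
    return sv4Index & 0xFFFF;
  }

  public static void main(String[] args) {
    int packed = pack(5, 1234);
    System.out.println(getBatchIndex(packed));  // 5
    System.out.println(getRecordIndex(packed)); // 1234
  }
}
```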


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145297229
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/TestWindowFunctions.java ---
@@ -559,7 +563,7 @@ public void testRankWithGroupBy() throws Exception {
 
   @Test // DRILL-3404
   public void testWindowSumAggIsNotNull() throws Exception {
-String query = String.format("select count(*) cnt from (select sum ( 
c1 ) over ( partition by c2 order by c1 asc nulls first ) w_sum from 
dfs.`%s/window/table_with_nulls.parquet` ) sub_query where w_sum is not null", 
TEST_RES_PATH);
+String query = "select count(*) cnt from (select sum ( c1 ) over ( 
partition by c2 order by c1 asc nulls first ) w_sum from 
dfs.`/window/table_with_nulls.parquet` ) sub_query where w_sum is not null";
--- End diff --

Right about now I have to give you a huge THANKS! This is a huge amount of 
clean-up you've done. These are really good improvements. Really appreciate the 
tedious work to get this all done.


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144945381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/TopN/PriorityQueue.java
 ---
@@ -20,22 +20,58 @@
 import org.apache.drill.exec.compile.TemplateClassDefinition;
 import org.apache.drill.exec.exception.SchemaChangeException;
 import org.apache.drill.exec.memory.BufferAllocator;
-import org.apache.drill.exec.ops.FragmentContext;
 import org.apache.drill.exec.physical.impl.sort.RecordBatchData;
 import org.apache.drill.exec.record.VectorContainer;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
 public interface PriorityQueue {
-  public void add(FragmentContext context, RecordBatchData batch) throws 
SchemaChangeException;
-  public void init(int limit, FragmentContext context, BufferAllocator 
allocator, boolean hasSv2) throws SchemaChangeException;
-  public void generate() throws SchemaChangeException;
-  public VectorContainer getHyperBatch();
-  public SelectionVector4 getHeapSv4();
-  public SelectionVector4 getFinalSv4();
-  public boolean validate();
-  public void resetQueue(VectorContainer container, SelectionVector4 
vector4) throws SchemaChangeException;
-  public void cleanup();
-
-  public static TemplateClassDefinition<PriorityQueue> TEMPLATE_DEFINITION 
= new TemplateClassDefinition<PriorityQueue>(PriorityQueue.class, 
PriorityQueueTemplate.class);
+  /**
+   * The elements in the given batch are added to the priority queue. Note 
that the priority queue
+   * only retains the top elements that fit within the size specified by 
the {@link #init(int, BufferAllocator, boolean)}
+   * method.
+   * @param batch The batch containing elements we want to add.
+   * @throws SchemaChangeException
+   */
+  void add(RecordBatchData batch) throws SchemaChangeException;
 
+  /**
+   * Initializes the priority queue. This method must be called before any 
other methods on the priority
+   * queue are called.
+   * @param limit The size of the priority queue.
+   * @param allocator The {@link BufferAllocator} to use when creating the 
priority queue.
+   * @param hasSv2 True when incoming batches have v2 selection vectors. 
False otherwise.
+   * @throws SchemaChangeException
+   */
+  void init(int limit, BufferAllocator allocator, boolean hasSv2) throws 
SchemaChangeException;
+
+  /**
+   * This method must be called before fetching the final heap hyper batch 
and final Sv4 vector.
+   * @throws SchemaChangeException
+   */
+  void generate() throws SchemaChangeException;
+
+  /**
+   * Retrieves the final priority queue HyperBatch containing the results. 
Note: this should be called
+   * after {@link #generate()}.
+   * @return The final priority queue HyperBatch containing the results.
+   */
+  VectorContainer getHyperBatch();
+
+  SelectionVector4 getSv4();
+
+  /**
+   * Retrieves the selection vector used to select the elements in the 
priority queue from the hyper batch
+   * provided by the {@link #getHyperBatch()} method. Note: this 
should be called after {@link #generate()}.
+   * @return The selection vector used to select the elements in the 
priority queue.
+   */
+  SelectionVector4 getFinalSv4();
--- End diff --

Taking a step back... Suppose I do a top 100. Also, suppose my batches are 
big, say 100 MB each. Also, suppose I have fiendishly organized my data so that 
the top 100 values are the first values of 100 batches.

If the priority queue uses an SV4, then it means that the queue holds onto 
the entire batch that holds the set of top values. In my case, since the top 
100 values occur across 100 batches, I'm holding onto 100 of batches of 100 MB 
each for a total of 10 GB of memory.

Is this how it works?

Do we need to think about compaction at some point? Once we have buffered, 
say, five batches, we copy the values to a new, much smaller, batch and discard 
the inputs. We repeat this over and over to keep the number of buffered batches 
under control.

Once we control batch size (in bytes; we already control records), we'll 
have the means to set a memory budget for TopN, then use consolidation to stay 
within that budget.

Or, is this how the code already works?
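
The consolidation idea in sketch form (hypothetical names, not the existing TopN code; a single int column stands in for a record batch): once more than N batches are buffered, copy just the rows the heap still references into one compact batch and release the originals.

```
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of periodic compaction for a TopN buffer.
public class TopNCompactionSketch {
  private static final int MAX_BUFFERED = 5;
  private final List<int[]> batches = new ArrayList<>();
  private final List<Integer> heapRefs = new ArrayList<>(); // (batch << 16) | record

  void buffer(int[] batch, List<Integer> selectedRows) {
    int batchIndex = batches.size();
    batches.add(batch);
    for (int row : selectedRows) {
      heapRefs.add((batchIndex << 16) | row);
    }
    if (batches.size() > MAX_BUFFERED) {
      compact();
    }
  }

  private void compact() {
    // Copy only the rows the heap still references into one compact batch,
    // then drop the large input batches so their memory can be released.
    int[] compacted = new int[heapRefs.size()];
    for (int i = 0; i < heapRefs.size(); i++) {
      int ref = heapRefs.get(i);
      compacted[i] = batches.get(ref >>> 16)[ref & 0xFFFF];
      heapRefs.set(i, i); // all refs now point at batch 0, record i
    }
    batches.clear();
    batches.add(compacted);
  }
}
```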


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r145294835
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/TestCTASPartitionFilter.java ---
@@ -59,48 +58,48 @@ public void withDistribution() throws Exception {
 test("alter session set `planner.slice_target` = 1");
 test("alter session set `store.partition.hash_distribute` = true");
 test("use dfs_test.tmp");
-test(String.format("create table orders_distribution partition by 
(o_orderpriority) as select * from dfs_test.`%s/multilevel/parquet`", 
TEST_RES_PATH));
+test("create table orders_distribution partition by (o_orderpriority) 
as select * from dfs_test.`/multilevel/parquet`");
 String query = "select * from orders_distribution where 
o_orderpriority = '1-URGENT'";
-testExcludeFilter(query, 1, "Filter", 24);
+testExcludeFilter(query, 1, "Filter\\(", 24);
--- End diff --

Why the extra characters?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144936994
  
--- Diff: common/src/test/java/org/apache/drill/test/DirTestWatcher.java ---
@@ -32,23 +32,50 @@
 public class DirTestWatcher extends TestWatcher {
   private String dirPath;
   private File dir;
+  private boolean deleteDirAtEnd = true;
--- End diff --

Thanks for the Javadoc added previously. Perhaps add a section on how to 
use this class. If I need a test directory, do I create an instance of this 
class? Or, does JUnit do it for me? If done automagically, how do I get a copy 
of the directory?

If this is explained in JUnit, perhaps just reference that information. 
Even better would be a short example:

> To get a test directory:
```
// Do something here
File myDir = // do something
```
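
For instance, a guess at the intended usage (the accessor name is hypothetical; JUnit drives the watcher through the @Rule/@ClassRule lifecycle):

```
import java.io.File;
import org.junit.ClassRule;
import org.junit.Test;

public class MyDirTest {
  // JUnit creates the directory before the tests run and (by default)
  // deletes it afterwards; no manual setup is needed.
  @ClassRule
  public static final DirTestWatcher dirTestWatcher = new DirTestWatcher();

  @Test
  public void writesIntoTestDir() {
    File myDir = dirTestWatcher.getDir(); // hypothetical accessor
    // ... create test files under myDir ...
  }
}
```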


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144936263
  
--- Diff: common/src/test/java/org/apache/drill/test/SubDirTestWatcher.java 
---
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.test;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import org.junit.rules.TestWatcher;
+import org.junit.runner.Description;
+
+import java.io.File;
+import java.util.List;
+
+public class SubDirTestWatcher extends TestWatcher {
--- End diff --

Maybe a Javadoc comment to explain what this does? From the name, it must 
have something to do with a temporary sub dir... But, how does this relate to 
the DirTestWatcher?


---


[GitHub] drill pull request #984: DRILL-5783 Made a unit test for generated Priority ...

2017-10-18 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/984#discussion_r144935089
  
--- Diff: common/src/main/java/org/apache/drill/common/util/TestTools.java 
---
@@ -17,15 +17,28 @@
  */
 package org.apache.drill.common.util;
 
+import java.io.File;
+import java.io.IOException;
 import java.nio.file.Paths;
 
+import org.apache.commons.io.FileUtils;
 import org.junit.rules.TestName;
 import org.junit.rules.TestRule;
 import org.junit.rules.Timeout;
 
+import static org.apache.drill.common.util.TestTools.DataType.PROJECT;
+import static org.apache.drill.common.util.TestTools.DataType.RESOURCE;
+
 public class TestTools {
   // private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(TestTools.class);
 
+  public enum DataType {
--- End diff --

`ResourceType`? `FileType`? `FileSource`? `ResourceScope`?

This is a symbolic reference to a resource root, not really a kind of a 
data item...


---


[jira] [Created] (DRILL-5890) Tests Leak Many Open File Descriptors

2017-10-18 Thread Timothy Farkas (JIRA)
Timothy Farkas created DRILL-5890:
-

 Summary: Tests Leak Many Open File Descriptors
 Key: DRILL-5890
 URL: https://issues.apache.org/jira/browse/DRILL-5890
 Project: Apache Drill
  Issue Type: Bug
Reporter: Timothy Farkas
Assignee: salim achouche


Salim and I have discovered that the tests leak many open file descriptors and 
the tests can hang with even a 64k open file limit. Also doing an lsof 
periodically shows the number of open files steadily grows over time as the 
tests run. Fixing this would likely speed up the unit tests and prevent 
developers from scratching their heads about why the tests are hanging or 
throwing "Too many open files" exceptions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DRILL-5889) sqlline loses RPC connection when executing a query with HashAgg

2017-10-18 Thread Robert Hou (JIRA)
Robert Hou created DRILL-5889:
-

 Summary: sqlline loses RPC connection when executing a query with 
HashAgg
 Key: DRILL-5889
 URL: https://issues.apache.org/jira/browse/DRILL-5889
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.11.0
Reporter: Robert Hou


Query is:
{noformat}
alter session set `planner.memory.max_query_memory_per_node` = 10737418240;
select count(*), max(`filename`) from dfs.`/drill/testdata/hash-agg/data1` 
group by no_nulls_col, nulls_col;
{noformat}

Error is:
{noformat}
0: jdbc:drill:drillbit=10.10.100.190> select count(*), max(`filename`) from 
dfs.`/drill/testdata/hash-agg/data1` group by no_nulls_col, nulls_col;
Error: CONNECTION ERROR: Connection /10.10.100.190:45776 <--> 
/10.10.100.190:31010 (user client) closed unexpectedly. Drillbit down?


[Error Id: db4aea70-11e6-4e63-b0cc-13cdba0ee87a ] (state=,code=0)
{noformat}

From drillbit.log:
2017-10-18 14:04:23,044 [UserServer-1] INFO  o.a.drill.exec.rpc.user.UserServer 
- RPC connection /10.10.100.190:31010 <--> /10.10.100.190:45776 (user server) 
timed out.  Timeout was set to 30 seconds. Closing connection.

Plan is:
{noformat}
| 00-00Screen
00-01  Project(EXPR$0=[$0], EXPR$1=[$1])
00-02UnionExchange
01-01  Project(EXPR$0=[$2], EXPR$1=[$3])
01-02HashAgg(group=[{0, 1}], EXPR$0=[$SUM0($2)], EXPR$1=[MAX($3)])
01-03  Project(no_nulls_col=[$0], nulls_col=[$1], EXPR$0=[$2], 
EXPR$1=[$3])
01-04HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
02-01  UnorderedMuxExchange
03-01Project(no_nulls_col=[$0], nulls_col=[$1], 
EXPR$0=[$2], EXPR$1=[$3], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($1, 
hash32AsDouble($0, 1301011))])
03-02  HashAgg(group=[{0, 1}], EXPR$0=[COUNT()], 
EXPR$1=[MAX($2)])
03-03Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=maprfs:///drill/testdata/hash-agg/data1]], 
selectionRoot=maprfs:/drill/testdata/hash-agg/data1, numFiles=1, 
usedMetadataFile=false, columns=[`no_nulls_col`, `nulls_col`, `filename`]]])
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: describe query support? (catalog metadata, etc)

2017-10-18 Thread Charles Givre
I’d like to second Alfredo’s request.  I’ve been trying to get Drill to work 
with some open source visualization tools such as SqlPad and Metabase and the 
issue I keep running into is that Drill doesn’t have a convenient way to 
describe how it interprets flat files.  This is really frustrating for me since 
this is my main use of Drill!  
I wish the SELECT * FROM  LIMIT 0 worked in the RESTful interface.  In 
any event, it would be very useful to have some way to get Drill to describe 
how it will interpret a flat file.
— C
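
For what it's worth, over JDBC the LIMIT 0 trick can be driven like this (a sketch; the connection URL and file path are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DescribeViaLimitZero {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         // LIMIT 0 returns no rows but still carries the inferred schema.
         ResultSet rs = stmt.executeQuery(
             "SELECT * FROM dfs.`/path/to/file.csv` LIMIT 0")) {
      ResultSetMetaData md = rs.getMetaData();
      for (int i = 1; i <= md.getColumnCount(); i++) {
        System.out.println(md.getColumnName(i) + " : "
            + md.getColumnTypeName(i));
      }
    }
  }
}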


> On Oct 18, 2017, at 15:20, Chun Chang  wrote:
> 
> There were discussions on the need of building a catalog for drill. But I 
> don't think that's the focus right now. And I am not sure the community will 
> ever decide to go in that direction. For now, you best bet is to create views 
> on top of your JSON/CSV data.
> 
> 
> From: Alfredo Serafini 
> Sent: Wednesday, October 18, 2017 8:31:15 AM
> To: u...@drill.apache.org
> Subject: describe query support? (catalog metadata, etc)
> 
> Hi I'm experimenting using Drill as a data virtualization component via
> JDBC and it generally works great for my needs.
> 
> However, some of the components connected via JDBC need basic
> metadata/catalog information, and it seems to be missing for JSON / CSV
> sources.
> 
> For example the simple query
> 
> DESCRIBE cp.`employee.json`;
> 
> returns no results.
> 
> Another possible example is reading from an SQLite source
> containing the same data in an `employees` table:
> DESCRIBE `employees`
> 
> and still getting no information: while this command is not directly supported
> in SQLite, an equivalent one would be, for instance:
> PRAGMA table_info(`employees`);
> 
> but trying to execute it in Drill is not possible, as it is beyond the
> supported standard SQL dialect.
> 
> Moreover using a query like:
> SELECT *
> FROM INFORMATION_SCHEMA.COLUMNS
> WHERE (TABLE_NAME='employees_view');
> 
> on a view from the same data, seems to return the information, so I
> suppose there should be a way to pass that information to an
> internal DatabaseMetaData implementation.
> I wonder if there is such a component designed to manage all the catalog
> information for different sources?
> 
> In this case it could adopt different strategies for retrieving metadata,
> depending on the source: for SQLite a different command/dialect could be
> used; for CSV, types could be guessed using simple heuristics; and so on.
> Cases like JSON would probably be much more complex, anyway.
> Once the metadata have been retrieved for a source, I suppose the standard
> SQL dialect should work as expected.
> 
> 
> Are there any plans to add catalog metadata support for various sources?
> Does anybody have some workaround? for example using views or similar
> approaches?
> 
> 
> thanks in advance, sorry if the message is too long :-)
> Alfredo



[GitHub] drill pull request #1002: DRILL-5888: Remove dependency of SSLConfig on hado...

2017-10-18 Thread parthchandra
GitHub user parthchandra opened a pull request:

https://github.com/apache/drill/pull/1002

DRILL-5888: Remove dependency of SSLConfig on hadoop.security package…

…. This allows jdbc-all-jar to be built without hadoop dependencies.

The original  code used Hadoop's SSLFactory.Mode enum. The same enum is now 
replicated in 
the SSLConfig class. 

@arina-ielchiieva  please review 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/parthchandra/drill DRILL-5888

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1002.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1002


commit cded98dafa6fc548d140b15c36ecda1b560fb75c
Author: Parth Chandra 
Date:   2017-10-18T00:14:20Z

DRILL-5888: Remove dependency of SSLConfig on hadoop.security package. This 
allows jdbc-all-jar to be built without hadoop dependencies




---


[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...

2017-10-18 Thread sachouche
GitHub user sachouche opened a pull request:

https://github.com/apache/drill/pull/1001

JIRA DRILL-5879: Like operator performance improvements

- Recently, custom code has been added to handle common search patterns 
(Like operator)
- Contains, Starts With, and Ends With
- Custom code is way faster than the generic RegEx based implementation
- This pull request is another attempt to improve the Contains pattern 
since it is more CPU intensive

Query: select  from table where colA like '%a%' or colA like '%xyz%';

Improvement opportunities:
- Avoid the isAscii computation (a full scan of the input string) since we're 
dealing with the same column twice
- Optimize the "contains" for-loop

Implementation details:
1) Added a new integer variable "asciiMode" to the VarCharHolder class. The 
default value is -1, which indicates this info is not known; otherwise it is 
set to 1 or 0 based on the string being ASCII or Unicode. The execution plan 
already shares the same VarCharHolder instance for all evaluations of the same 
column value, so asciiMode is set during the first LIKE evaluation and reused 
across the other LIKE evaluations.
2) The "contains" LIKE operation is quite expensive, as the code needs to 
access the input string to perform character-based comparisons. Created 4 
versions of the same for-loop to a) make the loop simpler to optimize 
(vectorization) and b) minimize comparisons.

Benchmarks (lineitem table, 100 GB):
Query: select l_returnflag, count from dfs.`` where l_comment not 
like '%a%' or l_comment like '%the%' group by l_returnflag
Before changes: 33 sec
After changes:  27 sec
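
In sketch form, the kind of specialized "contains" scan being described might look like this (illustrative only, not the actual generated Drill code):

```
// Illustrative only: a branch-light "contains" scan over UTF-8 bytes, the
// sort of loop the description says was split into specialized versions.
public class ContainsSketch {
  static boolean contains(byte[] input, byte[] pattern) {
    outer:
    for (int start = 0; start <= input.length - pattern.length; start++) {
      for (int i = 0; i < pattern.length; i++) {
        if (input[start + i] != pattern[i]) {
          continue outer;
        }
      }
      return true; // every pattern byte matched at this start offset
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(contains("not like the others".getBytes(),
                                "the".getBytes()));
  }
}
```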


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachouche/drill yodlee-cherry-pick

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1001.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1001


commit c2b05b2e8665daf3f7b43d49c428539b3753595f
Author: Salim Achouche 
Date:   2017-10-18T18:40:18Z

JIRA 5879: Like operator performance improvements




---


[GitHub] drill issue #999: DRILL-5881:Java Client: [Threat Modeling] Drillbit may be ...

2017-10-18 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/999
  
+1. 
You need a newline at the end :)


---


[GitHub] drill pull request #997: DRILL-5882: C++ Client: [Threat Modeling] Drillbit ...

2017-10-18 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/997#discussion_r145510774
  
--- Diff: contrib/native/client/src/clientlib/drillClientImpl.cpp ---
@@ -595,6 +611,12 @@ connectionStatus_t 
DrillClientImpl::validateHandshake(DrillUserProperties* prope
 
 switch(this->m_handshakeStatus) {
 case exec::user::SUCCESS:
+// Check if client needs auth/encryption and server is not 
requiring it
+if(clientNeedsAuthentication(properties) || 
clientNeedsEncryption(properties)) {
--- End diff --

Also, do you need two different messages: auth required and encryption required?


---


Re: Date Conversion Question

2017-10-18 Thread Charles Givre
Hi Julian, 
Alas, this doesn’t work in Drill since Drill uses Joda time formats.  However, 
you got me thinking about this and I actually got it to work w/o using the 
substring or other weird string manipulation functions.

SELECT to_timestamp ('2017-08-10T09:12:26.000Z', 
'yyyy-MM-dd''T''hh:mm:ss.SSS''Z''') FROM (VALUES(1))

Apparently, the double single quotes act as an escape character for plain text 
in the format string.  We really should make either the CAST() or the 
TO_TIMESTAMP a little easier to use as this is really counter-intuitive.
— C
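
The same pattern can be sanity-checked outside Drill with Joda-Time directly; within a Java string the literal is single-quoted once (the doubling above is only SQL string escaping), and HH is used here for the 24-hour clock:

import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class JodaLiteralCheck {
  public static void main(String[] args) {
    // 'T' and 'Z' are quoted literals, so the trailing Z is matched as
    // plain text rather than interpreted as a time zone.
    DateTimeFormatter fmt =
        DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
    DateTime dt = fmt.parseDateTime("2017-08-10T09:12:26.000Z");
    System.out.println(dt); // parsed in the default zone
  }
}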



> On Oct 18, 2017, at 12:47, Julian Hyde  wrote:
> 
> A question on StackOverflow asks how to do this using Oracle’s TO_TIMESTAMP 
> function, and there is a solution[1]. So, I tried
> 
>  SELECT to_timestamp ('2017-08-10T09:12:26.000Z', 
> '-MM-DD"T"HH24:MI:SS.FF3"Z"')
>  FROM DUAL
> 
> on http://rextester.com/l/oracle_online_compiler and it worked.
> 
> I presume Drill’s TO_TIMESTAMP is based on Oracle’s. In which case 
> let’s fix TO_TIMESTAMP.
> 
> Julian
> 
> [1] 
> https://stackoverflow.com/questions/26671557/convert-string-iso-8601-date-to-oracles-timestamp-datatype
>  
> 
>  
> 
>> On Oct 18, 2017, at 7:57 AM, Bob Rudis  wrote:
>> 
>> FWIW I was doing very similar substring (etc) machinations until we
>> started converting output from back-end data-generation tools directly
>> into parquet (using other tools). IMO it's a common enough format (at
>> least in the types of data you and I likely have to work with :-) that
>> it'd be great if there was direct support for it. If there is, I also
>> missed it and would also be most appreciative of which incantations to
>> use to take advantage of it.
>> 
>> On Wed, Oct 18, 2017 at 10:49 AM, Charles Givre  wrote:
>>> Hello Drillers,
>>> I have a silly question which I’m a little stuck with.  I have some data in 
>>> CSV format with dates in the following format:  2017-08-10T09:12:26.000Z.  
>>> I’m trying to convert this into a date time data field so that I have both 
>>> the date and the hours, however I keep running into road blocks.   I’ve 
>>> tried the CAST( field AS DATE ) but in doing so I lose the time component.  
>>> I’ve tried the TO_TIMESTAMP function, however the only success I’ve had is 
>>> using the substring function to remove the timezone at the end, then use 
>>> regex_replace to get rid of the literal ’T’ in the middle of the string, 
>>> then TO_NUMBER.  (See query below)
>>> 
>>> SELECT TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 
>>> 'yyyy-MM-dd HH:mm:ss'  ) AS dt,
>>> EXTRACT(
>>> hour FROM
>>> TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 
>>> 'yyyy-MM-dd HH:mm:ss'  )
>>> ) AS dt_hour
>>> 
>>> I’d think you could do this directly with the TO_TIMESTAMP function 
>>> however, I can’t figure out how include the literal ’T’ in the formatting 
>>> string.  The escape character seems to be the single quote which also is 
>>> the only character allowed to denote the formatting string.
>>> 
>>> So, questions:
>>> 1.  Is there any way to include a literal character in a joda date format?
>>> 2.  Is it possible to use any character besides a single quote to mark the 
>>> beginning/end of a format string?
>>> 3.  Are there any ways to do this that I’m missing?
>>> 
>>> Thanks!
>>> —C
>>> 
> 



Re: Date Conversion Question

2017-10-18 Thread Julian Hyde
A question on StackOverflow asks how to do this using Oracle’s TO_TIMESTAMP 
function, and there is a solution[1]. So, I tried

  SELECT to_timestamp ('2017-08-10T09:12:26.000Z', 
'-MM-DD"T"HH24:MI:SS.FF3"Z"')
  FROM DUAL

on http://rextester.com/l/oracle_online_compiler and it worked.

I presume Drill’s TO_TIMESTAMP is based on Oracle’s. In which case 
let’s fix TO_TIMESTAMP.

Julian
 
[1] 
https://stackoverflow.com/questions/26671557/convert-string-iso-8601-date-to-oracles-timestamp-datatype
 

 

> On Oct 18, 2017, at 7:57 AM, Bob Rudis  wrote:
> 
> FWIW I was doing very similar substring (etc) machinations until we
> started converting output from back-end data-generation tools directly
> into parquet (using other tools). IMO it's a common enough format (at
> least in the types of data you and I likely have to work with :-) that
> it'd be great if there was direct support for it. If there is, I also
> missed it and would also be most appreciative of which incantations to
> use to take advantage of it.
> 
> On Wed, Oct 18, 2017 at 10:49 AM, Charles Givre  wrote:
>> Hello Drillers,
>> I have a silly question which I’m a little stuck with.  I have some data in 
>> CSV format with dates in the following format:  2017-08-10T09:12:26.000Z.  
>> I’m trying to convert this into a date time data field so that I have both 
>> the date and the hours, however I keep running into road blocks.   I’ve 
>> tried the CAST( field AS DATE ) but in doing so I lose the time component.  
>> I’ve tried the TO_TIMESTAMP function, however the only success I’ve had is 
>> using the substring function to remove the timezone at the end, then use 
>> regex_replace to get rid of the literal ’T’ in the middle of the string, 
>> then TO_NUMBER.  (See query below)
>> 
>> SELECT TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 
>> 'yyyy-MM-dd HH:mm:ss'  ) AS dt,
>> EXTRACT(
>> hour FROM
>> TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 'yyyy-MM-dd 
>> HH:mm:ss'  )
>> ) AS dt_hour
>> 
>> I’d think you could do this directly with the TO_TIMESTAMP function however, 
>> I can’t figure out how include the literal ’T’ in the formatting string.  
>> The escape character seems to be the single quote which also is the only 
>> character allowed to denote the formatting string.
>> 
>> So, questions:
>> 1.  Is there any way to include a literal character in a joda date format?
>> 2.  Is it possible to use any character besides a single quote to mark the 
>> beginning/end of a format string?
>> 3.  Are there any ways to do this that I’m missing?
>> 
>> Thanks!
>> —C
>> 



[GitHub] drill issue #904: DRILL-5717: change some date time test cases with specific...

2017-10-18 Thread vvysotskyi
Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/904
  
@weijietong which problems did you have when you tried to mock the 
`System.getProperty()` method?
As an example of a nice mock of this method, you may use [this 
test](https://github.com/apache/drill/pull/936/files#diff-b053496903bff0dc154e42604b93b9efR60).
Also, besides the timezone, the locale property should be mocked.
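
A dependency-free alternative, in sketch form (plain JUnit, no mocking library): pin and restore the default timezone and locale around the test.

```
import java.util.Locale;
import java.util.TimeZone;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class FixedZoneLocaleTest {
  private TimeZone savedZone;
  private Locale savedLocale;

  @Before
  public void pinEnvironment() {
    savedZone = TimeZone.getDefault();
    savedLocale = Locale.getDefault();
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
    Locale.setDefault(Locale.US);
  }

  @After
  public void restoreEnvironment() {
    TimeZone.setDefault(savedZone);
    Locale.setDefault(savedLocale);
  }

  @Test
  public void dateBehaviorIsDeterministic() {
    // ... date/time assertions that no longer depend on the host machine ...
  }
}
```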


---


[GitHub] drill issue #1000: DRILL-5845: Columns returned by select with "ORDER BY" an...

2017-10-18 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1000
  
+1, LGTM.


---


[GitHub] drill pull request #1000: DRILL-5845: Columns returned by select with "ORDER...

2017-10-18 Thread vdiravka
GitHub user vdiravka opened a pull request:

https://github.com/apache/drill/pull/1000

DRILL-5845: Columns returned by select with "ORDER BY" and "LIMIT" cl…

…auses are not in correct order.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vdiravka/drill DRILL-5845

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1000


commit 1847a65a322ed0f4b38dd202a79b8f2078ca1a11
Author: Vitalii Diravka 
Date:   2017-10-09T11:27:58Z

DRILL-5845: Columns returned by select with "ORDER BY" and "LIMIT" clauses 
are not in correct order.




---


Re: Date Conversion Question

2017-10-18 Thread Bob Rudis
FWIW I was doing very similar substring (etc) machinations until we
started converting output from back-end data-generation tools directly
into parquet (using other tools). IMO it's a common enough format (at
least in the types of data you and I likely have to work with :-) that
it'd be great if there was direct support for it. If there is, I also
missed it and would also be most appreciative of which incantations to
use to take advantage of it.

On Wed, Oct 18, 2017 at 10:49 AM, Charles Givre  wrote:
> Hello Drillers,
> I have a silly question which I’m a little stuck with.  I have some data in 
> CSV format with dates in the following format:  2017-08-10T09:12:26.000Z.  
> I’m trying to convert this into a date time data field so that I have both 
> the date and the hours, however I keep running into road blocks.   I’ve tried 
> the CAST( field AS DATE ) but in doing so I lose the time component.  I’ve 
> tried the TO_TIMESTAMP function, however the only success I’ve had is using 
> the substring function to remove the timezone at the end, then use 
> regex_replace to get rid of the literal ’T’ in the middle of the string, then 
> TO_NUMBER.  (See query below)
>
> SELECT TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 
> 'yyyy-MM-dd HH:mm:ss'  ) AS dt,
> EXTRACT(
> hour FROM
> TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 'yyyy-MM-dd 
> HH:mm:ss'  )
>  ) AS dt_hour
>
> I’d think you could do this directly with the TO_TIMESTAMP function however, 
> I can’t figure out how include the literal ’T’ in the formatting string.  The 
> escape character seems to be the single quote which also is the only 
> character allowed to denote the formatting string.
>
> So, questions:
> 1.  Is there any way to include a literal character in a joda date format?
> 2.  Is it possible to use any character besides a single quote to mark the 
> beginning/end of a format string?
> 3.  Are there any ways to do this that I’m missing?
>
> Thanks!
> —C
>


Date Conversion Question

2017-10-18 Thread Charles Givre
Hello Drillers, 
I have a silly question which I’m a little stuck with.  I have some data in CSV 
format with dates in the following format:  2017-08-10T09:12:26.000Z.  I’m 
trying to convert this into a date time data field so that I have both the date 
and the hours, however I keep running into road blocks.   I’ve tried the CAST( 
field AS DATE ) but in doing so I lose the time component.  I’ve tried the 
TO_TIMESTAMP function, however the only success I’ve had is using the substring 
function to remove the timezone at the end, then use regex_replace to get rid 
of the literal ’T’ in the middle of the string, then TO_NUMBER.  (See query 
below)

SELECT TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 
'yyyy-MM-dd HH:mm:ss'  ) AS dt,
EXTRACT( 
hour FROM
TO_TIMESTAMP( REGEXP_REPLACE( SUBSTR( , 1,19), 'T', ' '), 'yyyy-MM-dd 
HH:mm:ss'  )
 ) AS dt_hour

I’d think you could do this directly with the TO_TIMESTAMP function however, I 
can’t figure out how include the literal ’T’ in the formatting string.  The 
escape character seems to be the single quote which also is the only character 
allowed to denote the formatting string.  

So, questions:
1.  Is there any way to include a literal character in a joda date format?
2.  Is it possible to use any character besides a single quote to mark the 
beginning/end of a format string?
3.  Are there any ways to do this that I’m missing?

Thanks!
—C