[GitHub] vrozov commented on a change in pull request #1361: DRILL-6579: Added sanity checks to the Parquet reader to avoid infini…

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1361: DRILL-6579: Added sanity 
checks to the Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#discussion_r201574276
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBulkPageReader.java
 ##
 @@ -111,6 +112,8 @@ final void set(PageDataInfo pageInfoInput, boolean clear) {
   }
 
   final VarLenColumnBulkEntry getEntry(int valuesToRead) {
+Preconditions.checkArgument(valuesToRead > 0, "Number values to read cannot be zero");
 
 Review comment:
   Consider `String.format("Number of values to read %d must be positive", valuesToRead);`
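   For illustration, a minimal runnable sketch of the suggested message (hedged: the class and method below are stand-ins for Drill's `VarLenBulkPageReader.getEntry(int)`, and Guava's `Preconditions` is assumed on the classpath):
   
   ```
   import com.google.common.base.Preconditions;
   
   public class CheckArgumentSketch {
     // Illustrative stand-in for VarLenBulkPageReader.getEntry(int valuesToRead).
     static void getEntry(int valuesToRead) {
       Preconditions.checkArgument(valuesToRead > 0,
           String.format("Number of values to read %d must be positive", valuesToRead));
       // ... bulk-read the requested values ...
     }
   
     public static void main(String[] args) {
       getEntry(16);  // passes the check
       getEntry(-1);  // throws IllegalArgumentException: "Number of values to read -1 must be positive"
     }
   }
   ```
   
   Note that `String.format` builds the message eagerly even when the check passes; Guava's `checkArgument(boolean, String, Object...)` overload formats its `%s` placeholders only on failure, which may matter on a hot path.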




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201527041
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.cpp
 ##
 @@ -211,6 +211,21 @@ ChannelContext* ChannelFactory::getChannelContext(channelType_t t, DrillUserProp
 }
 
 pChannelContext = new SSLChannelContext(props, tlsVersion, verifyMode);
+
+if (props->isPropSet(USERPROP_CUSTOM_SSLCTXOPTIONS)){
 
 Review comment:
   This is an optional setting that the client application can set on top of the default. Also, depending on the version of Boost the client is compiled against, some configurations are not available. So the user of this function should be aware of the limitations of the Boost version the client is compiled against and set the flags accordingly.




[GitHub] vrozov commented on a change in pull request #1361: DRILL-6579: Added sanity checks to the Parquet reader to avoid infini…

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1361: DRILL-6579: Added sanity 
checks to the Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#discussion_r201572784
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/BatchReader.java
 ##
 @@ -40,8 +40,11 @@ public int readBatch() throws Exception {
 ColumnReader<?> firstColumnStatus = readState.getFirstColumnReader();
 int currBatchNumRecords = readState.batchSizerMgr().getCurrentRecordsPerBatch();
 long recordsToRead = Math.min(currBatchNumRecords, readState.getRemainingValuesToRead());
-int readCount = readRecords(firstColumnStatus, recordsToRead);
+int readCount = 0;
 
 Review comment:
   use `?`: `int readCount = recordsToRead > 0 ? ...`
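   For illustration, a runnable sketch of the suggested conditional (hedged: the names mirror the diff above, but the types are simplified stand-ins, not Drill's actual classes):
   
   ```
   public class ReadCountSketch {
     // Simplified stand-in for BatchReader.readRecords(firstColumnStatus, recordsToRead).
     static int readRecords(Object firstColumnStatus, long recordsToRead) {
       return (int) recordsToRead;
     }
   
     public static void main(String[] args) {
       Object firstColumnStatus = null;
       long recordsToRead = 0;
       // The suggestion: fold the guard and the call into one expression,
       // so readRecords() is never invoked when there is nothing to read.
       int readCount = recordsToRead > 0 ? readRecords(firstColumnStatus, recordsToRead) : 0;
       System.out.println("readCount = " + readCount);  // prints 0
     }
   }
   ```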




[GitHub] sohami commented on a change in pull request #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1373: DRILL-6517: Hash-Join: If 
not OK, exit early from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373#discussion_r201563764
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ##
 @@ -258,6 +251,13 @@ protected boolean prefetchFirstBatchFromBothSides() {
   return false;
 
 Review comment:
   AFAIK NONE is not supposed to have a valid incoming container associated with it.




[GitHub] Ben-Zvi commented on issue #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
Ben-Zvi commented on issue #1373: DRILL-6517: Hash-Join: If not OK, exit early 
from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373#issuecomment-404023176
 
 
   Thanks @ppadma ; let's have @ilooner look at this too, since he made some of the relevant changes (e.g., *getRecordCount()* for *RemovingRecordBatch*).
   The initial thesis was that this bug was triggering DRILL-6453 (query fails after 2h11m). Unfortunately, with this fix it now seems to be the other way around (that failure caused those STOP outcomes).




[GitHub] sachouche commented on issue #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
sachouche commented on issue #1355: DRILL-6560: Enhanced the batch statistics 
logging enablement
URL: https://github.com/apache/drill/pull/1355#issuecomment-404015591
 
 
   @bitblender and @ilooner 
   I have responded to and implemented Karthik's feedback. Can you please approve before the Drill 1.14 deadline?
   
   Thanks!




[GitHub] ppadma commented on issue #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
ppadma commented on issue #1373: DRILL-6517: Hash-Join: If not OK, exit early 
from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373#issuecomment-404013626
 
 
   +1. LGTM. Thanks for fixing this.




[GitHub] superbstreak commented on issue #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on issue #1366: [DRILL-6581] C++ Client SSL 
Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#issuecomment-404013127
 
 
   @sohami updated. Please review.




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201539246
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/BitToUserConnectionIterator.java
 ##
 @@ -99,14 +102,18 @@ public void remove() {
 
   public static class ConnectionInfo {
 public String user;
+@Nonnull
 
 Review comment:
   Thanks for pointing out the availability of these annotations. Didn't want 
to reinvent the wheel. :)




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201539183
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/RecordDataType.java
 ##
 @@ -47,17 +48,22 @@
* @return the constructed type
*/
   public final RelDataType getRowType(RelDataTypeFactory factory) {
-final List<SqlTypeName> types = getFieldSqlTypeNames();
+final List<Pair<SqlTypeName, Boolean>> types = getFieldSqlTypeNames();
 
 Review comment:
   There are just 2 lists: `names` and `types`.
   The third list (`fields`) is used as `factory.createStructType(fields, names)`.
   I'm reluctant to change the abstract methods because they might have usage beyond the current SystemTable context.
   




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201533730
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/RecordDataType.java
 ##
 @@ -49,15 +55,25 @@
   public final RelDataType getRowType(RelDataTypeFactory factory) {
 final List<SqlTypeName> types = getFieldSqlTypeNames();
 final List<String> names = getFieldNames();
+final List<Boolean> nullables = getFieldNullability();
 final List<RelDataType> fields = Lists.newArrayList();
-for (final SqlTypeName typeName : types) {
+Iterator<SqlTypeName> typesIter = types.listIterator();
+Iterator<Boolean> nullabilityIter = nullables.listIterator();
 
 Review comment:
   This is, in fact, the gist of the comment. While it is understandable to want to leave the existing code as-is, adding more lists to the existing parallel lists is not an improvement: it has caused a semi-wrong initial implementation to become overly cumbersome.
   
   If the implementation is local to this class or module, then leaving the code cleaner after your change than when you found it is generally a good thing.




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201533864
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/RecordDataType.java
 ##
 @@ -47,17 +48,22 @@
* @return the constructed type
*/
   public final RelDataType getRowType(RelDataTypeFactory factory) {
-final List<SqlTypeName> types = getFieldSqlTypeNames();
+final List<Pair<SqlTypeName, Boolean>> types = getFieldSqlTypeNames();
 
 Review comment:
   Better, but we still have three correlated lists. Can you combine the four 
fields into a single struct (class) and have a single list?




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201533958
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/sys/BitToUserConnectionIterator.java
 ##
 @@ -99,14 +102,18 @@ public void remove() {
 
   public static class ConnectionInfo {
 public String user;
+@Nonnull
 
 Review comment:
   Nice use of the Java native annotation!




[GitHub] kkhatua commented on issue #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on issue #1371: DRILL-6588: Make Sys tables of nullable 
datatypes
URL: https://github.com/apache/drill/pull/1371#issuecomment-404005272
 
 
   @paul-rogers I've made the changes to use Java's `@Nonnull` annotation.
   Please review the rest of the changes.




[GitHub] Ben-Zvi commented on a change in pull request #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
Ben-Zvi commented on a change in pull request #1373: DRILL-6517: Hash-Join: If 
not OK, exit early from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373#discussion_r201531287
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ##
 @@ -258,6 +251,13 @@ protected boolean prefetchFirstBatchFromBothSides() {
   return false;
 
 Review comment:
   The case of only one side receiving NONE is legit -- for RIGHT/LEFT/FULL OUTER joins, where the other side is still returned.
   As for calling update() on a NONE -- probably any setting of NONE also initializes the container's record count to zero.
   




[GitHub] Ben-Zvi commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
Ben-Zvi commented on a change in pull request #1358:  DRILL-6516: EMIT support 
in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201529291
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/lateraljoin/TestE2EUnnestAndLateral.java
 ##
 @@ -394,4 +394,47 @@ public void testLateral_HashAgg_with_nulls() throws Exception {
   .baselineValues("dd",222L)
   .build().run();
   }
+
+  @Test
+  public void testMultipleBatchesLateral_WithStreamingAgg() throws Exception {
+String sql = "SELECT t2.maxprice FROM (SELECT customer.c_orders AS 
c_orders FROM "
++ "dfs.`lateraljoin/multipleFiles/` customer) t1, LATERAL (SELECT 
CAST(MAX(t.ord.o_totalprice)"
++ " AS int) AS maxprice FROM UNNEST(t1.c_orders) t(ord) GROUP BY 
t.ord.o_orderstatus) t2";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
+.sqlQuery(sql)
+.unOrdered()
+.baselineColumns("maxprice")
+.baselineValues(367190)
+.baselineValues(316347)
+.baselineValues(146610)
+.baselineValues(306996)
+.baselineValues(235695)
+.baselineValues(177819)
+.build().run();
+  }
+
+  @Test
+  public void testLateral_StreamingAgg_with_nulls() throws Exception {
+String sql = "SELECT key, t3.dsls FROM cp.`lateraljoin/with_nulls.json` t 
LEFT OUTER "
++ "JOIN LATERAL (SELECT DISTINCT t2.sls AS dsls FROM UNNEST(t.sales) 
t2(sls)) t3 ON TRUE";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
+.sqlQuery(sql)
+.unOrdered()
+.baselineColumns("key","dsls")
+.baselineValues("aa",null)
+.baselineValues("bb",100L)
+.baselineValues("bb",200L)
+.baselineValues("bb",300L)
+.baselineValues("bb",400L)
+.baselineValues("cc",null)
+.baselineValues("dd",111L)
+.baselineValues("dd",222L)
+.build().run();
+  }
 }
 
 Review comment:
   That's MAX, not COUNT; though the difference should only be in the generated 
code for the aggregation function. 




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201528448
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -21,6 +21,17 @@
 #include "drill/common.hpp"
 #include "drill/drillClient.hpp"
 #include "streamSocket.hpp"
+#include "errmsgs.hpp"
+
+#if defined(IS_SSL_ENABLED)
+#include 
+#endif
+
+namespace
+{
+// The error message to indicate certificate verification failure.
 #define DRILL_BOOST_SSL_CERT_VERIFY_FAILED  "handshake: certificate verify failed\0"
 
 Review comment:
   Yup, I was concerned about this bit, too. Will do, thanks!




[GitHub] sohami commented on a change in pull request #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1373: DRILL-6517: Hash-Join: If 
not OK, exit early from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373#discussion_r201528460
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ##
 @@ -258,6 +251,13 @@ protected boolean prefetchFirstBatchFromBothSides() {
   return false;
 
 Review comment:
   `checkForEarlyFinish` checks for NONE from both sides. What if only one side receives NONE? In that case, too, `BatchMemoryManager.update()` should not be called.




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201527711
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/PojoDataType.java
 ##
 @@ -36,6 +36,7 @@
 
   private final List<SqlTypeName> types = Lists.newArrayList();
   private final List<String> names = Lists.newArrayList();
+  private final List<Boolean> nullables = Lists.newArrayList();
 
 Review comment:
   RecordDataType uses the `names` and `types` lists separately, so I'll need at least 2 lists.




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201527575
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/Nullability.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pojo;
+
+import java.lang.annotation.ElementType;
+import java.lang.annotation.Retention;
+import java.lang.annotation.RetentionPolicy;
+import java.lang.annotation.Target;
+
+/**
+ * Indicates Nullability
+ */
+@Retention(RetentionPolicy.RUNTIME)
+@Target(ElementType.FIELD)
+public @interface Nullability {
 
 Review comment:
   That would help. Let me take a look. 




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201527356
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/PojoDataType.java
 ##
 @@ -48,6 +49,11 @@ public PojoDataType(Class<?> pojoClass) {
   Class<?> type = f.getType();
   names.add(f.getName());
 
+  //Look up @Nullability for nullable property
+  Nullability nullability = f.getDeclaredAnnotation(Nullability.class);
+  nullables.add(nullability == null ? //Absence of annotation => (isNullable=true)
 
 Review comment:
   It seemed very verbose, but that makes sense. I'll make this change.




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201527169
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -199,6 +248,29 @@ class UserProperties;
 :Channel(ioService, host, port){
 }
 connectionStatus_t init();
+protected:
+/// @brief Handle protocol handshake exceptions for SSL specific failures.
+/// 
+/// @param in_errmsg    The error message.
+/// 
+/// @return the connectionStatus.
+connectionStatus_t HandleProtocolHandshakeException(const char* errmsg) {
+if (!(((SSLChannelContext_t *)m_pContext)->GetCertificateHostnameVerificationStatus())){
+return handleError(
+CONN_HANDSHAKE_FAILED,
+getMessage(ERR_CONN_SSL_CN));
 
 Review comment:
   Will do. Thanks!




[GitHub] kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201527129
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/RecordDataType.java
 ##
 @@ -49,15 +55,25 @@
   public final RelDataType getRowType(RelDataTypeFactory factory) {
 final List<SqlTypeName> types = getFieldSqlTypeNames();
 final List<String> names = getFieldNames();
+final List<Boolean> nullables = getFieldNullability();
 final List<RelDataType> fields = Lists.newArrayList();
-for (final SqlTypeName typeName : types) {
+Iterator<SqlTypeName> typesIter = types.listIterator();
+Iterator<Boolean> nullabilityIter = nullables.listIterator();
 
 Review comment:
   I wanted to avoid changing existing implementation, so I added an extra list 
and iterated in parallel.  




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201527041
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.cpp
 ##
 @@ -211,6 +211,21 @@ ChannelContext* ChannelFactory::getChannelContext(channelType_t t, DrillUserProp
 }
 
 pChannelContext = new SSLChannelContext(props, tlsVersion, verifyMode);
+
+if (props->isPropSet(USERPROP_CUSTOM_SSLCTXOPTIONS)){
 
 Review comment:
   This is an optional setting that the client application can set, since the default configuration from Boost is somewhat limited. Also, depending on the version of Boost the client is compiled against, some configurations are not available.




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201527105
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -215,6 +287,52 @@ class UserProperties;
 static ChannelContext_t* getChannelContext(channelType_t t, DrillUserProperties* props);
 };
 
+/// @brief Hostname verification callback wrapper.
+class DrillSSLHostnameVerifier{
+public:
+/// @brief The constructor.
+/// 
+/// @param in_channel  The Channel.
+DrillSSLHostnameVerifier(Channel* in_channel) : m_channel(in_channel){
+DRILL_LOG(LOG_INFO)
+<< "DrillSSLHostnameVerifier::DrillSSLHostnameVerifier: + Enter +"
+<< std::endl;
+}
+
+/// @brief Perform certificate verification.
+/// 
+/// @param in_preverified   Pre-verified indicator.
+/// @param in_ctx   Verify context.
+bool operator()(
+bool in_preverified,
+boost::asio::ssl::verify_context& in_ctx){
+DRILL_LOG(LOG_INFO) << "DrillSSLHostnameVerifier::operator(): + Enter +" << std::endl;
+
+// Gets the channel context.
+SSLChannelContext_t* context = (SSLChannelContext_t*)(m_channel->getChannelContext());
+
+// Retrieve the host before we perform Host name verification.
+// This is because host with ZK mode is selected after the connect() function is called.
+boost::asio::ssl::rfc2818_verification verifier(m_channel->getEndpoint()->getHost().c_str());
+
+// Perform verification.
+bool verified = verifier(in_preverified, in_ctx);
+
+DRILL_LOG(LOG_DEBUG)
+<< "DrillSSLHostnameVerifier::operator(): Verification Result: "
+<< verified
+<< std::endl;
+
+// Sets the result back to the context.
+context->SetCertHostnameVerificationStatus(verified);
+return verified && in_preverified;
 
 Review comment:
   Oh yes, redundant! Thanks!




[GitHub] paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill Docker Container

2018-07-10 Thread GitBox
paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill 
Docker Container
URL: https://github.com/apache/drill/pull/1348#issuecomment-403997520
 
 
   If running a server, we may want to let the Drillbit shut down gracefully. To do this, intercept the `SIGTERM` coming into the container and forward it to the process. If the `start.sh` script `exec`s the Drillbit, then the Drillbit may handle the shutdown. Otherwise, you can have `start.sh` [handle the SIGTERM](https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86).
   
   The normal `drillbit.sh` script appears to do an `exec`, so the Drill server 
should end up as pid 1 (if your `start.sh` script does an `exec` to 
`drillbit.sh`.) But, check this in case anything has changed since we looked 
into this stuff for Drill-on-YARN. 




[GitHub] superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
superbstreak commented on a change in pull request #1366: [DRILL-6581] C++ 
Client SSL Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201526439
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.cpp
 ##
 @@ -211,6 +211,21 @@ ChannelContext* ChannelFactory::getChannelContext(channelType_t t, DrillUserProp
 }
 
 pChannelContext = new SSLChannelContext(props, tlsVersion, verifyMode);
+
+if (props->isPropSet(USERPROP_CUSTOM_SSLCTXOPTIONS)){
 
 Review comment:
   Thanks! 




[GitHub] paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill Docker Container

2018-07-10 Thread GitBox
paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill 
Docker Container
URL: https://github.com/apache/drill/pull/1348#issuecomment-403997146
 
 
   Note also that if any of the processes can launch child processes, something in the container must reap zombies. Bash will do it, as will a utility called [`tini`](https://github.com/krallin/tini).




[GitHub] paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill Docker Container

2018-07-10 Thread GitBox
paul-rogers commented on issue #1348: DRILL-6346: Create an Official Drill 
Docker Container
URL: https://github.com/apache/drill/pull/1348#issuecomment-403997092
 
 
   Another thought, FWIW. The container offered here is for the embedded Drillbit. A previous comment suggested it would be handy to run an actual Drill server (Drillbit). It would also be handy to run a Sqlline that connects to a remote Drillbit.
   
   These can be done by creating a `start.sh` or `entrypoint.sh` script in the container that parses arguments, then does what is requested. For example:
   
   ```
   start.sh [drillbit|embedded|sqlline]  ...
   ```
   
   Then, have `start.sh` do a switch on the first arg (with a pattern that accepts d*, e* or s*) and pass the args to that process. (For Sqlline, we need the JDBC arguments.)
   
   Might be helpful to see the Kubernetes Dockerfile provided with the Spark 
2.3/2.4 release since they had to wrestle with these same issues (launch Driver 
or Executor, later extended to History Server, etc.)




[GitHub] sohami opened a new pull request #1374: DRILL-6542 : IndexOutOfBoundsException for multilevel lateral queries…

2018-07-10 Thread GitBox
sohami opened a new pull request #1374: DRILL-6542 : IndexOutOfBoundsException 
for multilevel lateral queries…
URL: https://github.com/apache/drill/pull/1374
 
 
   … with schema changed partitioned complex data




[GitHub] sohami commented on issue #1374: DRILL-6542 : IndexOutOfBoundsException for multilevel lateral queries…

2018-07-10 Thread GitBox
sohami commented on issue #1374: DRILL-6542 : IndexOutOfBoundsException for 
multilevel lateral queries…
URL: https://github.com/apache/drill/pull/1374#issuecomment-403996831
 
 
   @parthchandra - Please help to review this.




[GitHub] Ben-Zvi opened a new pull request #1373: DRILL-6517: Hash-Join: If not OK, exit early from prefetchFirstBatchFromBothSides

2018-07-10 Thread GitBox
Ben-Zvi opened a new pull request #1373: DRILL-6517: Hash-Join: If not OK, exit 
early from prefetchFirstBatchFromBothSides
URL: https://github.com/apache/drill/pull/1373
 
 
When a running query is cancelled (e.g., when the disk is full while spilling), many next() calls return a STOP outcome. Any Hash-Join that is "sniffing" its incoming batches (by repeatedly calling next()) would proceed to update the memory manager regardless of the outcomes returned.
  For some abnormal outcomes (like STOP), the batch's container is **not initialized**. The update call needs the batch's row count, which for *RemovingRecordBatch* is implemented (differently) by invoking the container's getRowCount() -- so in this case the call fails.
 The fix: move all the checks for the abnormal conditions immediately after the "sniffing", thus returning early and avoiding the update of the memory manager.
 Longer-term question: how to implement getRowCount() for the record batch consistently.
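 For illustration, a toy runnable sketch of the early-exit shape described above (hedged: the enum, fields, and methods are stand-ins, not Drill's actual classes):
 
 ```
 public class EarlyExitSketch {
   enum IterOutcome { OK, NONE, STOP }
 
   static IterOutcome leftUpstream = IterOutcome.STOP;  // e.g., the query was cancelled
   static IterOutcome rightUpstream = IterOutcome.OK;
 
   static boolean prefetchFirstBatchFromBothSides() {
     // "Sniff" both incoming sides first (stand-in for the repeated next() calls).
     IterOutcome left = leftUpstream;
     IterOutcome right = rightUpstream;
 
     // Check all abnormal outcomes immediately after sniffing and return early:
     // the batch containers may be uninitialized here, so the memory-manager
     // update below must not run.
     if (left == IterOutcome.STOP || right == IterOutcome.STOP) {
       return false;
     }
 
     updateMemoryManager();  // safe: both containers are initialized
     return true;
   }
 
   static void updateMemoryManager() {
     // Would read the batches' row counts (e.g., via the container's getRowCount()).
   }
 
   public static void main(String[] args) {
     System.out.println("prefetch succeeded? " + prefetchFirstBatchFromBothSides());  // false
   }
 }
 ```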
   




[GitHub] sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL 
Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201519782
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -215,6 +287,52 @@ class UserProperties;
 static ChannelContext_t* getChannelContext(channelType_t t, DrillUserProperties* props);
 };
 
+/// @brief Hostname verification callback wrapper.
+class DrillSSLHostnameVerifier{
+public:
+/// @brief The constructor.
+/// 
+/// @param in_channel  The Channel.
+DrillSSLHostnameVerifier(Channel* in_channel) : m_channel(in_channel){
+DRILL_LOG(LOG_INFO)
+<< "DrillSSLHostnameVerifier::DrillSSLHostnameVerifier: + Enter +"
+<< std::endl;
+}
+
+/// @brief Perform certificate verification.
+/// 
+/// @param in_preverified   Pre-verified indicator.
+/// @param in_ctx   Verify context.
+bool operator()(
+bool in_preverified,
+boost::asio::ssl::verify_context& in_ctx){
+DRILL_LOG(LOG_INFO) << "DrillSSLHostnameVerifier::operator(): + Enter +" << std::endl;
+
+// Gets the channel context.
+SSLChannelContext_t* context = (SSLChannelContext_t*)(m_channel->getChannelContext());
+
+// Retrieve the host before we perform Host name verification.
+// This is because host with ZK mode is selected after the connect() function is called.
+boost::asio::ssl::rfc2818_verification verifier(m_channel->getEndpoint()->getHost().c_str());
+
+// Perform verification.
+bool verified = verifier(in_preverified, in_ctx);
+
+DRILL_LOG(LOG_DEBUG)
+<< "DrillSSLHostnameVerifier::operator(): Verification Result: "
+<< verified
+<< std::endl;
+
+// Sets the result back to the context.
+context->SetCertHostnameVerificationStatus(verified);
+return verified && in_preverified;
 
 Review comment:
   I think we should just return the `verified` status here, not `(verified && in_preverified)`.




[GitHub] sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL 
Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201524731
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -21,6 +21,17 @@
 #include "drill/common.hpp"
 #include "drill/drillClient.hpp"
 #include "streamSocket.hpp"
+#include "errmsgs.hpp"
+
+#if defined(IS_SSL_ENABLED)
+#include 
+#endif
+
+namespace
+{
+// The error message to indicate certificate verification failure.
 #define DRILL_BOOST_SSL_CERT_VERIFY_FAILED  "handshake: certificate verify failed\0"
 
 Review comment:
   I don't think we can rely on this error string. Instead, it would be good to use something like the approach below for decoding SSL errors:
   https://stackoverflow.com/questions/9828066/how-to-decipher-a-boost-asio-ssl-error-code




[GitHub] sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL 
Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201521602
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.hpp
 ##
 @@ -199,6 +248,29 @@ class UserProperties;
 :Channel(ioService, host, port){
 }
 connectionStatus_t init();
+protected:
+/// @brief Handle protocol handshake exceptions for SSL specific failures.
+/// 
+/// @param in_errmsg    The error message.
+/// 
+/// @return the connectionStatus.
+connectionStatus_t HandleProtocolHandshakeException(const char* errmsg) {
+if (!(((SSLChannelContext_t *)m_pContext)->GetCertificateHostnameVerificationStatus())){
+return handleError(
+CONN_HANDSHAKE_FAILED,
+getMessage(ERR_CONN_SSL_CN));
 
 Review comment:
   Would be useful to preserve the original error message in this case as well. 
Please change to `getMessage(ERR_CONN_SSL_CN, errmsg)`




[GitHub] sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL Implementation Fixes/Improvements

2018-07-10 Thread GitBox
sohami commented on a change in pull request #1366: [DRILL-6581] C++ Client SSL 
Implementation Fixes/Improvements
URL: https://github.com/apache/drill/pull/1366#discussion_r201435888
 
 

 ##
 File path: contrib/native/client/src/clientlib/channel.cpp
 ##
 @@ -211,6 +211,21 @@ ChannelContext* ChannelFactory::getChannelContext(channelType_t t, DrillUserProp
 }
 
 pChannelContext = new SSLChannelContext(props, tlsVersion, verifyMode);
+
+if (props->isPropSet(USERPROP_CUSTOM_SSLCTXOPTIONS)){
 
 Review comment:
   No need to check `isPropSet` since you are already checking for `!sslOptions.empty()` below.
   
   Also, I have a question regarding these custom SSL context options. Based on the documentation [here](https://www.openssl.org/docs/man1.0.2/ssl/SSL_CTX_set_options.html), they provide workarounds for the listed bugs. But whether the handling for those workarounds is available depends on the internal SSL implementation. If the handling is available, it will be used, since this option is set during `m_SSLContext` creation (see [here](https://github.com/apache/drill/pull/1366/commits/093b4cadb4653b2e51fa13e7baadef3d0d6b8c91#diff-4649cdc0895f6abaeb47ff3f6a10eec4R104)). So it doesn't look like we need this separate custom option setter?




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201523513
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/PojoDataType.java
 ##
 @@ -48,6 +49,11 @@ public PojoDataType(Class<?> pojoClass) {
   Class<?> type = f.getType();
   names.add(f.getName());
 
+  //Look up @Nullability for nullable property
+  Nullability nullability = f.getDeclaredAnnotation(Nullability.class);
+  nullables.add(nullability == null ? //Absence of annotation => (isNullable=true)
 
 Review comment:
   `f.isAnnotationPresent(Nullability.class)` returns a `boolean` result; it may simplify the code here.
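   For illustration, a small runnable sketch of that simplification (hedged: this `Nullability` is a hypothetical marker standing in for the PR's annotation, whose members are not shown in the diff):
   
   ```
   import java.lang.annotation.ElementType;
   import java.lang.annotation.Retention;
   import java.lang.annotation.RetentionPolicy;
   import java.lang.annotation.Target;
   import java.lang.reflect.Field;
   
   public class NullabilitySketch {
     @Retention(RetentionPolicy.RUNTIME)
     @Target(ElementType.FIELD)
     @interface Nullability {}
   
     static class Pojo {
       public String user;                // not annotated
       @Nullability public long session;  // annotated
     }
   
     public static void main(String[] args) {
       for (Field f : Pojo.class.getDeclaredFields()) {
         // Instead of fetching the annotation and null-checking it,
         // isAnnotationPresent() yields the boolean directly:
         boolean annotated = f.isAnnotationPresent(Nullability.class);
         System.out.println(f.getName() + " annotated=" + annotated);
       }
     }
   }
   ```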




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201523061
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/Nullability.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.pojo;
+
+import java.lang.annotation.ElementType;
+import java.lang.annotation.Retention;
+import java.lang.annotation.RetentionPolicy;
+import java.lang.annotation.Target;
+
+/**
+ * Indicates Nullability
+ */
+@Retention(RetentionPolicy.RUNTIME)
+@Target(ElementType.FIELD)
+public @interface Nullability {
 
 Review comment:
   Turns out that Java has existing `@Nonnull` and `@Nullable` annotations in `javax.annotation`. Can we use those?




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201522572
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/RecordDataType.java
 ##
 @@ -49,15 +55,25 @@
   public final RelDataType getRowType(RelDataTypeFactory factory) {
 final List<SqlTypeName> types = getFieldSqlTypeNames();
 final List<String> names = getFieldNames();
+final List<Boolean> nullables = getFieldNullability();
 final List<RelDataType> fields = Lists.newArrayList();
-for (final SqlTypeName typeName : types) {
+Iterator<SqlTypeName> typesIter = types.listIterator();
+Iterator<Boolean> nullabilityIter = nullables.listIterator();
 
 Review comment:
   We now have four correlated lists. This is often difficult to reason about and maintain.
   
   Is it possible to have a single list of, say, `FieldDefn` objects, each of which has a name, type, nullability and `RelDataType`?
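   For illustration, a hedged sketch of the single-list shape being suggested (`FieldDefn` and its members are illustrative, not Drill's actual types; a plain `String` stands in for Calcite's `SqlTypeName`):
   
   ```
   import java.util.ArrayList;
   import java.util.List;
   
   public class FieldDefnSketch {
     // One object per field, replacing the parallel name/type/nullability lists.
     static final class FieldDefn {
       final String name;
       final String sqlTypeName;
       final boolean nullable;
   
       FieldDefn(String name, String sqlTypeName, boolean nullable) {
         this.name = name;
         this.sqlTypeName = sqlTypeName;
         this.nullable = nullable;
       }
     }
   
     public static void main(String[] args) {
       List<FieldDefn> fields = new ArrayList<>();
       fields.add(new FieldDefn("user", "VARCHAR", true));
       fields.add(new FieldDefn("session_count", "BIGINT", false));
   
       // A single iteration replaces the correlated iterators over parallel lists.
       for (FieldDefn f : fields) {
         System.out.println(f.name + " " + f.sqlTypeName + (f.nullable ? " NULL" : " NOT NULL"));
       }
     }
   }
   ```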




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201523626
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/PojoDataType.java
 ##
 @@ -84,4 +90,8 @@ public PojoDataType(Class<?> pojoClass) {
 return names;
   }
 
+  @Override
+  public List<Boolean> getFieldNullability() {
 
 Review comment:
   Same comment about the awkwardness of correlated lists...




[GitHub] paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
paul-rogers commented on a change in pull request #1371: DRILL-6588: Make Sys 
tables of nullable datatypes
URL: https://github.com/apache/drill/pull/1371#discussion_r201523179
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/pojo/PojoDataType.java
 ##
 @@ -36,6 +36,7 @@
 
   private final List<SqlTypeName> types = Lists.newArrayList();
   private final List<String> names = Lists.newArrayList();
+  private final List<Boolean> nullables = Lists.newArrayList();
 
 Review comment:
   Again, a single list of objects rather than three correlated lists of primitives?




[GitHub] gparai opened a new pull request #1372: DRILL-6589: Push transitive closure predicates past aggregates/projects

2018-07-10 Thread GitBox
gparai opened a new pull request #1372: DRILL-6589: Push transitive closure 
predicates past aggregates/projects
URL: https://github.com/apache/drill/pull/1372
 
 
   @amansinha100 / @vdiravka please review the PR. Thanks!




[GitHub] okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader in all non-complex column queries

2018-07-10 Thread GitBox
okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader in all non-complex column queries
URL: https://github.com/apache/drill/pull/1370#discussion_r201517913
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = "src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
+  + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+  throw(ioe);
+}
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
 
 Review comment:
   I will fix this and all the other test issues you highlighted. While fixing them, I realised that the existing `complex.parquet` test file possibly doesn't provide the desired coverage for the changes in this PR. The bulk of the changes are aimed at making the utility functions work with schemas like
   `a`
   `b`.`a`
   while the schema in `complex.parquet` doesn't contain such corner cases.
   Does it make sense to add a new test resource to cover such cases and base the tests on that file?




[GitHub] priteshm commented on issue #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
priteshm commented on issue #1358:  DRILL-6516: EMIT support in streaming agg
URL: https://github.com/apache/drill/pull/1358#issuecomment-403981936
 
 
   @Ben-Zvi any more comments on this one?




[jira] [Created] (DRILL-6589) Push transitive closure generated predicates past aggregates/projects

2018-07-10 Thread Gautam Kumar Parai (JIRA)
Gautam Kumar Parai created DRILL-6589:
-

 Summary: Push transitive closure generated predicates past aggregates/projects
 Key: DRILL-6589
 URL: https://issues.apache.org/jira/browse/DRILL-6589
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Gautam Kumar Parai
Assignee: Gautam Kumar Parai
 Fix For: 1.14.0








Re: [DISCUSS] 1.14.0 release

2018-07-10 Thread Kunal Khatua
I have 2 fixes (minor UX and a trivial SysTable patch) that are needed for 
1.14.0
DRILL-6583 (already has a +1) 
DRILL-6588 (pending review)

~Kunal

On 7/10/2018 12:02:36 AM, Arina Yelchiyeva  wrote:
Hi Boaz,

as far as I understand, you either move this feature to 1.15 or wait to build
the RC until it's finished. Adding changes later means you'll have to build
one more release candidate and conduct one more vote.
I think making people vote twice instead of once is not a good thing :)


Kind regards,
Arina

On Tue, Jul 10, 2018 at 8:07 AM Boaz Ben-Zvi wrote:

> Hi Charles,
>
> The main reason for rushing a Release Candidate is so that we can
> give it enough testing.
>
> Given that DRILL-6104 is a separate feature, with almost no impact on
> the current code, then it seems low risk to add it a few days later.
>
> Anyone has an objection ?
>
> Boaz
>
> On 7/9/18 9:54 PM, Charles Givre wrote:
> > Hi Boaz,
> > I’m traveling at the moment, but I can have DRILL-6104 back in Paul’s
> hands by the end of the week.
> > —C
> >
> >> On Jul 10, 2018, at 00:53, Boaz Ben-Zvi wrote:
> >>
> >> We are making progress towards 1.14.
> >>
> >> Let's aim for a Release Candidate branch off on Thursday (July 12) !!!
> >>
> >> Below are the unfinished cases; can most be completed and checked in by
> 7/12 ?
> >>
> >> (( Relevant people:
> >>
> >> Abhishek, Arina, Boaz, Charles, Hanumath, Jean-Blas, Karthik,
> Kunal,
> >>
> >> Parth, Paul, Salim, Sorabh, Tim, Vitalii, Vlad, Volodymyr ))
> >>
> >> ==
> >>
> >> Open/blocker - DRILL-6453 + DRILL-6517:
> >> Two issues - Parquet Scanner (?) not setting container's record num
> (to zero), and a hang following this failure.
> >> Currently testing a fix / workaround ((Boaz))
> >>
> >> In Progress - DRILL-6104: Generic Logfile Format Plugin ((Charles +
> Paul -- can you be done by 7/12 ?))
> >>
> >> PR - DRILL-5999 (DRILL-6516): Support for EMIT outcome in Streaming Agg
> ((Parth + Boaz reviewing))
> >>
> >> Open - DRILL-6542: Index out of bounds ((Sorabh))
> >>
> >> Open - DRILL-6475: Unnest Null fieldId pointer ((Hanumath))
> >>
> >>  The following PRs are still waiting for reviews 
> >>
> >> DRILL-6583: UI usability issue ((Kunal / Sorabh))
> >>
> >> DRILL-6579: Add sanity checks to the Parquet Reader ((Salim / Vlad +
> Boaz))
> >>
> >> DRILL-6578: handle query cancellation in Parquet Reader ((Salim / Vlad
> + Boaz))
> >>
> >> DRILL-6560: Allow options for controlling the batch size per operator
> ((Salim / Karthik))
> >>
> >> DRILL-6559: Travis timing out ((Vitalii / Tim))
> >>
> >> DRILL-6496: VectorUtil.showVectorAccessibleContent does not log vector
> content ((Tim / Volodymyr))
> >>
> >> DRILL-6410: Memory Leak in Parquet Reader during cancellation ((Vlad /
> Parth))
> >>
> >> DRILL-6346: Create an Official Drill Docker Container ((Abhishek / Tim))
> >>
> >> DRILL-6179: Added pcapng-format support ((Vlad / Paul))
> >>
> >> DRILL-5796: Filter pruning for multi rowgroup parquet file ((Jean-Blas
> / Arina))
> >>
> >> DRILL-5365: FileNotFoundException when reading a parquet file ((Tim /
> Vitalii))
> >>
> >> ==
> >>
> >> Thanks,
> >>
> >> Boaz
> >>
> >> On 7/6/18 2:51 PM, Pritesh Maker wrote:
> >>> Here is the release 1.14 dashboard (
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12332463
> ) and agile board (
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=185
> )
> >>>
> >>> I believe Volodymyr is targeting DRILL-6422 (Guava update) for 1.15
> release so it shouldn't be blocking the release. So overall, we have 2 open
> bugs, 2 in progress bugs (+2 doc issues), and 12 in review (+1 ready to
> commit).
> >>>
> >>> If the reviewable commits won't be ready soon, can the developers
> please remove the 1.14 fix version for these issues?
> >>>
> >>> Pritesh
> >>>
> >>>
> >>>
> >>>
> >>> On 7/6/18, 11:54 AM, "Boaz Ben-Zvi"
> <b...@mapr.com> wrote:
> >>>
> >>> Current status: There's a blocker, and some work in progress
> that will
> >>> stretch into next week.
> >>> Current detail:
> >>> ==
> >>> Open/blocker - DRILL-6453 + DRILL-6517: Two issues - Parquet
> Scanner not setting record num (to zero), and a hang following this failure.
> >>> In Progress - DRILL-6104: Generic Logfile Format Plugin
> >>> PR - DRILL-6422: Update Guava to 23.0 and shade it
> >>> PR - DRILL-5999 (DRILL-6516): Support for EMIT outcome in
> Streaming Agg (I'm reviewing)
> >>> Ready2Commit: DRILL-6519: Add String Distance and
> Phonetic Functions (Arina gave it a +1 ; is it "Ready-To-Commit"

[GitHub] sachouche commented on issue #1360: DRILL-6578: Handle query cancellation in Parquet reader

2018-07-10 Thread GitBox
sachouche commented on issue #1360: DRILL-6578: Handle query cancellation in 
Parquet reader
URL: https://github.com/apache/drill/pull/1360#issuecomment-403957132
 
 
   @vrozov,
   can you please review this PR? I have implemented the changes as agreed 
(last week). 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sachouche commented on issue #1361: DRILL-6579: Added sanity checks to the Parquet reader to avoid infini…

2018-07-10 Thread GitBox
sachouche commented on issue #1361: DRILL-6579: Added sanity checks to the 
Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#issuecomment-403956690
 
 
   @vrozov 
   Can you please approve this PR, as there is a deadline for Drill 1.14?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sachouche commented on a change in pull request #1361: DRILL-6579: Added sanity checks to the Parquet reader to avoid infini…

2018-07-10 Thread GitBox
sachouche commented on a change in pull request #1361: DRILL-6579: Added sanity 
checks to the Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#discussion_r201484528
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenNullableFixedEntryReader.java
 ##
 @@ -38,14 +39,16 @@
   /** {@inheritDoc} */
   @Override
   final VarLenColumnBulkEntry getEntry(int valuesToRead) {
-assert columnPrecInfo.precision >= 0 : "Fixed length precision cannot be 
lower than zero";
+Preconditions.checkArgument(columnPrecInfo.precision >= 0, "Fixed length 
precision cannot be lower than zero");
 
 Review comment:
   You are correct; pushed the check to the CTOR.
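
   As a rough illustration of that change (a simplified sketch with illustrative 
names, not the PR's actual class): the immutable precision is validated once in 
the constructor instead of on every `getEntry()` call.

   ```java
   import com.google.common.base.Preconditions;

   class FixedEntryReaderSketch {
     private final int precision;

     FixedEntryReaderSketch(int precision) {
       // Checked once at construction time; precision never changes afterwards.
       Preconditions.checkArgument(precision >= 0,
           "Fixed length precision cannot be lower than zero");
       this.precision = precision;
     }

     int getEntry(int valuesToRead) {
       // Only the per-call argument still needs validating here.
       Preconditions.checkArgument(valuesToRead > 0, "valuesToRead must be positive");
       return Math.min(valuesToRead, precision); // placeholder body for the sketch
     }
   }
   ```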


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] sachouche commented on a change in pull request #1361: DRILL-6579: Added sanity checks to the Parquet reader to avoid infini…

2018-07-10 Thread GitBox
sachouche commented on a change in pull request #1361: DRILL-6579: Added sanity 
checks to the Parquet reader to avoid infini…
URL: https://github.com/apache/drill/pull/1361#discussion_r201484409
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenFixedEntryReader.java
 ##
 @@ -37,14 +38,15 @@
   /** {@inheritDoc} */
   @Override
   final VarLenColumnBulkEntry getEntry(int valuesToRead) {
-assert columnPrecInfo.precision >= 0 : "Fixed length precision cannot be 
lower than zero";
+Preconditions.checkArgument(columnPrecInfo.precision >= 0, "Fixed length 
precision cannot be lower than zero");
 
 Review comment:
   Pushed the check to the CTOR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201458543
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/record/RecordBatchStats.java
 ##
 @@ -100,6 +108,119 @@ private String getQueryId(FragmentContext _context) {
   }
   return "NA";
 }
+
+private boolean isBatchStatsEnabledForOperator(FragmentContext context, 
OperatorContext oContext) {
+  // The configuration can select what operators should log batch 
statistics
+  final String statsLoggingOperator = 
context.getOptions().getString(ExecConstants.STATS_LOGGING_BATCH_OPERATOR_OPTION).toUpperCase();
+  final String allOperatorsStr = "ALL";
+
+  // All operators are allowed to log batch statistics
+  if (allOperatorsStr.equals(statsLoggingOperator)) {
+return true;
+  }
+
+  // No, only a select few are allowed; syntax: 
operator-id-1,operator-id-2,..
+  final String[] operators = statsLoggingOperator.split(",");
+  final String operatorId = oContext.getStats().getId().toUpperCase();
+
+  for (int idx = 0; idx < operators.length; idx++) {
+// We use "contains" because the operator identifier is a composite 
string; e.g., 3:[PARQUET_ROW_GROUP_SCAN]
+if (operatorId.contains(operators[idx])) {
+  return true;
+}
+  }
+
+  return false;
+}
+  }
+
+  /**
+   * @see {@link RecordBatchStats#logRecordBatchStats(String, RecordBatch, 
RecordBatchStatsContext)}
+   */
+  public static void logRecordBatchStats(RecordBatch recordBatch,
 
 Review comment:
   there seem to be no callers of this function. is this meant for operators 
which don't have a sourceId ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201459812
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/record/RecordBatchStats.java
 ##
 @@ -100,6 +108,119 @@ private String getQueryId(FragmentContext _context) {
   }
   return "NA";
 }
+
+private boolean isBatchStatsEnabledForOperator(FragmentContext context, 
OperatorContext oContext) {
+  // The configuration can select what operators should log batch 
statistics
+  final String statsLoggingOperator = 
context.getOptions().getString(ExecConstants.STATS_LOGGING_BATCH_OPERATOR_OPTION).toUpperCase();
+  final String allOperatorsStr = "ALL";
+
+  // All operators are allowed to log batch statistics
+  if (allOperatorsStr.equals(statsLoggingOperator)) {
+return true;
+  }
+
+  // No, only a select few are allowed; syntax: 
operator-id-1,operator-id-2,..
+  final String[] operators = statsLoggingOperator.split(",");
+  final String operatorId = oContext.getStats().getId().toUpperCase();
+
+  for (int idx = 0; idx < operators.length; idx++) {
+// We use "contains" because the operator identifier is a composite 
string; e.g., 3:[PARQUET_ROW_GROUP_SCAN]
+if (operatorId.contains(operators[idx])) {
+  return true;
+}
+  }
+
+  return false;
+}
+  }
+
+  /**
+   * @see {@link RecordBatchStats#logRecordBatchStats(String, RecordBatch, 
RecordBatchStatsContext)}
+   */
+  public static void logRecordBatchStats(RecordBatch recordBatch,
+RecordBatchStatsContext batchStatsContext) {
+
+logRecordBatchStats(null, recordBatch, batchStatsContext);
+  }
+
+  /**
+   * Logs record batch statistics for the input record batch (logging happens 
only
+   * when record statistics logging is enabled).
+   *
+   * @param sourceId optional source identifier for scanners
+   * @param recordBatch a set of records
+   * @param batchStatsContext batch stats context object
+   */
+  public static void logRecordBatchStats(String sourceId,
+RecordBatch recordBatch,
+RecordBatchStatsContext batchStatsContext) {
+
+if (!batchStatsContext.isEnableBatchSzLogging()) {
+  return; // NOOP
+}
+
+final String statsId = batchStatsContext.getContextOperatorId();
+final boolean verbose = batchStatsContext.isEnableFgBatchSzLogging();
+final String msg = printRecordBatchStats(statsId, sourceId, recordBatch, 
verbose);
+
+logBatchStatsMsg(batchStatsContext, msg, false);
+  }
+
+  /**
+   * Logs a generic batch statistics message
+   *
+   * @param message log message
+   * @param batchStatsLogging
+   * @param batchStatsContext batch stats context object
+   */
+  public static void logRecordBatchStats(String message,
+RecordBatchStatsContext batchStatsContext) {
+
+if (!batchStatsContext.isEnableBatchSzLogging()) {
+  return; // NOOP
+}
+
+logBatchStatsMsg(batchStatsContext, message, true);
+  }
+
+  /**
+   * Prints a materialized field type
+   * @param field materialized field
+   * @param msg string builder where to append the field type
+   */
+  /*
+  public static void printFieldType(MaterializedField field, StringBuilder 
msg) {
 
 Review comment:
   commented-out function. please remove.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201167369
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
 ##
 @@ -704,5 +704,8 @@ public static String bootDefaultFor(String name) {
   public static final String STATS_LOGGING_FG_BATCH_SIZE_OPTION = 
"drill.exec.stats.logging.fine_grained.batch_size";
   public static final BooleanValidator STATS_LOGGING_BATCH_FG_SIZE_VALIDATOR = 
new BooleanValidator(STATS_LOGGING_FG_BATCH_SIZE_OPTION);
 
+  /** Controls the list of operators for which batch sizing stats should be 
enabled */
 
 Review comment:
   Can you please explain the motivation for the naming hierarchy that you have 
chosen for this option ? I would suggest 
"drill.exec.stats.logging.batch_size.enabled_operators"


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201441390
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/record/RecordBatchStats.java
 ##
 @@ -100,6 +108,119 @@ private String getQueryId(FragmentContext _context) {
   }
   return "NA";
 }
+
+private boolean isBatchStatsEnabledForOperator(FragmentContext context, 
OperatorContext oContext) {
+  // The configuration can select what operators should log batch 
statistics
+  final String statsLoggingOperator = 
context.getOptions().getString(ExecConstants.STATS_LOGGING_BATCH_OPERATOR_OPTION).toUpperCase();
+  final String allOperatorsStr = "ALL";
+
+  // All operators are allowed to log batch statistics
+  if (allOperatorsStr.equals(statsLoggingOperator)) {
+return true;
+  }
+
+  // No, only a select few are allowed; syntax: 
operator-id-1,operator-id-2,..
 
 Review comment:
   Are the operator-ids in statsLoggingOperator supposed to have "_"s in them? 
Otherwise, it looks like they will not match something like 
"3:[PARQUET_ROW_GROUP_SCAN]" (see the sketch below).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201465077
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/util/record/RecordBatchStats.java
 ##
 @@ -100,6 +108,119 @@ private String getQueryId(FragmentContext _context) {
   }
   return "NA";
 }
+
+private boolean isBatchStatsEnabledForOperator(FragmentContext context, 
OperatorContext oContext) {
+  // The configuration can select what operators should log batch 
statistics
+  final String statsLoggingOperator = 
context.getOptions().getString(ExecConstants.STATS_LOGGING_BATCH_OPERATOR_OPTION).toUpperCase();
+  final String allOperatorsStr = "ALL";
+
+  // All operators are allowed to log batch statistics
+  if (allOperatorsStr.equals(statsLoggingOperator)) {
+return true;
+  }
+
+  // No, only a select few are allowed; syntax: 
operator-id-1,operator-id-2,..
+  final String[] operators = statsLoggingOperator.split(",");
+  final String operatorId = oContext.getStats().getId().toUpperCase();
+
+  for (int idx = 0; idx < operators.length; idx++) {
+// We use "contains" because the operator identifier is a composite 
string; e.g., 3:[PARQUET_ROW_GROUP_SCAN]
+if (operatorId.contains(operators[idx])) {
+  return true;
+}
+  }
+
+  return false;
+}
+  }
+
+  /**
+   * @see {@link RecordBatchStats#logRecordBatchStats(String, RecordBatch, 
RecordBatchStatsContext)}
+   */
+  public static void logRecordBatchStats(RecordBatch recordBatch,
+RecordBatchStatsContext batchStatsContext) {
+
+logRecordBatchStats(null, recordBatch, batchStatsContext);
+  }
+
+  /**
+   * Logs record batch statistics for the input record batch (logging happens 
only
+   * when record statistics logging is enabled).
+   *
+   * @param sourceId optional source identifier for scanners
+   * @param recordBatch a set of records
+   * @param batchStatsContext batch stats context object
+   */
+  public static void logRecordBatchStats(String sourceId,
 
 Review comment:
   Can 'sourceId' be renamed to scanSourceId?
   
   If an operator wants to print both incoming and outgoing batches, how can 
that be done in this logging framework in a way where you can distinguish 
between the two? (One possibility is sketched below.)
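
   For instance, a hypothetical usage sketch that reuses the existing String 
parameter to label direction (the import paths are assumed from the file path 
in the diff; the "INCOMING"/"OUTGOING" labels are illustrative, not the PR's 
convention):

   ```java
   import org.apache.drill.exec.record.RecordBatch;
   import org.apache.drill.exec.util.record.RecordBatchStats;
   import org.apache.drill.exec.util.record.RecordBatchStats.RecordBatchStatsContext;

   class BatchDirectionLogging {
     // Logs both sides of an operator with distinguishable ids.
     static void logBoth(RecordBatch incoming, RecordBatch outgoing,
                         RecordBatchStatsContext ctx) {
       RecordBatchStats.logRecordBatchStats("INCOMING", incoming, ctx);
       RecordBatchStats.logRecordBatchStats("OUTGOING", outgoing, ctx);
     }
   }
   ```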
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced the batch statistics logging enablement

2018-07-10 Thread GitBox
bitblender commented on a change in pull request #1355: DRILL-6560: Enhanced 
the batch statistics logging enablement
URL: https://github.com/apache/drill/pull/1355#discussion_r201186631
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VarLenBinaryReader.java
 ##
 @@ -34,11 +34,12 @@
 import 
org.apache.drill.exec.store.parquet.columnreaders.batchsizing.RecordBatchSizerManager.FieldOverflowState;
 import 
org.apache.drill.exec.store.parquet.columnreaders.batchsizing.RecordBatchSizerManager.FieldOverflowStateContainer;
 import 
org.apache.drill.exec.store.parquet.columnreaders.batchsizing.RecordBatchSizerManager.VarLenColumnBatchStats;
+import org.apache.drill.exec.util.record.RecordBatchStats;
 import org.apache.drill.exec.vector.ValueVector;
 
 /** Class which handles reading a batch of rows from a set of variable columns 
*/
 public class VarLenBinaryReader {
-  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(VarLenBinaryReader.class);
+//  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(VarLenBinaryReader.class);
 
 Review comment:
   please remove commented line


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kkhatua commented on issue #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua commented on issue #1371: DRILL-6588: Make Sys tables of nullable 
datatypes
URL: https://github.com/apache/drill/pull/1371#issuecomment-403914728
 
 
   @amansinha100 can you review this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] kkhatua opened a new pull request #1371: DRILL-6588: Make Sys tables of nullable datatypes

2018-07-10 Thread GitBox
kkhatua opened a new pull request #1371: DRILL-6588: Make Sys tables of 
nullable datatypes
URL: https://github.com/apache/drill/pull/1371
 
 
   This addresses the issue of columns in the System tables being marked as 
non-nullable. While these tables are immutable, they can carry nullable values 
as well (e.g., `num_val` in `sys.options`).
   This commit introduces an annotation for the `PojoReader` that applies the 
nullable property when it is explicitly defined for a column (i.e., an explicitly 
declared `isNullable` for a member of the POJO instance). A rough sketch of the 
idea is shown below, followed by session output.
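
   For illustration, a self-contained sketch of the reflection-based idea (the 
annotation and field names here are hypothetical, not the PR's actual API):

   ```java
   import java.lang.annotation.ElementType;
   import java.lang.annotation.Retention;
   import java.lang.annotation.RetentionPolicy;
   import java.lang.annotation.Target;

   // Hypothetical marker annotation a POJO-based reader could inspect.
   @Retention(RetentionPolicy.RUNTIME)
   @Target(ElementType.FIELD)
   @interface IsNullable {}

   class OptionPojo {
     String name;               // no marker -> non-nullable column
     @IsNullable Long num_val;  // marked -> nullable column (may hold null)
   }

   public class NullabilityCheck {
     public static void main(String[] args) throws Exception {
       boolean nullable = OptionPojo.class.getDeclaredField("num_val")
           .isAnnotationPresent(IsNullable.class);
       System.out.println("num_val nullable: " + nullable); // true
     }
   }
   ```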
   ```
   apache drill 1.14.0-SNAPSHOT 
   "just drill it"
   0: jdbc:drill:schema=sys> select count(*) from sys.options where num_val is 
null;
   +-+
   | EXPR$0  |
   +-+
   | 108 |
   +-+
   1 row selected (2.703 seconds)
   0: jdbc:drill:schema=sys> select count(*) from sys.options where 
isnull(num_val);
   +-+
   | EXPR$0  |
   +-+
   | 108 |
   +-+
   1 row selected (0.3 seconds)
   0: jdbc:drill:schema=sys> select distinct is_nullable, count(*) from 
INFORMATION_SCHEMA.`COLUMNS` where table_schema = 'sys' group by is_nullable;
   +--+-+
   | is_nullable  | EXPR$1  |
   +--+-+
   | NO   | 36  |
   | YES  | 50  |
   +--+-+
   2 rows selected (0.69 seconds)
   0: jdbc:drill:schema=sys> describe options;
   +---++--+
   |COLUMN_NAME| DATA_TYPE  | IS_NULLABLE  |
   +---++--+
   | name  | CHARACTER VARYING  | NO   |
   | kind  | CHARACTER VARYING  | NO   |
   | accessibleScopes  | CHARACTER VARYING  | NO   |
   | optionScope   | CHARACTER VARYING  | NO   |
   | status| CHARACTER VARYING  | NO   |
   | num_val   | BIGINT | YES  |
   | string_val| CHARACTER VARYING  | YES  |
   | bool_val  | BOOLEAN| YES  |
   | float_val | DOUBLE | YES  |
   +---++--+
   9 rows selected (0.221 seconds)
   0: jdbc:drill:schema=sys> describe internal_options;
   +---++--+
   |COLUMN_NAME| DATA_TYPE  | IS_NULLABLE  |
   +---++--+
   | name  | CHARACTER VARYING  | NO   |
   | kind  | CHARACTER VARYING  | NO   |
   | accessibleScopes  | CHARACTER VARYING  | NO   |
   | optionScope   | CHARACTER VARYING  | NO   |
   | status| CHARACTER VARYING  | NO   |
   | num_val   | BIGINT | YES  |
   | string_val| CHARACTER VARYING  | YES  |
   | bool_val  | BOOLEAN| YES  |
   | float_val | DOUBLE | YES  |
   +---++--+
   9 rows selected (0.185 seconds)
   0: jdbc:drill:schema=sys> describe options_val;
   +---++--+
   |COLUMN_NAME| DATA_TYPE  | IS_NULLABLE  |
   +---++--+
   | name  | CHARACTER VARYING  | NO   |
   | kind  | CHARACTER VARYING  | NO   |
   | accessibleScopes  | CHARACTER VARYING  | NO   |
   | val   | CHARACTER VARYING  | YES  |
   | optionScope   | CHARACTER VARYING  | NO   |
   +---++--+
   5 rows selected (0.183 seconds)
   0: jdbc:drill:schema=sys> describe profiles_json;
   +--++--+
   | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
   +--++--+
   | queryId  | CHARACTER VARYING  | NO   |
   | json | CHARACTER VARYING  | YES  |
   +--++--+
   2 rows selected (0.727 seconds)
   0: jdbc:drill:schema=sys> describe profiles;
   +--++--+
   | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
   +--++--+
   | queryId  | CHARACTER VARYING  | NO   |
   | startTime| TIMESTAMP  | YES  |
   | foreman  | CHARACTER VARYING  | NO   |
   | fragments| BIGINT | YES  |
   | user | CHARACTER VARYING  | YES  |
   | queue| CHARACTER VARYING  | YES  |
   | planTime | BIGINT | YES  |
   | queueTime| BIGINT | YES  |
   | executeTime  | BIGINT | YES  |
   | totalTime| BIGINT | YES  |
   | state| CHARACTER VARYING  | YE

[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201422339
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,70 +62,72 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
   private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics)exprStat).getMax()) {
+return RowsMatch.NONE;
+  }
+  return ((BooleanStatistics)exprStat).getMin() && 
((BooleanStatistics)exprStat).getMax() ? checkNull(exprStat) : RowsMatch.SOME;
+});
   }
 
   /**
* IS FALSE predicate.
*/
   private static LogicalExpression createIsFalsePredicate(LogicalExpression 
expr) {
 return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if min value is not false or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+  exprStat.hasNonNullValue() && ((BooleanStatistics)exprStat).getMin() || 
isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : 
checkNull(exprStat)
 
 Review comment:
   Is it necessary to check for max to be False before returning ALL? The same 
for "is not true/false".


This is an automated message from the Apache Git Service.
To respond to the message, please

[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201420744
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,70 +62,72 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
   private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics)exprStat).getMax()) {
+return RowsMatch.NONE;
+  }
+  return ((BooleanStatistics)exprStat).getMin() && 
((BooleanStatistics)exprStat).getMax() ? checkNull(exprStat) : RowsMatch.SOME;
 
 Review comment:
   It is the same pattern: if min is True, max must be True. It is necessary to 
check for min only.
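
   A small runnable illustration of this ordering argument, assuming 
BooleanStatistics orders values with `Boolean.compare` as noted in this thread:

   ```java
   public class BooleanStatsOrdering {
     public static void main(String[] args) {
       // Boolean.compare(false, true) < 0, i.e. false < true.
       System.out.println(Boolean.compare(false, true)); // -1
       // Since min <= v <= max holds for every value v in the rowgroup,
       // min == true forces every value, including max, to be true; a
       // rowgroup with min == true and max == false has corrupted statistics.
     }
   }
   ```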


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on issue #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
arina-ielchiieva commented on issue #1298: DRILL-5796: Filter pruning for multi 
rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#issuecomment-403894483
 
 
   @jbimbert I am not asking you to change the existing code, just the parts 
you have made changes in.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on issue #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on issue #1298: DRILL-5796: Filter pruning for multi 
rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#issuecomment-403890029
 
 
@arina-ielchiieva I have been asked to "keep the original format. Note 
that the new format makes it harder to review on the github", and when I 
applied checkstyle, I had to "revert back all formatting-only changes" (see the 
previous comment).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201412121
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
 
 Review comment:
   OK. Removed getMin. Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201410502
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression createIsTruePredicate(LogicalExpression expr) {
 
 Review comment:
   C dropped. Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on issue #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
arina-ielchiieva commented on issue #1298: DRILL-5796: Filter pruning for multi 
rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#issuecomment-403882818
 
 
   I don't think that would be hard to fix; @jbimbert, please make sure you 
correct the checkstyle issues. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201406425
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression createIsTruePredicate(LogicalExpression expr) {
 
 Review comment:
   It is not a question of preference, but of validity. What would it mean to use 
`ParquetIsPredicate.<C>createIsTruePredicate()`?
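
   A compilable sketch of the concern (illustrative names): a type parameter 
that no argument or return type uses constrains nothing, so any explicit 
instantiation is interchangeable.

   ```java
   public class UnusedTypeParam {
     static <C extends Comparable<C>> String createIsTruePredicate() {
       return "behavior is identical for any C";
     }

     public static void main(String[] args) {
       // Both calls are indistinguishable; C is never used.
       System.out.println(UnusedTypeParam.<Integer>createIsTruePredicate());
       System.out.println(UnusedTypeParam.<String>createIsTruePredicate());
     }
   }
   ```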


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201405251
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
 
 Review comment:
   min=True and max=False points to corrupted statistics; please see the comparator 
(`Boolean.compare`) for BooleanStatistics. If you have an example, it is 
necessary to see why the statistics are corrupted. Note that if the order for 
Boolean is not properly defined, it is not valid to assume that if min is 
True(False) and max is True(False), the rest of the values are also True(False).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on issue #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on issue #1298: DRILL-5796: Filter pruning for multi rowgroup 
parquet file
URL: https://github.com/apache/drill/pull/1298#issuecomment-403878332
 
 
   @arina-ielchiieva There are a few formatting issues (like double spaces and 
missing spaces), but I am not paying attention to them as they are not enforced 
by the Drill checkstyle. If you want them to be fixed, please let @jbimbert 
know.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201400269
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java
 ##
 @@ -59,7 +59,7 @@ public ParquetMetaStatCollector(ParquetTableMetadataBase 
parquetTableMetadata,
 // Reasons to pass implicit columns and their values:
 // 1. Differentiate implicit columns from regular non-exist columns. 
Implicit columns do not
 //exist in parquet metadata. Without such knowledge, implicit columns 
is treated as non-exist
-//column.  A condition on non-exist column would lead to canDrop = 
true, which is not the
+//column.  A condition on non-exist column would lead to matches = 
true, which is not the
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201399967
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetFooterStatCollector.java
 ##
 @@ -59,7 +59,7 @@ public ParquetFooterStatCollector(ParquetMetadata footer, 
int rowGroupIndex, Map
 // Reasons to pass implicit columns and their values:
 // 1. Differentiate implicit columns from regular non-exist columns. 
Implicit columns do not
 //exist in parquet metadata. Without such knowledge, implicit columns 
is treated as non-exist
-//column.  A condition on non-exist column would lead to canDrop = 
true, which is not the
+//column.  A condition on non-exist column would lead to matches = 
true, which is not the
 
 Review comment:
   ALL. Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201398953
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetFooterStatCollector.java
 ##
 @@ -59,7 +59,7 @@ public ParquetFooterStatCollector(ParquetMetadata footer, 
int rowGroupIndex, Map
 // Reasons to pass implicit columns and their values:
 // 1. Differentiate implicit columns from regular non-exist columns. 
Implicit columns do not
 //exist in parquet metadata. Without such knowledge, implicit columns 
is treated as non-exist
-//column.  A condition on non-exist column would lead to canDrop = 
true, which is not the
+//column.  A condition on non-exist column would lead to matches = 
true, which is not the
 
 Review comment:
   I guess it is either ALL or NONE, but not `true`.
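   For context, a minimal sketch of the three-valued result this refers to; the 
constant names NONE, SOME and ALL come from the diff above, while the meanings 
in the comments are inferred, not quoted from the PR:

 // Assumed shape of the enum returned by matches(); only the constant
 // names appear in the PR diff, the semantics below are inferred.
 public enum RowsMatch {
   NONE, // no row in the rowgroup can match -> the rowgroup can be pruned
   SOME, // some rows may match -> keep the rowgroup, filter at execution time
   ALL   // every row matches -> keep the rowgroup, the filter can be dropped
 }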


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201399120
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java
 ##
 @@ -59,7 +59,7 @@ public ParquetMetaStatCollector(ParquetTableMetadataBase 
parquetTableMetadata,
 // Reasons to pass implicit columns and their values:
 // 1. Differentiate implicit columns from regular non-exist columns. 
Implicit columns do not
 //exist in parquet metadata. Without such knowledge, implicit columns 
is treated as non-exist
-//column.  A condition on non-exist column would lead to canDrop = 
true, which is not the
+//column.  A condition on non-exist column would lead to matches = 
true, which is not the
 
 Review comment:
   The same as above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201395530
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
+return RowsMatch.NONE;
+  }
+  return ((BooleanStatistics)exprStat).getMin() && 
((BooleanStatistics)exprStat).getMax() ? checkNull(exprStat) : RowsMatch.SOME;
+});
   }
 
   /**
* IS FALSE predicate.
*/
-  private static LogicalExpression createIsFalsePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if min value is not false or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+  private static <C extends Comparable<C>> LogicalExpression 
createIsFalsePredicate(LogicalExpression expr) {
 
 Review comment:
   See previous comment


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168272
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -154,83 +188,174 @@ public void buildSchema() throws SchemaChangeException {
   public IterOutcome innerNext() {
 
 // if a special batch has been sent, we have no data in the incoming so 
exit early
-if (specialBatchSent) {
-  return IterOutcome.NONE;
+if ( done || specialBatchSent) {
+  return NONE;
+}
+
+// We sent an OK_NEW_SCHEMA and also encountered the end of a data set. So 
we need to send
+// an EMIT with an empty batch now
+if (sendEmit) {
+  sendEmit = false;
+  firstBatchForDataSet = true;
+  recordCount = 0;
+  return EMIT;
 }
 
 // this is only called on the first batch. Beyond this, the aggregator 
manages batches.
 if (aggregator == null || first) {
-  IterOutcome outcome;
   if (first && incoming.getRecordCount() > 0) {
 first = false;
-outcome = IterOutcome.OK_NEW_SCHEMA;
+lastKnownOutcome = OK_NEW_SCHEMA;
   } else {
-outcome = next(incoming);
+lastKnownOutcome = next(incoming);
   }
-  logger.debug("Next outcome of {}", outcome);
-  switch (outcome) {
-  case NONE:
-if (first && popConfig.getKeys().size() == 0) {
+  logger.debug("Next outcome of {}", lastKnownOutcome);
+  switch (lastKnownOutcome) {
+case NONE:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OUT_OF_MEMORY:
+case NOT_YET:
+case STOP:
+  return lastKnownOutcome;
+case OK_NEW_SCHEMA:
+  if (!createAggregator()) {
+done = true;
+return IterOutcome.STOP;
+  }
+  break;
+case EMIT:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+firstBatchForDataSet = true; // reset on the next iteration
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OK:
+  break;
+default:
+  throw new IllegalStateException(String.format("unknown outcome %s", 
lastKnownOutcome));
+  }
+} else {
+  if ( lastKnownOutcome != NONE && firstBatchForDataSet && 
!aggregator.isDone()) {
+lastKnownOutcome = incoming.next();
+if (!first && firstBatchForDataSet) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170448
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java
 ##
 @@ -189,83 +209,128 @@ else if (isSame( previousIndex, currentIndex )) {
   logger.debug("Received IterOutcome of {}", out);
 }
 switch (out) {
-case NONE:
-  done = true;
-  lastOutcome = out;
-  if (first && addedRecordCount == 0) {
-return setOkAndReturn();
-  } else if (addedRecordCount > 0) {
-outputToBatchPrev(previous, previousIndex, outputCount); // No 
need to check the return value
-// (output container full or not) as we are not going to 
insert any more records.
-if (EXTRA_DEBUG) {
-  logger.debug("Received no more batches, returning.");
+  case NONE:
+done = true;
+lastOutcome = out;
+if (firstBatchForDataSet && addedRecordCount == 0) {
+  return setOkAndReturn(out);
+} else if (addedRecordCount > 0) {
+  outputToBatchPrev(previous, previousIndex, outputCount); // 
No need to check the return value
+  // (output container full or not) as we are not going to 
insert any more records.
+  if (EXTRA_DEBUG) {
+logger.debug("Received no more batches, returning.");
+  }
+  return setOkAndReturn(out);
+} else {
+  // not first batch and record Count == 0
+  outcome = out;
+  return AggOutcome.CLEANUP_AND_RETURN;
 }
-return setOkAndReturn();
-  } else {
-if (first && out == IterOutcome.OK) {
-  out = IterOutcome.OK_NEW_SCHEMA;
+// EMIT is handled like OK, except that we do not loop back to 
process the
+// next incoming batch; we return instead
+  case EMIT:
+if (incoming.getRecordCount() == 0) {
+  if (addedRecordCount > 0) {
+outputToBatchPrev(previous, previousIndex, outputCount);
+  }
+  resetIndex();
+  return setOkAndReturn(out);
+} else {
+  resetIndex();
+  if (previousIndex != -1 && isSamePrev(previousIndex, 
previous, currentIndex)) {
+if (EXTRA_DEBUG) {
+  logger.debug("New value was same as last value of 
previous batch, adding.");
+}
+addRecordInc(currentIndex);
+previousIndex = currentIndex;
+incIndex();
+if (EXTRA_DEBUG) {
+  logger.debug("Continuing outside");
+}
+processRemainingRecordsInBatch();
+// currentIndex has been reset to int_max so use previous 
index.
+outputToBatch(previousIndex);
+resetIndex();
+return setOkAndReturn(out);
+  } else { // not the same
+if (EXTRA_DEBUG) {
+  logger.debug("This is not the same as the previous, add 
record and continue outside.");
+}
+if (addedRecordCount > 0) {
+  if (outputToBatchPrev(previous, previousIndex, 
outputCount)) {
+if (EXTRA_DEBUG) {
+  logger.debug("Output container is full. flushing 
it.");
+}
+return setOkAndReturn(out);
+  }
+}
+previousIndex = -1;
+processRemainingRecordsInBatch();
+outputToBatch(previousIndex); // currentIndex has been 
reset to int_max so use previous index.
+resetIndex();
+return setOkAndReturn(out);
+  }
 }
-outcome = out;
-return AggOutcome.CLEANUP_AND_RETURN;
-  }
-
-case NOT_YET:
-  this.outcome = out;
-  return AggOutcome.RETURN_OUTCOME;
-
-case OK_NEW_SCHEMA:
-  if (EXTRA_DEBUG) {
-logger.debug("Received new schema.  Batch has {} records.", 
incoming.getRecordCount());
-  }
-  if (addedRecordCount > 0) {
-outputToBatchPrev(previous, previousIndex, outputCount); // No 
need to check the return value
-// (output container full or not) as we are not going to 
insert anymore records.

[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170343
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java
 ##
 @@ -189,83 +209,128 @@ else if (isSame( previousIndex, currentIndex )) {
   logger.debug("Received IterOutcome of {}", out);
 }
 switch (out) {
-case NONE:
-  done = true;
-  lastOutcome = out;
-  if (first && addedRecordCount == 0) {
-return setOkAndReturn();
-  } else if (addedRecordCount > 0) {
-outputToBatchPrev(previous, previousIndex, outputCount); // No 
need to check the return value
-// (output container full or not) as we are not going to 
insert any more records.
-if (EXTRA_DEBUG) {
-  logger.debug("Received no more batches, returning.");
+  case NONE:
+done = true;
+lastOutcome = out;
+if (firstBatchForDataSet && addedRecordCount == 0) {
+  return setOkAndReturn(out);
+} else if (addedRecordCount > 0) {
+  outputToBatchPrev(previous, previousIndex, outputCount); // 
No need to check the return value
+  // (output container full or not) as we are not going to 
insert any more records.
+  if (EXTRA_DEBUG) {
+logger.debug("Received no more batches, returning.");
+  }
+  return setOkAndReturn(out);
+} else {
+  // not first batch and record Count == 0
+  outcome = out;
+  return AggOutcome.CLEANUP_AND_RETURN;
 }
-return setOkAndReturn();
-  } else {
-if (first && out == IterOutcome.OK) {
-  out = IterOutcome.OK_NEW_SCHEMA;
+// EMIT is handled like OK, except that we do not loop back to 
process the
+// next incoming batch; we return instead
+  case EMIT:
+if (incoming.getRecordCount() == 0) {
+  if (addedRecordCount > 0) {
+outputToBatchPrev(previous, previousIndex, outputCount);
+  }
+  resetIndex();
+  return setOkAndReturn(out);
+} else {
+  resetIndex();
+  if (previousIndex != -1 && isSamePrev(previousIndex, 
previous, currentIndex)) {
+if (EXTRA_DEBUG) {
+  logger.debug("New value was same as last value of 
previous batch, adding.");
+}
+addRecordInc(currentIndex);
+previousIndex = currentIndex;
+incIndex();
+if (EXTRA_DEBUG) {
+  logger.debug("Continuing outside");
+}
+processRemainingRecordsInBatch();
+// currentIndex has been reset to int_max so use previous 
index.
+outputToBatch(previousIndex);
+resetIndex();
+return setOkAndReturn(out);
+  } else { // not the same
+if (EXTRA_DEBUG) {
+  logger.debug("This is not the same as the previous, add 
record and continue outside.");
+}
+if (addedRecordCount > 0) {
+  if (outputToBatchPrev(previous, previousIndex, 
outputCount)) {
+if (EXTRA_DEBUG) {
+  logger.debug("Output container is full. flushing 
it.");
+}
+return setOkAndReturn(out);
+  }
+}
+previousIndex = -1;
+processRemainingRecordsInBatch();
+outputToBatch(previousIndex); // currentIndex has been 
reset to int_max so use previous index.
+resetIndex();
+return setOkAndReturn(out);
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170630
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/lateraljoin/TestE2EUnnestAndLateral.java
 ##
 @@ -394,4 +394,47 @@ public void testLateral_HashAgg_with_nulls() throws 
Exception {
   .baselineValues("dd",222L)
   .build().run();
   }
+
+  @Test
+  public void testMultipleBatchesLateral_WithStreamingAgg() throws Exception {
+String sql = "SELECT t2.maxprice FROM (SELECT customer.c_orders AS 
c_orders FROM "
++ "dfs.`lateraljoin/multipleFiles/` customer) t1, LATERAL (SELECT 
CAST(MAX(t.ord.o_totalprice)"
++ " AS int) AS maxprice FROM UNNEST(t1.c_orders) t(ord) GROUP BY 
t.ord.o_orderstatus) t2";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
+.sqlQuery(sql)
+.unOrdered()
+.baselineColumns("maxprice")
+.baselineValues(367190)
+.baselineValues(316347)
+.baselineValues(146610)
+.baselineValues(306996)
+.baselineValues(235695)
+.baselineValues(177819)
+.build().run();
+  }
+
+  @Test
+  public void testLateral_StreamingAgg_with_nulls() throws Exception {
+String sql = "SELECT key, t3.dsls FROM cp.`lateraljoin/with_nulls.json` t 
LEFT OUTER "
++ "JOIN LATERAL (SELECT DISTINCT t2.sls AS dsls FROM UNNEST(t.sales) 
t2(sls)) t3 ON TRUE";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
+.sqlQuery(sql)
+.unOrdered()
+.baselineColumns("key","dsls")
+.baselineValues("aa",null)
+.baselineValues("bb",100L)
+.baselineValues("bb",200L)
+.baselineValues("bb",300L)
+.baselineValues("bb",400L)
+.baselineValues("cc",null)
+.baselineValues("dd",111L)
+.baselineValues("dd",222L)
+.build().run();
+  }
 }
 
 Review comment:
   Excellent suggestion. Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170290
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java
 ##
 @@ -189,83 +209,128 @@ else if (isSame( previousIndex, currentIndex )) {
   logger.debug("Received IterOutcome of {}", out);
 }
 switch (out) {
-case NONE:
-  done = true;
-  lastOutcome = out;
-  if (first && addedRecordCount == 0) {
-return setOkAndReturn();
-  } else if (addedRecordCount > 0) {
-outputToBatchPrev(previous, previousIndex, outputCount); // No 
need to check the return value
-// (output container full or not) as we are not going to 
insert any more records.
-if (EXTRA_DEBUG) {
-  logger.debug("Received no more batches, returning.");
+  case NONE:
+done = true;
+lastOutcome = out;
+if (firstBatchForDataSet && addedRecordCount == 0) {
+  return setOkAndReturn(out);
+} else if (addedRecordCount > 0) {
+  outputToBatchPrev(previous, previousIndex, outputCount); // 
No need to check the return value
+  // (output container full or not) as we are not going to 
insert any more records.
+  if (EXTRA_DEBUG) {
+logger.debug("Received no more batches, returning.");
+  }
+  return setOkAndReturn(out);
+} else {
+  // not first batch and record Count == 0
+  outcome = out;
+  return AggOutcome.CLEANUP_AND_RETURN;
 }
-return setOkAndReturn();
-  } else {
-if (first && out == IterOutcome.OK) {
-  out = IterOutcome.OK_NEW_SCHEMA;
+// EMIT is handled like OK, except that we do not loop back to 
process the
+// next incoming batch; we return instead
+  case EMIT:
+if (incoming.getRecordCount() == 0) {
+  if (addedRecordCount > 0) {
+outputToBatchPrev(previous, previousIndex, outputCount);
+  }
+  resetIndex();
+  return setOkAndReturn(out);
+} else {
+  resetIndex();
+  if (previousIndex != -1 && isSamePrev(previousIndex, 
previous, currentIndex)) {
+if (EXTRA_DEBUG) {
+  logger.debug("New value was same as last value of 
previous batch, adding.");
+}
+addRecordInc(currentIndex);
+previousIndex = currentIndex;
+incIndex();
+if (EXTRA_DEBUG) {
+  logger.debug("Continuing outside");
+}
+processRemainingRecordsInBatch();
+// currentIndex has been reset to int_max so use previous 
index.
+outputToBatch(previousIndex);
+resetIndex();
+return setOkAndReturn(out);
+  } else { // not the same
+if (EXTRA_DEBUG) {
+  logger.debug("This is not the same as the previous, add 
record and continue outside.");
+}
+if (addedRecordCount > 0) {
+  if (outputToBatchPrev(previous, previousIndex, 
outputCount)) {
+if (EXTRA_DEBUG) {
+  logger.debug("Output container is full. flushing 
it.");
+}
 
 Review comment:
   Yes, it does appear out of place, but it isn't really. It is actually tied to 
the fact that we must reset several state variables (this being one of them) 
every time EMIT is processed.
   Also, while debugging I found places where I had missed resetting 
previousIndex, so this is a safe place to do so. 
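   A minimal sketch of that reset pattern; previousIndex, addedRecordCount and 
firstBatchForDataSet all appear in the PR, but grouping them into one helper 
like this is an assumption, not the PR's actual code:

 // Reset the per-data-set state whenever an EMIT boundary is processed,
 // so no code path can carry a stale index into the next data set.
 private void resetStateOnEmit() {
   previousIndex = -1;          // index into vectors that EMIT just replaced
   addedRecordCount = 0;        // records aggregated for the current group
   firstBatchForDataSet = true; // next incoming batch starts a new data set
 }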


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168130
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -154,83 +188,174 @@ public void buildSchema() throws SchemaChangeException {
   public IterOutcome innerNext() {
 
 // if a special batch has been sent, we have no data in the incoming so 
exit early
-if (specialBatchSent) {
-  return IterOutcome.NONE;
+if ( done || specialBatchSent) {
+  return NONE;
+}
+
+// We sent an OK_NEW_SCHEMA and also encountered the end of a data set. So 
we need to send
+// an EMIT with an empty batch now
+if (sendEmit) {
+  sendEmit = false;
+  firstBatchForDataSet = true;
+  recordCount = 0;
+  return EMIT;
 }
 
 // this is only called on the first batch. Beyond this, the aggregator 
manages batches.
 if (aggregator == null || first) {
-  IterOutcome outcome;
   if (first && incoming.getRecordCount() > 0) {
 first = false;
-outcome = IterOutcome.OK_NEW_SCHEMA;
+lastKnownOutcome = OK_NEW_SCHEMA;
   } else {
-outcome = next(incoming);
+lastKnownOutcome = next(incoming);
   }
-  logger.debug("Next outcome of {}", outcome);
-  switch (outcome) {
-  case NONE:
-if (first && popConfig.getKeys().size() == 0) {
+  logger.debug("Next outcome of {}", lastKnownOutcome);
+  switch (lastKnownOutcome) {
+case NONE:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OUT_OF_MEMORY:
+case NOT_YET:
+case STOP:
+  return lastKnownOutcome;
+case OK_NEW_SCHEMA:
+  if (!createAggregator()) {
+done = true;
+return IterOutcome.STOP;
+  }
+  break;
+case EMIT:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+firstBatchForDataSet = true; // reset on the next iteration
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170512
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java
 ##
 @@ -297,16 +413,33 @@ private final void resetIndex() {
 incIndex();
   }
 
-  private final AggOutcome setOkAndReturn() {
-if (first) {
-  this.outcome = IterOutcome.OK_NEW_SCHEMA;
+  /**
+   * Set the outcome to OK (or OK_NEW_SCHEMA) and return the AggOutcome 
parameter
+   * @param outcome
+   * @return outcome
+   */
+  private final AggOutcome setOkAndReturn( IterOutcome outcome) {
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201393681
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
 
 Review comment:
   We can get rid of the C parameter in createIsTruePredicate, 
createIsFalsePredicate, createIsNotTruePredicate, and createIsNotFalsePredicate.
   Example:

 private static LogicalExpression createIsFalsePredicate(LogicalExpression expr) {
   return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
       exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
           || isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.NONE : checkNull(exprStat)
   );
 }

   Do you prefer this version?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168310
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -154,83 +188,174 @@ public void buildSchema() throws SchemaChangeException {
   public IterOutcome innerNext() {
 
 // if a special batch has been sent, we have no data in the incoming so 
exit early
-if (specialBatchSent) {
-  return IterOutcome.NONE;
+if ( done || specialBatchSent) {
+  return NONE;
+}
+
+// We sent an OK_NEW_SCHEMA and also encountered the end of a data set. So 
we need to send
+// an EMIT with an empty batch now
+if (sendEmit) {
+  sendEmit = false;
+  firstBatchForDataSet = true;
+  recordCount = 0;
+  return EMIT;
 }
 
 // this is only called on the first batch. Beyond this, the aggregator 
manages batches.
 if (aggregator == null || first) {
-  IterOutcome outcome;
   if (first && incoming.getRecordCount() > 0) {
 first = false;
-outcome = IterOutcome.OK_NEW_SCHEMA;
+lastKnownOutcome = OK_NEW_SCHEMA;
   } else {
-outcome = next(incoming);
+lastKnownOutcome = next(incoming);
   }
-  logger.debug("Next outcome of {}", outcome);
-  switch (outcome) {
-  case NONE:
-if (first && popConfig.getKeys().size() == 0) {
+  logger.debug("Next outcome of {}", lastKnownOutcome);
+  switch (lastKnownOutcome) {
+case NONE:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OUT_OF_MEMORY:
+case NOT_YET:
+case STOP:
+  return lastKnownOutcome;
+case OK_NEW_SCHEMA:
+  if (!createAggregator()) {
+done = true;
+return IterOutcome.STOP;
+  }
+  break;
+case EMIT:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+firstBatchForDataSet = true; // reset on the next iteration
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OK:
+  break;
+default:
+  throw new IllegalStateException(String.format("unknown outcome %s", 
lastKnownOutcome));
+  }
+} else {
+  if ( lastKnownOutcome != NONE && firstBatchForDataSet && 
!aggregator.isDone()) {
+lastKnownOutcome = incoming.next();
+if (!first && firstBatchForDataSet) {
+  //Setup needs to be called again. During setup, generated code saves 
a reference to the vectors
+  // pointed to by the incoming batch so that the dereferencing of the 
vector wrappers to get to
+  // the vectors  does not have to be done at each call to eval. 
However, after an EMIT is seen,
+  // the vectors are replaced and the reference to the old vectors is 
no longer valid
+  try {
+aggregator.setup(oContext, incoming, this);
+  } catch (SchemaChangeException e) {
+UserException.Builder exceptionBuilder = 
UserException.functionError(e)
+.message("A Schema change exception occured in calling setup() 
in generated code.");
+throw exceptionBuilder.build(logger);
+  }
+}
+  }
+  // We sent an EMIT in the previous iteration, so we must be starting a 
new data set
+  if (firstBatchForDataSet) {
+done = false;
+sendEmit = false;
+specialBatchSent = false;
+firstBatchForDataSet = false;
+  }
+}
+AggOutcome aggOutcome = aggregator.doWork(lastKnownOutcome);
+recordCount = aggregator.getOutputCount();
+container.setRecordCount(recordCount);
+logger.debug("Ag

[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170604
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/lateraljoin/TestE2EUnnestAndLateral.java
 ##
 @@ -394,4 +394,47 @@ public void testLateral_HashAgg_with_nulls() throws 
Exception {
   .baselineValues("dd",222L)
   .build().run();
   }
+
+  @Test
+  public void testMultipleBatchesLateral_WithStreamingAgg() throws Exception {
+String sql = "SELECT t2.maxprice FROM (SELECT customer.c_orders AS 
c_orders FROM "
++ "dfs.`lateraljoin/multipleFiles/` customer) t1, LATERAL (SELECT 
CAST(MAX(t.ord.o_totalprice)"
++ " AS int) AS maxprice FROM UNNEST(t1.c_orders) t(ord) GROUP BY 
t.ord.o_orderstatus) t2";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
+.sqlQuery(sql)
+.unOrdered()
+.baselineColumns("maxprice")
+.baselineValues(367190)
+.baselineValues(316347)
+.baselineValues(146610)
+.baselineValues(306996)
+.baselineValues(235695)
+.baselineValues(177819)
+.build().run();
+  }
+
+  @Test
+  public void testLateral_StreamingAgg_with_nulls() throws Exception {
+String sql = "SELECT key, t3.dsls FROM cp.`lateraljoin/with_nulls.json` t 
LEFT OUTER "
++ "JOIN LATERAL (SELECT DISTINCT t2.sls AS dsls FROM UNNEST(t.sales) 
t2(sls)) t3 ON TRUE";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
 
 Review comment:
   Ditto


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201170576
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/lateraljoin/TestE2EUnnestAndLateral.java
 ##
 @@ -394,4 +394,47 @@ public void testLateral_HashAgg_with_nulls() throws 
Exception {
   .baselineValues("dd",222L)
   .build().run();
   }
+
+  @Test
+  public void testMultipleBatchesLateral_WithStreamingAgg() throws Exception {
+String sql = "SELECT t2.maxprice FROM (SELECT customer.c_orders AS 
c_orders FROM "
++ "dfs.`lateraljoin/multipleFiles/` customer) t1, LATERAL (SELECT 
CAST(MAX(t.ord.o_totalprice)"
++ " AS int) AS maxprice FROM UNNEST(t1.c_orders) t(ord) GROUP BY 
t.ord.o_orderstatus) t2";
+
+testBuilder()
+.optionSettingQueriesForTestQuery("alter session set `%s` = true",
+PlannerSettings.STREAMAGG.getOptionName())
 
 Review comment:
   Had to do this because the HashAgg tests were disabling the streaming agg. 
Removed this and reset the option in the HashAgg tests.
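   A hedged sketch of the reset he describes, assuming a JUnit @After hook and 
the test() helper inherited from Drill's test base classes; only the option 
constant comes from the diff above:

 import org.junit.After;
 import org.apache.drill.exec.planner.physical.PlannerSettings;

 @After
 public void resetStreamAggOption() throws Exception {
   // Undo the per-test override so later tests see the default value again.
   test("ALTER SESSION RESET `%s`", PlannerSettings.STREAMAGG.getOptionName());
 }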


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168485
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -308,8 +433,7 @@ public void addComplexWriter(final 
BaseWriter.ComplexWriter writer) {
   private StreamingAggregator createAggregatorInternal() throws 
SchemaChangeException, ClassTransformationException, IOException{
 ClassGenerator cg = 
CodeGenerator.getRoot(StreamingAggTemplate.TEMPLATE_DEFINITION, 
context.getOptions());
 cg.getCodeGenerator().plainJavaCapable(true);
-// Uncomment out this line to debug the generated code.
-//cg.getCodeGenerator().saveCodeForDebugging(true);
+//  cg.getCodeGenerator().saveCodeForDebugging(true);
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168089
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -154,83 +188,174 @@ public void buildSchema() throws SchemaChangeException {
   public IterOutcome innerNext() {
 
 // if a special batch has been sent, we have no data in the incoming so 
exit early
-if (specialBatchSent) {
-  return IterOutcome.NONE;
+if ( done || specialBatchSent) {
+  return NONE;
+}
+
+// We sent an OK_NEW_SCHEMA and also encountered the end of a data set. So 
we need to send
+// an EMIT with an empty batch now
+if (sendEmit) {
+  sendEmit = false;
+  firstBatchForDataSet = true;
+  recordCount = 0;
+  return EMIT;
 }
 
 // this is only called on the first batch. Beyond this, the aggregator 
manages batches.
 if (aggregator == null || first) {
-  IterOutcome outcome;
   if (first && incoming.getRecordCount() > 0) {
 first = false;
-outcome = IterOutcome.OK_NEW_SCHEMA;
+lastKnownOutcome = OK_NEW_SCHEMA;
   } else {
-outcome = next(incoming);
+lastKnownOutcome = next(incoming);
   }
-  logger.debug("Next outcome of {}", outcome);
-  switch (outcome) {
-  case NONE:
-if (first && popConfig.getKeys().size() == 0) {
+  logger.debug("Next outcome of {}", lastKnownOutcome);
+  switch (lastKnownOutcome) {
+case NONE:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OUT_OF_MEMORY:
+case NOT_YET:
+case STOP:
+  return lastKnownOutcome;
+case OK_NEW_SCHEMA:
+  if (!createAggregator()) {
+done = true;
+return IterOutcome.STOP;
+  }
+  break;
+case EMIT:
 
 Review comment:
   Done and then undone. The if condition for EMIT is different from the one for 
the NONE case, so sharing it was a bug. The unit test you suggested caught it! 
Wonderful!
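   A condensed sketch of the distinction, reduced from the diff above (the 
surrounding state handling and fall-through cases are omitted):

 switch (lastKnownOutcome) {
   case NONE:
     if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
       constructSpecialBatch();
       specialBatchSent = true;
       // NONE itself is returned on the next call to innerNext()
       return getFinalOutcome();
     }
     break;
   case EMIT:
     if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
       constructSpecialBatch();
       specialBatchSent = true;
       firstBatchForDataSet = true; // extra reset: more data sets will follow
       // the special batch and the EMIT outcome are sent together
       return getFinalOutcome();
     }
     break;
 }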


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201169753
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -496,6 +620,30 @@ private void getIndex(ClassGenerator 
g) {
 }
   }
 
+  private IterOutcome getFinalOutcome() {
+IterOutcome outcomeToReturn;
+
+if(firstBatchForDataSet) {
+  firstBatchForDataSet = false;
+}
+if (firstBatchForSchema) {
+  outcomeToReturn = OK_NEW_SCHEMA;
+  firstBatchForSchema = false;
+} else if (recordCount == 0) {
+  // get the outcome to return before calling refresh since that resets 
the lastKnowOutcome to OK
+  outcomeToReturn = lastKnownOutcome == EMIT ? EMIT : NONE;
+  if (outcomeToReturn == EMIT) {
+firstBatchForDataSet = true;
+  }
+} else if (lastKnownOutcome == EMIT) {
 
 Review comment:
   Done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] parthchandra commented on a change in pull request #1358: DRILL-6516: EMIT support in streaming agg

2018-07-10 Thread GitBox
parthchandra commented on a change in pull request #1358:  DRILL-6516: EMIT 
support in streaming agg
URL: https://github.com/apache/drill/pull/1358#discussion_r201168413
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ##
 @@ -154,83 +188,174 @@ public void buildSchema() throws SchemaChangeException {
   public IterOutcome innerNext() {
 
 // if a special batch has been sent, we have no data in the incoming so 
exit early
-if (specialBatchSent) {
-  return IterOutcome.NONE;
+if ( done || specialBatchSent) {
+  return NONE;
+}
+
+// We sent an OK_NEW_SCHEMA and also encountered the end of a data set. So 
we need to send
+// an EMIT with an empty batch now
+if (sendEmit) {
+  sendEmit = false;
+  firstBatchForDataSet = true;
+  recordCount = 0;
+  return EMIT;
 }
 
 // this is only called on the first batch. Beyond this, the aggregator 
manages batches.
 if (aggregator == null || first) {
-  IterOutcome outcome;
   if (first && incoming.getRecordCount() > 0) {
 first = false;
-outcome = IterOutcome.OK_NEW_SCHEMA;
+lastKnownOutcome = OK_NEW_SCHEMA;
   } else {
-outcome = next(incoming);
+lastKnownOutcome = next(incoming);
   }
-  logger.debug("Next outcome of {}", outcome);
-  switch (outcome) {
-  case NONE:
-if (first && popConfig.getKeys().size() == 0) {
+  logger.debug("Next outcome of {}", lastKnownOutcome);
+  switch (lastKnownOutcome) {
+case NONE:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OUT_OF_MEMORY:
+case NOT_YET:
+case STOP:
+  return lastKnownOutcome;
+case OK_NEW_SCHEMA:
+  if (!createAggregator()) {
+done = true;
+return IterOutcome.STOP;
+  }
+  break;
+case EMIT:
+  if (firstBatchForDataSet && popConfig.getKeys().size() == 0) {
+// if we have a straight aggregate and empty input batch, we need 
to handle it in a different way
+constructSpecialBatch();
+// set state to indicate the fact that we have sent a special 
batch and input is empty
+specialBatchSent = true;
+firstBatchForDataSet = true; // reset on the next iteration
+// If outcome is NONE then we send the special batch in the first 
iteration and the NONE
+// outcome in the next iteration. If outcome is EMIT, we can send 
the special
+// batch and the EMIT outcome at the same time.
+return getFinalOutcome();
+  }
+  // else fall thru
+case OK:
+  break;
+default:
+  throw new IllegalStateException(String.format("unknown outcome %s", 
lastKnownOutcome));
+  }
+} else {
+  if ( lastKnownOutcome != NONE && firstBatchForDataSet && 
!aggregator.isDone()) {
+lastKnownOutcome = incoming.next();
+if (!first && firstBatchForDataSet) {
+  //Setup needs to be called again. During setup, generated code saves 
a reference to the vectors
+  // pointed to by the incoming batch so that the dereferencing of the 
vector wrappers to get to
+  // the vectors  does not have to be done at each call to eval. 
However, after an EMIT is seen,
+  // the vectors are replaced and the reference to the old vectors is 
no longer valid
+  try {
+aggregator.setup(oContext, incoming, this);
+  } catch (SchemaChangeException e) {
+UserException.Builder exceptionBuilder = 
UserException.functionError(e)
+.message("A Schema change exception occured in calling setup() 
in generated code.");
+throw exceptionBuilder.build(logger);
+  }
+}
+  }
+  // We sent an EMIT in the previous iteration, so we must be starting a 
new data set
+  if (firstBatchForDataSet) {
+done = false;
+sendEmit = false;
+specialBatchSent = false;
+firstBatchForDataSet = false;
+  }
+}
+AggOutcome aggOutcome = aggregator.doWork(lastKnownOutcome);
+recordCount = aggregator.getOutputCount();
+container.setRecordCount(recordCount);
+logger.debug("Ag

[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201391362
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
+return RowsMatch.NONE;
+  }
+  return ((BooleanStatistics)exprStat).getMin() && 
((BooleanStatistics)exprStat).getMax() ? checkNull(exprStat) : RowsMatch.SOME;
+});
   }
 
   /**
* IS FALSE predicate.
*/
-  private static LogicalExpression createIsFalsePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if min value is not false or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && ((BooleanStatistics) exprStat).getMin()
+  private static <C extends Comparable<C>> LogicalExpression 
createIsFalsePredicate(LogicalExpression expr) {
 
 Review comment:
   The same as for `createIsTruePredicate`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201391029
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
 return visitor.visitUnknown(this, value);
   }
 
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  /**
+   * Apply the filter condition against the meta of the rowgroup.
+   */
+  public RowsMatch matches(RangeExprEvaluator evaluator) {
 Statistics exprStat = expr.accept(evaluator, null);
-if (isNullOrEmpty(exprStat)) {
-  return false;
-}
+return isNullOrEmpty(exprStat) ? RowsMatch.SOME : 
predicate.apply(exprStat, evaluator);
+  }
 
-return predicate.test(exprStat, evaluator);
+  /**
+   * After applying the filter against the statistics of the rowgroup, 
if the result is RowsMatch.ALL,
+   * then we still must know if the rowgroup contains some null values, 
because it can change the filter result.
+   * If it contains some null values, then we change the RowsMatch.ALL into 
RowsMatch.SOME, which says that maybe
+   * some values (the null ones) should be discarded.
+   */
+  private static RowsMatch checkNull(Statistics exprStat) {
+return hasNoNulls(exprStat) ? RowsMatch.ALL : RowsMatch.SOME;
   }
 
   /**
* IS NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are no nulls  -> canDrop
-(exprStat, evaluator) -> hasNoNulls(exprStat)) {
-  private final boolean isArray = isArray(expr);
-
-  private boolean isArray(LogicalExpression expression) {
-if (expression instanceof TypedFieldExpr) {
-  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expression;
-  SchemaPath schemaPath = typedFieldExpr.getPath();
-  return schemaPath.isArray();
-}
-return false;
-  }
-
-  @Override
-  public boolean canDrop(RangeExprEvaluator evaluator) {
+  (exprStat, evaluator) -> {
 // for arrays we are not able to define exact number of nulls
 // [1,2,3] vs [1,2] -> in second case 3 is absent and thus it's null 
but statistics shows no nulls
-return !isArray && super.canDrop(evaluator);
-  }
-};
+if(expr instanceof TypedFieldExpr) {
+  TypedFieldExpr typedFieldExpr = (TypedFieldExpr) expr;
+  if (typedFieldExpr.getPath().isArray()) {
+return RowsMatch.SOME;
+  }
+}
+if (hasNoNulls(exprStat)) {
+  return RowsMatch.NONE;
+}
+return isAllNulls(exprStat, evaluator.getRowCount()) ? RowsMatch.ALL : 
RowsMatch.SOME;
+  });
   }
 
   /**
* IS NOT NULL predicate.
*/
   private static <C extends Comparable<C>> LogicalExpression 
createIsNotNullPredicate(LogicalExpression expr) {
 return new ParquetIsPredicate(expr,
-//if there are all nulls  -> canDrop
-(exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount())
+  (exprStat, evaluator) -> isAllNulls(exprStat, evaluator.getRowCount()) ? 
RowsMatch.NONE : checkNull(exprStat)
 );
   }
 
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
 
 Review comment:
   Can `C` be anything other than `Boolean`?
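
   For illustration, a rough sketch of a non-generic variant (assuming the predicate only ever receives boolean column statistics; `ParquetIsPredicate`, `RowsMatch`, `isAllNulls`, and `checkNull` are the ones from this diff, so the snippet is not compilable on its own):

```java
// Hypothetical shape without the type parameter: IS TRUE is only meaningful
// for boolean columns, so the statistics can be cast once up front and the
// <C extends Comparable<C>> bound disappears.
private static LogicalExpression createIsTruePredicate(LogicalExpression expr) {
  return new ParquetIsPredicate<Boolean>(expr, (exprStat, evaluator) -> {
    BooleanStatistics stat = (BooleanStatistics) exprStat;
    if (isAllNulls(stat, evaluator.getRowCount())
        || stat.hasNonNullValue() && !stat.getMin() && !stat.getMax()) {
      return RowsMatch.NONE;  // no value in the rowgroup can satisfy IS TRUE
    }
    return stat.getMin() && stat.getMax() ? checkNull(stat) : RowsMatch.SOME;
  });
}
```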


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] jbimbert commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
jbimbert commented on a change in pull request #1298: DRILL-5796: Filter 
pruning for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201390049
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
 
 Review comment:
   I have found min=False and max=True, but also min=True and max=False. Hence 
I cannot presuppose any order...
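
   To keep the check independent of the stored order, a minimal self-contained illustration (plain Java; the `min`/`max` booleans are hypothetical stand-ins for the statistics values):

```java
// Droppable for IS TRUE only when neither boundary is true; the result is
// the same for (min=false, max=true) and (min=true, max=false).
static boolean canDropForIsTrue(boolean min, boolean max) {
  return !min && !max;
}
```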


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning for multi rowgroup parquet file

2018-07-10 Thread GitBox
vrozov commented on a change in pull request #1298: DRILL-5796: Filter pruning 
for multi rowgroup parquet file
URL: https://github.com/apache/drill/pull/1298#discussion_r201387426
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/stat/ParquetIsPredicate.java
 ##
 @@ -62,90 +62,90 @@ private ParquetIsPredicate(LogicalExpression expr, 
BiPredicate, Ra
   /**
* IS TRUE predicate.
*/
-  private static LogicalExpression createIsTruePredicate(LogicalExpression 
expr) {
-return new ParquetIsPredicate(expr, (exprStat, evaluator) ->
-//if max value is not true or if there are all nulls  -> canDrop
-isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && !((BooleanStatistics) exprStat).getMax()
-);
+  private static <C extends Comparable<C>> LogicalExpression 
createIsTruePredicate(LogicalExpression expr) {
+return new ParquetIsPredicate(expr, (exprStat, evaluator) -> {
+  if (isAllNulls(exprStat, evaluator.getRowCount()) || 
exprStat.hasNonNullValue() && (!((BooleanStatistics)exprStat).getMin() && 
!((BooleanStatistics)exprStat).getMax())) {
 
 Review comment:
   For `BooleanStatistics` if `getMax()` is `false`, should not `getMin()` be 
also `false`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201328868
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.store.parquet.AbstractParquetScanBatchCreator;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.List;
+import java.util.ArrayList;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
 
 Review comment:
   You do not need to store `conf` at the class level.
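
   For illustration, a sketch of the setup with a local variable (assuming the same `footer` field, `path` constant, and imports as in this test, and letting the exception propagate):

```java
@BeforeClass
public static void setUpBeforeClass() throws IOException {
  // The Configuration is only needed while reading the footer, so keep it local.
  Configuration conf = new Configuration();
  footer = ParquetFileReader.readFooter(conf, new Path(path));
}
```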


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201327153
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+/**
+ * DRILL-5797 introduces more granularity on new reader use cases. This test 
is aimed at
+ * checking correctness of function used for new reader usage decision making.
+ */
+public class TestParquetReaderDecision {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
+  + " cwd: " + Paths.get(".").toAbsolutePath().normalize().toString());
+  throw(ioe);
+}
+  }
+
+  @Test
+  public void testParquetReaderDecision() {
+List<SchemaPath> caseOldReader1 = new ArrayList<>();
+List<SchemaPath> caseOldReader2 = new ArrayList<>();
+List<SchemaPath> caseOldReader3 = new ArrayList<>();
+List<SchemaPath> caseNewReader1 = new ArrayList<>();
+List<SchemaPath> caseNewReader2 = new ArrayList<>();
+List<SchemaPath> caseNewReader3 = new ArrayList<>();
+
+SchemaPath topNestedPath = SchemaPath.getCompoundPath("marketing_info");
+SchemaPath nestedColumnPath = SchemaPath.getCompoundPath("marketing_info", 
"camp_id");
+SchemaPath topPath1 = SchemaPath.getCompoundPath("trans_id");
+SchemaPath topPath2 = SchemaPath.getCompoundPath("amount");
+SchemaPath nonExistentPath = SchemaPath.getCompoundPath("nonexistent");
+
+caseOldReader1.add(topNestedPath);
+caseOldReader2.add(nestedColumnPath);
+caseOldReader3.add(topPath1);
+caseOldReader3.add(nestedColumnPath);
+
+caseNewReader1.add(topPath1);
+caseNewReader2.add(topPath1);
+caseNewReader2.add(topPath2);
+
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, 
caseOldReader1));
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, 
caseOldReader2));
+assertTrue("Complex column not detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, 
caseOldReader3));
+
+assertFalse("No complex column, but complex column is detected",
+AbstractParquetScanBatchCreator.containsComplexColumn(footer, 
caseNewReader1));
 
 Review comment:
   Though you have made the method public static for testing, that is not a good 
practice; move the method to a utility class instead. Semantically, 
`AbstractParquetScanBatchCreator` definitely does not need to expose this method 
as public static. As I have mentioned before, `isComplex` and 
`containsComplexColumn` should be one method, possibly a utility one, to ease 
testing.
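
   For instance, one possible shape of the merged utility, reusing the logic from this PR's diffs (the placement in `ParquetReaderUtility` is only a suggestion, and `containsComplexColumn` is the method quoted above):

```java
// Single public entry point: star queries scan the whole schema,
// explicit column lists are checked per column.
public static boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
  if (Utilities.isStarQuery(columns)) {
    MessageType schema = footer.getFileMetaData().getSchema();
    for (Type type : schema.getFields()) {
      if (!type.isPrimitive()) {
        return true;   // nested column present
      }
    }
    for (ColumnDescriptor col : schema.getColumns()) {
      if (col.getMaxRepetitionLevel() > 0) {
        return true;   // repeated column present
      }
    }
    return false;
  }
  return containsComplexColumn(footer, columns);
}
```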


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201326149
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderDecision.java
 ##
 @@ -0,0 +1,99 @@
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
+conf = new Configuration();
+
+try {
+  footer = ParquetFileReader.readFooter(conf, new Path(path));
+} catch (IOException ioe) {
+  fail("Could not read Parquet file '" + path + "', error message: " + 
ioe.getMessage()
 
 Review comment:
   Please use `String.format`. Also, you are only passing the error message but not 
the full stack trace, which might be useful. I would not catch the exception at all 
and would just add it to the method signature. When the test fails, it would be 
obvious what happened.
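
   A sketch of the suggested formatting, in case a custom message is still wanted (hypothetical; otherwise simply declare `throws IOException` on the method and let JUnit print the full stack trace):

```java
fail(String.format("Could not read Parquet file '%s', error message: %s, cwd: %s",
    path, ioe.getMessage(), Paths.get(".").toAbsolutePath().normalize()));
```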


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201325527
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -191,4 +201,42 @@ protected AbstractDrillFileSystemManager(OperatorContext 
operatorContext) {
 protected abstract DrillFileSystem get(Configuration config, String path) 
throws ExecutionSetupException;
   }
 
+  /**
+   * Check whether any of the queried columns is nested or repeated.
+   *
+   * @param footer  Parquet file schema
+   * @param columns list of query SchemaPath objects
+   */
+  public static boolean containsComplexColumn(ParquetMetadata footer, 
List<SchemaPath> columns) {
+
+Map<String, ColumnDescriptor> colDescMap = 
ParquetReaderUtility.getColNameToColumnDescriptorMapping(footer);
+Map<String, SchemaElement> schemaElements = 
ParquetReaderUtility.getColNameToSchemaElementMapping(footer);
+
+for (SchemaPath schemaPath : columns) {
+  // non-nested column check: full path must be equal to root segment path
+  if (!schemaPath.getUnIndexed().toString().replaceAll("`", "")
+  .equalsIgnoreCase(schemaPath.getRootSegment().getPath())) {
+logger.debug("Forcing 'old' reader due to nested column: {}", 
schemaPath.getUnIndexed().toString());
+return true;
+  }
+
+  // following column descriptor lookup failure may mean two cases, 
depending on subsequent SchemaElement lookup:
+  // 1. success: queried column is complex => use old reader
+  // 2. failure: queried column is not in schema => use new reader
+  ColumnDescriptor column = 
colDescMap.get(schemaPath.getUnIndexed().toString().toLowerCase());
+
+  if (column == null) {
+SchemaElement se = 
schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
 
 Review comment:
   Not in favor of naming variables `se`; it does not convey any meaning.
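
   For instance, at the quoted call site a descriptive name makes the lookup self-explanatory:

```java
SchemaElement schemaElement =
    schemaElements.get(schemaPath.getUnIndexed().toString().toLowerCase());
```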


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201354650
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/AbstractParquetScanBatchCreator.java
 ##
 @@ -161,20 +167,24 @@ private ParquetMetadata readFooter(Configuration conf, 
String path) throws IOExc
 }
   }
 
-  private boolean isComplex(ParquetMetadata footer) {
+  private boolean isComplex(ParquetMetadata footer, List<SchemaPath> columns) {
 MessageType schema = footer.getFileMetaData().getSchema();
 
-for (Type type : schema.getFields()) {
-  if (!type.isPrimitive()) {
-return true;
+if (Utilities.isStarQuery(columns)) {
+  for (Type type : schema.getFields()) {
+if (!type.isPrimitive()) {
+  return true;
+}
   }
-}
-for (ColumnDescriptor col : schema.getColumns()) {
-  if (col.getMaxRepetitionLevel() > 0) {
-return true;
+  for (ColumnDescriptor col : schema.getColumns()) {
+if (col.getMaxRepetitionLevel() > 0) {
+  return true;
+}
   }
+  return false;
+} else {
+  return containsComplexColumn(footer, columns);
 
 Review comment:
   I based this one on the original PR, but indeed it's a valid point. I will 
regroup accordingly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201352437
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderUtility.java
 ##
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.drill.exec.store.parquet.ParquetReaderUtility;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.format.SchemaElement;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.ParquetFileReader;
+
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.Map;
+
+public class TestParquetReaderUtility {
+
+  private static final String path = 
"src/test/resources/store/parquet/complex/complex.parquet";
+  private static Configuration conf;
+  private static ParquetMetadata footer;
+
+  @BeforeClass
+  public static void setUpBeforeClass() throws Exception {
 
 Review comment:
   Actually, I took it from another test; I saw it in many other locations, so I 
considered it consistent with the existing naming convention. I'll correct it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
okalinin commented on a change in pull request #1370: DRILL-5797: Use Parquet 
new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201350337
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
 return out;
   }
 
+  /**
+   * Map full schema paths in format `a`.`b`.`c` to respective SchemaElement 
objects.
+   *
+   * @param footer Parquet file metadata
+   * @return   schema full path to SchemaElement map
+   */
   public static Map<String, SchemaElement> 
getColNameToSchemaElementMapping(ParquetMetadata footer) {
-HashMap<String, SchemaElement> schemaElements = new HashMap<>();
+Map<String, SchemaElement> schemaElements = new HashMap<>();
 FileMetaData fileMetaData = new 
ParquetMetadataConverter().toParquetMetadata(ParquetFileWriter.CURRENT_VERSION, 
footer);
-for (SchemaElement se : fileMetaData.getSchema()) {
-  schemaElements.put(se.getName(), se);
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
+  iter.next();
+}
+while (iter.hasNext()) {
+  addSchemaElementMapping(iter, new StringBuilder(), schemaElements);
 }
 return schemaElements;
   }
 
+  /**
+   * Populate full path to SchemaElement map by recursively traversing schema 
elements referenced by the given iterator
+   *
+   * @param iter file schema values iterator
+   * @param path parent schema element path
+   * @param schemaElements schema elements map to insert next iterator element 
into
+   */
+  private static void addSchemaElementMapping(Iterator iter, StringBuilder 
path,
+  Map<String, SchemaElement> schemaElements) {
+SchemaElement se = (SchemaElement)iter.next();
+path.append('`').append(se.getName().toLowerCase()).append('`');
+schemaElements.put(path.toString(), se);
+
+int remainingChildren = se.getNum_children();
+
+while (remainingChildren > 0 && iter.hasNext()) {
 
 Review comment:
   `iterator.hasNext()` only tells whether there are more elements in the flat tree 
representation and is in itself not enough to maintain the correct recursion level. 
If we don't maintain/check the remaining-children count for the parent element, the 
schema
   `a`.`b`
   `a`.`c`
   will result in the following keys in the map:
   `a`
   `a`.`b`
   `a`.`b`.`c`
   since the function will erroneously enter recursion where it should not.
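
   To make the failure mode concrete, here is a minimal standalone sketch of the traversal (plain Java; `Node` is a hypothetical stand-in for `SchemaElement`, carrying just a name and a child count):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FlatSchemaDemo {
  // Hypothetical stand-in for SchemaElement: a name plus getNum_children().
  static final class Node {
    final String name;
    final int numChildren;
    Node(String name, int numChildren) { this.name = name; this.numChildren = numChildren; }
  }

  // Consume exactly `numChildren` subtrees per node. Recursing merely while
  // iter.hasNext() would treat the next sibling as a child and mis-nest it.
  static void addMapping(Iterator<Node> iter, String parentPath, Map<String, Node> out) {
    Node node = iter.next();
    String path = parentPath + "`" + node.name.toLowerCase() + "`";
    out.put(path, node);
    for (int remaining = node.numChildren; remaining > 0 && iter.hasNext(); remaining--) {
      addMapping(iter, path + ".", out);
    }
  }

  public static void main(String[] args) {
    // Flattened pre-order of the schema  a { b, c }  (root element already skipped).
    List<Node> flat = Arrays.asList(new Node("a", 2), new Node("b", 0), new Node("c", 0));
    Map<String, Node> paths = new LinkedHashMap<>();
    Iterator<Node> iter = flat.iterator();
    while (iter.hasNext()) {
      addMapping(iter, "", paths);
    }
    System.out.println(paths.keySet());  // [`a`, `a`.`b`, `a`.`c`]
  }
}
```

   Dropping the `remaining` counter and recursing while `iter.hasNext()` reproduces exactly the mis-nesting described above.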


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use Parquet new reader more often

2018-07-10 Thread GitBox
arina-ielchiieva commented on a change in pull request #1370: DRILL-5797: Use 
Parquet new reader more often
URL: https://github.com/apache/drill/pull/1370#discussion_r201345928
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ##
 @@ -140,15 +140,87 @@ public static int getIntFromLEBytes(byte[] input, int 
start) {
+
+Iterator iter = fileMetaData.getSchema().iterator();
+
+// skip first default 'root' element
+if (iter.hasNext()) {
 
 Review comment:
   I think then you may skip the first element, but please add a comment describing 
the reason.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

