[GitHub] drill pull request #1073: DRILL-5967: Fixed memory leak in OrderedPartitionS...

2017-12-22 Thread ilooner
Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1073#discussion_r158577033
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/partitionsender/PartitionSenderRootExec.java ---
@@ -341,6 +351,10 @@ public void close() throws Exception {
   updateAggregateStats();
   partitioner.clear();
 }
+
+if (closeIncoming) {
+  ((CloseableRecordBatch) incoming).close();
+}
--- End diff --

Hi Paul, I understand the FragmentExecutor closes all the operators. 
However, the FragmentExecutor is only aware of the operators in the tree that 
it has a reference to. In the case of the OrderedPartitionSenderCreator, an 
upstream record batch is wrapped in an OrderedPartitionRecordBatch. Since the 
OrderedPartitionRecordBatch is instantiated in the creator, the FragmentExecutor 
does not have a reference to it. So when the FragmentExecutor closes the 
operators, it closes the original operator but not the wrapping 
OrderedPartitionRecordBatch. This is an issue since the 
OrderedPartitionRecordBatch allocates VectorContainers which are 
consequently never released. This is clearly the source of the memory leak, 
and the fix was verified by QA.

There are two possible fixes.

A) We change the Creators in some way to communicate to the 
FragmentExecutor that they have wrapped an operator, so the FragmentExecutor 
can close the wrapped operator instead of the original one. My impression 
is that this would evolve into a fairly large change.

B) Or we take a less invasive approach and simply tell the 
PartitionSenderRootExec whether to close the wrapped operator (this is what I 
did; see the sketch below).
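
For illustration, here is a minimal, self-contained sketch of approach B; 
the class and field names are simplified stand-ins, not the actual Drill code:

    // Sketch of approach B: the creator that wraps an operator passes
    // closeIncoming = true so the sender closes the wrapper that the
    // FragmentExecutor never sees.
    class SenderRootExecSketch implements AutoCloseable {
      private final AutoCloseable incoming;  // e.g. the wrapping record batch
      private final boolean closeIncoming;   // true only when the creator wrapped the batch

      SenderRootExecSketch(AutoCloseable incoming, boolean closeIncoming) {
        this.incoming = incoming;
        this.closeIncoming = closeIncoming;
      }

      @Override
      public void close() throws Exception {
        // ... release the sender's own resources first ...
        if (closeIncoming) {
          incoming.close();  // releases the wrapper's VectorContainers
        }
      }
    }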

I agree approach B is not that great and that approach A is cleaner. But 
taking approach A is a more intensive project, and I was not keen on having 
another simple change explode into another 100-file PR. It would be easier to 
make that change after we've made more improvements to Drill unit testing.

Let me know what you think. I am not against taking approach A, but then 
we'll have to talk with Pritesh about fitting that project in.


---


[jira] [Created] (DRILL-6059) Apply needed StoragePlugins's RuleSet to the planner

2017-12-22 Thread weijie.tong (JIRA)
weijie.tong created DRILL-6059:
--

 Summary: Apply needed StoragePlugins's RuleSet to the planner
 Key: DRILL-6059
 URL: https://issues.apache.org/jira/browse/DRILL-6059
 Project: Apache Drill
  Issue Type: Improvement
Reporter: weijie.tong
Assignee: weijie.tong
Priority: Minor
 Fix For: 1.13.0


Currently, once Drill is configured with more than one StoragePlugin, it applies 
all of the plugins' rules to a user's query, even when the query does not 
involve the corresponding storage plugin. The reason is the method below in QueryContext:
{code:java}
  public StoragePluginRegistry getStorage() {
return drillbitContext.getStorage();
  }
{code}
 
From QueryContext's name, the method should return the storage plugin registry 
involved in the query, not all of the configured storage plugins.

So we need to identify the involved storage plugins at the parse stage and set 
the collected plugins on the QueryContext. This will also benefit future work 
on schema-level security control. Maybe a new method named getInvolvedStorage 
would be more accurate, as sketched below.
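
For illustration, a sketch of what the QueryContext side could look like; the 
field and method names below are assumptions for this proposal, not existing 
Drill APIs:
{code:java}
  // Hypothetical sketch: plugins discovered during parsing are recorded on
  // the QueryContext so the planner registers only their rules.
  private final Set<StoragePlugin> involvedPlugins = new HashSet<>();

  public void addInvolvedPlugin(StoragePlugin plugin) {
    involvedPlugins.add(plugin);
  }

  /** Returns only the storage plugins referenced by this query. */
  public Set<StoragePlugin> getInvolvedStorage() {
    return Collections.unmodifiableSet(involvedPlugins);
  }
{code}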







[jira] [Created] (DRILL-6058) Define per connection level OptionManager

2017-12-22 Thread weijie.tong (JIRA)
weijie.tong created DRILL-6058:
--

 Summary: Define per connection level OptionManager
 Key: DRILL-6058
 URL: https://issues.apache.org/jira/browse/DRILL-6058
 Project: Apache Drill
  Issue Type: Improvement
Reporter: weijie.tong
Assignee: weijie.tong
Priority: Minor
 Fix For: 1.13.0


We want to control the running frequency of some queries, which need to be 
identified by certain options.
One example use case: we want to limit how often download queries run while 
still allowing normal queries to proceed. To support this, we need to define a 
connection-level OptionManager, as sketched below.
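
A rough, self-contained sketch of the idea; the class and method names are 
illustrative, not the actual Drill OptionManager API:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a per-connection option layer that falls back to
// the session-level options when the connection has no explicit value.
class ConnectionOptionsSketch {
  private final Map<String, String> connectionValues = new HashMap<>();
  private final Map<String, String> sessionValues;  // fallback layer

  ConnectionOptionsSketch(Map<String, String> sessionValues) {
    this.sessionValues = sessionValues;
  }

  void setOption(String name, String value) {
    connectionValues.put(name, value);  // applies to this connection only
  }

  String getOption(String name) {
    String value = connectionValues.get(name);
    return value != null ? value : sessionValues.get(name);
  }
}
{code}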





[jira] [Created] (DRILL-6057) batch kill running queries from web UI

2017-12-22 Thread weijie.tong (JIRA)
weijie.tong created DRILL-6057:
--

 Summary: batch kill running queries from web UI
 Key: DRILL-6057
 URL: https://issues.apache.org/jira/browse/DRILL-6057
 Project: Apache Drill
  Issue Type: Improvement
  Components: Tools, Build & Test
Affects Versions: 1.13.0
 Environment: In some scenarios, Drill suffers from highly concurrent queries 
and becomes overloaded. As an administrator, you might want to quickly batch-kill 
all the running queries so the system can recover from the overload. Though we 
provide a web tool to kill running queries one by one, that is not efficient in 
this case.

So this issue will provide a web button that lets an administrator kill all the 
running queries at once.

Reporter: weijie.tong
Assignee: weijie.tong
Priority: Minor








[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...

2017-12-22 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/1060#discussion_r158549859
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java ---
@@ -0,0 +1,215 @@

+/***
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ **/
+package org.apache.drill.exec.store.parquet.columnreaders;
+
+import java.nio.ByteBuffer;
+
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.ColumnPrecisionInfo;
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.PageDataInfo;
+import org.apache.drill.exec.util.MemoryUtils;
+
+/** Abstract class for sub-classes implementing several algorithms for loading a Bulk Entry */
+abstract class VLAbstractEntryReader {
+
+  /** byte buffer used for buffering page data */
+  protected final ByteBuffer buffer;
+  /** Page Data Information */
+  protected final PageDataInfo pageInfo;
+  /** expected precision type: fixed or variable length */
+  protected final ColumnPrecisionInfo columnPrecInfo;
+  /** Bulk entry */
+  protected final VLColumnBulkEntry entry;
+
+  /**
+   * CTOR.
+   * @param _buffer byte buffer for data buffering (within CPU cache)
+   * @param _pageInfo page being processed information
+   * @param _columnPrecInfo column precision information
+   * @param _entry reusable bulk entry object
+   */
+  VLAbstractEntryReader(ByteBuffer _buffer,
+                        PageDataInfo _pageInfo,
+                        ColumnPrecisionInfo _columnPrecInfo,
+                        VLColumnBulkEntry _entry) {
+
+this.buffer = _buffer;
+this.pageInfo   = _pageInfo;
+this.columnPrecInfo = _columnPrecInfo;
+this.entry  = _entry;
+  }
+
+  /**
+   * @param valsToReadWithinPage maximum values to read within the current page
+   * @return a bulk entry object
+   */
+  abstract VLColumnBulkEntry getEntry(int valsToReadWithinPage);
+
+  /**
+   * Indicates whether to use bulk processing
+   */
+  protected final boolean bulkProcess() {
+return columnPrecInfo.bulkProcess;
+  }
+
+  /**
+   * Loads new data into the buffer if it is empty or the force flag is set.
+   *
+   * @param force flag to force loading new data into the buffer
+   * @return true if the buffer contains data to process; false if no page data remains
+   */
+  protected final boolean load(boolean force) {
+
+if (!force && buffer.hasRemaining()) {
+  return true; // NOOP
+}
+
+// We're here either because the buffer is empty or we want to force a new load operation.
+// In the case of force, there might be unprocessed data (still in the buffer) which is fine
+// since the caller updates the page data buffer's offset only for the data it has consumed;
+// this means unread data will be loaded again but this time will be positioned in the
+// beginning of the buffer. This can happen only for the last entry in the buffer when
+// either its length or value is incomplete.
+buffer.clear();
+
+int remaining = remainingPageData();
+int toCopy    = remaining > buffer.capacity() ? buffer.capacity() : remaining;
+
+if (toCopy == 0) {
+  return false;
+}
+
+pageInfo.pageData.getBytes(pageInfo.pageDataOff, buffer.array(), buffer.position(), toCopy);
--- End diff --

So seriously, this is faster? I would have expected the copy from direct to 
Java heap memory to be a big issue. There are HDFS APIs to read into a ByteBuffer 
(not a DirectByteBuffer) that we could leverage to reduce the memory copies 
between direct memory and the Java heap.
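
For reference, a minimal sketch of the kind of API being referred to here 
(Hadoop's ByteBufferReadable, which FSDataInputStream implements); whether a 
given stream actually supports it depends on the underlying file system, so 
this is illustrative rather than a drop-in change:

    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class ByteBufferReadSketch {
      // Reads file bytes directly into a (heap) ByteBuffer, avoiding a
      // separate direct-to-heap copy. Streams that do not support
      // ByteBufferReadable throw UnsupportedOperationException, so a
      // fallback path is still required.
      static int readIntoBuffer(FileSystem fs, Path file, ByteBuffer buffer) throws IOException {
        try (FSDataInputStream in = fs.open(file)) {
          return in.read(buffer);  // fills buffer from position() up to limit()
        }
      }
    }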


---


[jira] [Created] (DRILL-6056) Mock datasize could overflow to negative

2017-12-22 Thread Chunhui Shi (JIRA)
Chunhui Shi created DRILL-6056:
--

 Summary: Mock datasize could overflow to negative
 Key: DRILL-6056
 URL: https://issues.apache.org/jira/browse/DRILL-6056
 Project: Apache Drill
  Issue Type: Task
Reporter: Chunhui Shi


In some cases, the mock data size (rowCount * rowWidth) could be too large and 
overflow to a negative value, especially when we test spilling or memory OOB 
exceptions.
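
A quick, self-contained illustration of the arithmetic (not Drill code): the 
product of two ints is computed in 32-bit arithmetic, so it wraps to a negative 
value before any widening happens; doing the multiplication in long avoids that.
{code:java}
public class OverflowDemo {
  public static void main(String[] args) {
    int rowCount = 1_000_000;
    int rowWidth = 4_096;

    // int * int wraps around 2^31 - 1: prints -198967296
    int badSize = rowCount * rowWidth;

    // widening one operand first keeps the full 64-bit result: 4096000000
    long goodSize = (long) rowCount * rowWidth;

    System.out.println(badSize + " vs " + goodSize);
  }
}
{code}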





[jira] [Created] (DRILL-6055) Session Multiplexing

2017-12-22 Thread Chunhui Shi (JIRA)
Chunhui Shi created DRILL-6055:
--

 Summary: Session Multiplexing
 Key: DRILL-6055
 URL: https://issues.apache.org/jira/browse/DRILL-6055
 Project: Apache Drill
  Issue Type: Task
Reporter: Chunhui Shi


We could allow one connection to carry multiple user sessions.





[jira] [Created] (DRILL-6054) Issues in FindPartitionConditions

2017-12-22 Thread Chunhui Shi (JIRA)
Chunhui Shi created DRILL-6054:
--

 Summary: Issues in FindPartitionConditions
 Key: DRILL-6054
 URL: https://issues.apache.org/jira/browse/DRILL-6054
 Project: Apache Drill
  Issue Type: Bug
Reporter: Chunhui Shi
Assignee: Chunhui Shi


Partition pruning is not handled correctly when the condition takes these forms:
b = 3 OR (dir0 = 1 and a = 2)
not (dir0 = 1 AND b = 2)
In both cases the dir0 predicate is not a standalone top-level conjunct, so it 
cannot safely be extracted as a pruning condition; for example, pruning to 
dir0 = 1 in the first case would drop rows in other directories that match b = 3.





[jira] [Created] (DRILL-6053) Avoid excessive locking in LocalPersistentStore

2017-12-22 Thread Vlad Rozov (JIRA)
Vlad Rozov created DRILL-6053:
-

 Summary: Avoid excessive locking in LocalPersistentStore
 Key: DRILL-6053
 URL: https://issues.apache.org/jira/browse/DRILL-6053
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Vlad Rozov
Assignee: Vlad Rozov


When query profiles are written to the LocalPersistentStore, the writes are 
unnecessarily serialized due to the read/write lock that was introduced for the 
versioned PersistentStore. Only versioned access needs to be protected by the 
read/write lock.
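
A minimal sketch of the intended split, with illustrative names rather than the 
actual LocalPersistentStore code: the read/write lock guards only the versioned 
path, while plain writes rely on a concurrent map.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class StoreSketch {
  private final ConcurrentMap<String, byte[]> data = new ConcurrentHashMap<>();
  private final ReadWriteLock versionLock = new ReentrantReadWriteLock();
  private int version;

  // Unversioned write (e.g. a query profile): no lock needed.
  void put(String key, byte[] value) {
    data.put(key, value);
  }

  // Versioned write: must be serialized against versioned readers.
  void putVersioned(String key, byte[] value) {
    versionLock.writeLock().lock();
    try {
      data.put(key, value);
      version++;
    } finally {
      versionLock.writeLock().unlock();
    }
  }
}
{code}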





[GitHub] drill issue #1040: DRILL-5425: Support HTTP Kerberos auth using SPNEGO

2017-12-22 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1040
  
@sohami thanks for making the changes. +1, LGTM.


---


[GitHub] drill pull request #1071: DRILL-6028: Allow splitting generated code in Chai...

2017-12-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1071


---


[GitHub] drill pull request #1061: DRILL-5996: Add ability to re-run queries from Pro...

2017-12-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1061


---


[GitHub] drill issue #1071: DRILL-6028: Allow splitting generated code in ChainedHash...

2017-12-22 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1071
  
@paul-rogers thanks for the code review. Your approach seems worth 
trying. I have created a Jira for the enhancement [1].

[1] https://issues.apache.org/jira/browse/DRILL-6052


---


[jira] [Created] (DRILL-6052) Add hash(seed) method to each value vector

2017-12-22 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-6052:
---

 Summary: Add hash(seed) method to each value vector
 Key: DRILL-6052
 URL: https://issues.apache.org/jira/browse/DRILL-6052
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.12.0
Reporter: Arina Ielchiieva


As part of DRILL-6028 we had to enhance the ChainedHashTable code generation to 
allow method splitting. However, as [~Paul.Rogers] has proposed, there is an 
alternative way to reduce the amount of generated code and improve performance:
{quote}
Thanks much for the example files and explanation for the need to hash.

The improvements look good. I wonder, however, if the code gen approach is 
overkill. There is exactly one allowable hash method per type. (Has to be the 
same for all queries to get reliable results.)

Here, we are generating code to do the work of:
- Bind to all vectors.
- Get a value out of the vector into a holder.
- Pass the value to the proper hash function.
- Combine the results.

The result is a huge amount of code to generate. The gist of this bug is that, 
when the number of columns becomes large, we generate so much code that we have 
to take extra steps to manage it. And, of course, compiling, caching and 
loading the code takes time.

As something to think about for the future, this entire mechanism can be 
replaced with a much simpler one:
- Add a hash(seed) method to each value vector.
- Iterate over the list of vectors.
- Call the hash() method on each vector.
- Combine the results.

Tradeoffs?

- The proposed method has no setup cost. It is, instead, an "interpreted" 
approach. The old method has a large setup cost.
- The proposed method must make a "virtual" call into each vector to get the 
value and hash it using the pre-coded, type-specific function. The old method 
makes a direct call to get the value in a holder, then a direct call to the 
hash method.
- The proposed method is insensitive to the number of columns (other than that 
it increases the size of the column loop). The old method needs special care to 
handle the extra code.
- The proposed method would be easy to test to see which is more efficient: 
(large code generation + direct method calls) vs. (no code generation and 
virtual method calls).

My money is on the new method as it eliminates the holders, sloshing variables 
around and so on. The JIT can optimize the "pre-coded" methods once and for all 
early in the Drillbit run rather than having to re-optimize the (huge) 
generated methods per query.

The improvement is not necessary for this PR, but is something to think about. 
@Ben-Zvi may need something similar for the hash join to avoid generating 
query-specific key hashes. In fact, since hashing is used in many places 
(exchanges, hash agg, etc.), we might get quite a nice savings in time and code 
complexity by slowly moving to the proposed model.
{quote}

This Jira aims to apply the proposed solution.
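
For illustration, a self-contained sketch of the proposed "interpreted" 
approach; the hash(seed, index) method and the interface below are hypothetical 
stand-ins for whatever would actually be added to the value vector classes:
{code:java}
// Hypothetical sketch: each vector hashes its own value at a given row
// with a pre-coded, type-specific function, so the hash table just loops
// over the vectors instead of generating per-query code.
interface HashableVector {
  int hash(int seed, int index);  // hash of the value at row 'index'
}

class RowHasher {
  private final HashableVector[] keyVectors;

  RowHasher(HashableVector[] keyVectors) {
    this.keyVectors = keyVectors;
  }

  /** Combines per-column hashes for one row via virtual calls. */
  int hashRow(int index) {
    int hash = 0;
    for (HashableVector v : keyVectors) {
      hash = v.hash(hash, index);  // previous hash feeds in as the seed
    }
    return hash;
  }
}
{code}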


