[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184807153 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- I see; I will then fix any such occurrences when opportunity presents itself as I have seen both patterns in the Drill code base. ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237 That was not my intention as my current change aimed at describing the system the way it is. @parthchandra, any feedback? ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237

@vrozov,

**What are we trying to solve / improve**
- Drill currently does not properly report memory held in a Fragment's receive queues
- This makes it hard to analyze OOM conditions

This is what I want to see:
- Every operator reporting the resources it is currently using (or needs)
- Fragment-held resources (other than the ones already reported by the child operators)
- Drillbit-level resources (metadata caches, web server, ...)
- I am OK with reaching this goal incrementally

**Data Exchange Logistics**
- Ideally, the data exchange fabric should be decoupled from the Drill Receive / Send operators
- The fabric should handle all aspects of pre-fetching, back-pressure, and so forth
- It will tune itself to the speed of producers / consumers writing to / reading from it
- This infrastructure should have its own resource management and reporting capabilities

**Operator-based Reporting**
- Receive and Send operators should not worry about batches they haven't consumed yet
- Doing so is counterproductive, as the data exchange fabric will interpret a "drain" operation as the operator "needing" more data
- For example, the merge-receiver should not be managing the receive queues; it should only advertise its pattern of data consumption and let the exchange fabric figure out the rest

The main difference between the two approaches is that, essentially, you are advocating for the Receive and Send operators to control all aspects of communication, whereas I am advocating for decoupling those aspects (a sketch of such an interface follows below). ---
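The decoupling being argued for here can be made concrete with a small interface sketch. Everything below is hypothetical — none of these types exist in Drill — and only illustrates an operator advertising its consumption pattern while the fabric owns queuing, prefetch, back-pressure, and reporting:

```java
// Hypothetical sketch of a decoupled data exchange fabric; all names invented.
interface DataExchangeFabric {

  // Consumption patterns an operator can advertise; the fabric tunes its
  // prefetching strategy accordingly instead of the operator draining queues.
  enum ConsumptionPattern { SEQUENTIAL, MERGE_BY_KEY }

  // The receive operator only asks for a stream; queuing and back-pressure
  // stay inside the fabric.
  BatchStream openReceiveStream(int senderFragmentId, ConsumptionPattern pattern);

  // Fabric-level resource reporting, independent of any operator's stats.
  long queuedBytes(int senderFragmentId);

  interface BatchStream extends AutoCloseable {
    RawBatch next() throws InterruptedException; // blocking; null at end-of-stream

    @Override
    void close();
  }

  // Placeholder for a received batch (header + body buffer).
  final class RawBatch {}
}
```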
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184747702

--- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java ---

```diff
@@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) {
       target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name);
     }

-    boolean overlimit = target.allocator.forceAllocate(size);
+    // Release first to handle the case where the current and target allocators were part of the same
+    // parent / child tree.
     allocator.releaseBytes(size);
+    boolean allocationFit = target.allocator.forceAllocate(size);
```

--- End diff --

- The change of order is an optimization for the parent / child relationship: if we don't release first, then we could unnecessarily go over the memory budget (double counting).
- The force-alloc() / free() failures should never happen under normal conditions; when they do, the best thing to do is to exit. I still prefer not to promote the target allocator until it is 100% successful. ---
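The double-counting argument can be illustrated with a toy accounting model. A minimal sketch, assuming a single budget shared by the source and target allocators; SimpleAccountant is invented for illustration and is not Drill's Accountant:

```java
// Toy model of the release-before-allocate ordering; names are illustrative.
final class TransferOrderSketch {

  static final class SimpleAccountant {
    long reserved;          // bytes currently charged to this allocator
    final long limit;       // budget shared through a common parent
    SimpleAccountant(long limit) { this.limit = limit; }

    void releaseBytes(long size) { reserved -= size; }

    // Charges unconditionally; reports whether we stayed within the budget.
    boolean forceAllocate(long size) {
      reserved += size;
      return reserved <= limit;
    }
  }

  // Transferring `size` bytes between allocators drawing on the same budget:
  // releasing first keeps the combined charge constant, whereas allocating
  // first would transiently double-count `size` and could trip a spurious
  // over-limit when source and target share a parent / child tree.
  static boolean transferBalance(SimpleAccountant from, SimpleAccountant to, long size) {
    from.releaseBytes(size);
    return to.forceAllocate(size);
  }
}
```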
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184727914 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- Ok good point, as I have seen both practices being done within the Drill code. Though, I don't think this is a big deal as I don't see startWait() failing as it merely invokes nano time. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184730050

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java ---

```diff
@@ -77,4 +83,46 @@ public long getByteCount() {

   public boolean isAckSent() {
     return ackSent.get();
   }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory
+   * accounting (that is, the operator should be charged with the body's Drillbuf memory).
+   *
+   * NOTES -
+   *
+   * This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner
+   * When transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper
+   * DrillBuf reference count accounting
+   * The RPC handling code caches a reference to this RawFragmentBatch object instance; release()
+   * calls should be routed to the previous DrillBuf
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has
+   *         been switched to the target allocator); otherwise this operation is a NOOP (current instance
+   *         returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) {
+    if (body == null) {
+      return this; // NOOP
+    }
+
+    if (!body.getLedger().isOwningLedger()
+        || body.getLedger().isOwner(targetAllocator)) {
+
+      return this;
+    }
+
+    int writerIndex = body.writerIndex();
+    TransferResult transferResult = body.transferOwnership(targetAllocator);
+
+    // Set the index and increment reference count
+    transferResult.buffer.writerIndex(writerIndex);
+
+    // Clear the current Drillbuffer since caller will perform release() on the new one
+    body.release();
+
+    return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false);
```

--- End diff --

We can take up such an enhancement as part of another JIRA, as any changes within the RPC layer have to be thoroughly tested. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184728292 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 --- End diff -- Ignore this comment as I thought you were releasing the returned batch. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184726839

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---

```diff
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
```

--- End diff --

The point of this pattern is that anyone who continues using this object (e.g., when extending the current code) is forced to be aware of what can and cannot be used after the release. ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237

@vrozov, your observation is valid; we need more JIRAs to fix the reporting problem.

**Current Fix**
- Until now, the UnorderedReceiver didn't account for any consumed memory
- This fix taxes the operator only when it consumes buffers

**Potential Enhancements**

Solution I (see the sketch after this comment)
- Create a new fragment child allocator which will own the received (not yet consumed) batches
- Improve the UI to report this allocator's size
- This solution is simple and complementary to the current work

Solution II
- Have the receiver operator drain the batch queue
- Essentially, the receiver will have a private queue from which to consume batches
- I personally don't like this solution, as it prevents us from improving the Drill network protocol
- Draining the batches for the sake of reporting will make it harder for the network layer to prefetch batches in an optimal manner; the queue size should be an indicator of how many pending batches there are. ---
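Solution I could be prototyped along the lines below. This is a rough sketch: it assumes Drill's BufferAllocator exposes newChildAllocator(name, initReservation, maxAllocation) and getAllocatedMemory(); the class name and wiring are invented:

```java
import org.apache.drill.exec.memory.BufferAllocator;

// Rough sketch of Solution I: a dedicated child allocator owns batches that
// were received but not yet consumed, so that "held in receive queue" memory
// can be reported separately in the UI. Class name and wiring are invented.
final class ReceiveQueueAccounting implements AutoCloseable {
  private final BufferAllocator queueAllocator;

  ReceiveQueueAccounting(BufferAllocator fragmentAllocator) {
    // Assumed signature: newChildAllocator(name, initReservation, maxAllocation).
    this.queueAllocator =
        fragmentAllocator.newChildAllocator("receive-queue", 0, Long.MAX_VALUE);
  }

  // Queued-but-unconsumed bytes, as the UI would report them.
  long queuedBytes() {
    return queueAllocator.getAllocatedMemory();
  }

  @Override
  public void close() {
    queueAllocator.close();
  }
}
```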
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184572864

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---

```diff
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
```

--- End diff --

Why is that? This method is used once, and doing so would make us duplicate the exception-handling code. I would appreciate it if you provided reason(s) when making suggestions, to avoid the back and forth. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184575686 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 + && (!first || batch.getHeader().getDef().getFieldCount() == 0)) { +batch = getNextBatch(); + } +} finally { + stats.stopWait(); +} +return batch; + } + @Override public IterOutcome next() { batchLoader.resetRecordCount(); stats.startProcessing(); -try{ - RawFragmentBatch batch; - try { -stats.startWait(); -batch = getNextBatch(); -// skip over empty batches. we do this since these are basically control messages. -while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 -&& (!first || batch.getHeader().getDef().getFieldCount() == 0)) { - batch = getNextBatch(); -} - } finally { -stats.stopWait(); - } +RawFragmentBatch batch = null; +try { + batch = getNextNotEmptyBatch(); first = false; if (batch == null) { --- End diff -- Sure. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184574622 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -201,6 +208,11 @@ public IterOutcome next() { context.getExecutorState().fail(ex); return IterOutcome.STOP; } finally { + + if (batch != null) { +batch.release(); +batch = null; --- End diff -- This pattern ensures that after release the batch cannot be accessed anymore (e.g., additions to the current code). ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184574405 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); + batch = getNextBatch(); + + // skip over empty batches. we do this since these are basically control messages. + while (batch != null && batch.getHeader().getDef().getRecordCount() == 0 --- End diff -- Sure. FYI - We cannot release the batch within this method. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184573646 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException { } } + private RawFragmentBatch getNextNotEmptyBatch() throws IOException { +RawFragmentBatch batch; +try { + stats.startWait(); --- End diff -- Sure, but does it really matter or is it your personal preference? ---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1237 @vrozov, I have implemented the provided suggestions. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184197379 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -153,8 +153,10 @@ private RawFragmentBatch getNextBatch() throws IOException { public IterOutcome next() { batchLoader.resetRecordCount(); stats.startProcessing(); + +RawFragmentBatch batch = null; try{ - RawFragmentBatch batch; + --- End diff -- will do. Thanks for the suggestion. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184197278

--- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java ---

```diff
@@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) {
       target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name);
    }

-    boolean overlimit = target.allocator.forceAllocate(size);
+    // Release first to handle the case where the current and target allocators were part of the same
+    // parent / child tree.
     allocator.releaseBytes(size);
+    boolean allocationFit = target.allocator.forceAllocate(size);
```

--- End diff --

What about debugging?
- An operation fails
- And yet there is a change of ownership

@vrozov, I would rather not make this change, as I strongly believe the change of ownership should happen only on success. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184192630

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java ---

```diff
@@ -77,4 +83,46 @@ public long getByteCount() {

   public boolean isAckSent() {
     return ackSent.get();
   }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory
+   * accounting (that is, the operator should be charged with the body's Drillbuf memory).
+   *
+   * NOTES -
+   *
+   * This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner
+   * When transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper
+   * DrillBuf reference count accounting
+   * The RPC handling code caches a reference to this RawFragmentBatch object instance; release()
+   * calls should be routed to the previous DrillBuf
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has
+   *         been switched to the target allocator); otherwise this operation is a NOOP (current instance
+   *         returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) {
+    if (body == null) {
+      return this; // NOOP
+    }
+
+    if (!body.getLedger().isOwningLedger()
+        || body.getLedger().isOwner(targetAllocator)) {
+
+      return this;
+    }
+
+    int writerIndex = body.writerIndex();
+    TransferResult transferResult = body.transferOwnership(targetAllocator);
+
+    // Set the index and increment reference count
+    transferResult.buffer.writerIndex(writerIndex);
+
+    // Clear the current Drillbuffer since caller will perform release() on the new one
+    body.release();
+
+    return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false);
```

--- End diff --

Look at IncomingBuffers.batchArrived:
- It holds a reference to the RawFragmentBatch
- It holds a lock on the parent collector for insertion
- However, the collector enqueue method itself is not synchronized on the "this" object
- Instead, it relies on the synchronization provided as part of the RawBatchBuffer queue
- This means a getNext() call will be able to dequeue a RawFragmentBatch instance
- Concurrently, IncomingBuffers.batchArrived will invoke release on the cached reference ---
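The race described here is generic Java concurrency, not Drill-specific, and can be reproduced with a plain BlockingQueue: a lock held by the producer around the enqueue does not stop a consumer from dequeuing, because the queue synchronizes internally. Illustrative sketch (names invented):

```java
import java.util.concurrent.LinkedBlockingQueue;

// Illustration of the race: the producer's collector lock does not serialize
// against consumers, because the queue provides its own synchronization.
final class CollectorRaceSketch {
  private final LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>();
  private final Object collectorLock = new Object();

  void batchArrived(Object batch) {
    synchronized (collectorLock) {
      queue.offer(batch);
      // A consumer blocked in getNext() can wake up and use `batch` right
      // here, while this thread still holds collectorLock and still holds
      // its own (cached) reference to `batch` that it will release later.
    }
  }

  Object getNext() throws InterruptedException {
    return queue.take(); // does not contend on collectorLock
  }
}
```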
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184151186

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---

```diff
@@ -182,13 +184,18 @@ public IterOutcome next() {
         return IterOutcome.OUT_OF_MEMORY;
       }

+      // Transfer the ownership of this raw-batch to this operator for proper memory statistics reporting
+      batch = batch.transferBodyOwnership(oContext.getAllocator());
+
       final RecordBatchDef rbd = batch.getHeader().getDef();
       final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
       // TODO:  Clean:  DRILL-2933:  That load(...) no longer throws
       // SchemaChangeException, so check/clean catch clause below.
       stats.addLongStat(Metric.BYTES_RECEIVED, batch.getByteCount());
       batch.release();
+      batch = null;
```

--- End diff --

@vrozov My bad, the highlight was on the "batch = null" code; I guess you meant: why can't we move the whole release logic within the finally block? I agree with your proposal, as delaying the release phase a bit doesn't hurt us in this case. I will make the change. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184117938

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---

```diff
@@ -182,13 +184,18 @@ public IterOutcome next() {
         return IterOutcome.OUT_OF_MEMORY;
       }

+      // Transfer the ownership of this raw-batch to this operator for proper memory statistics reporting
+      batch = batch.transferBodyOwnership(oContext.getAllocator());
+
       final RecordBatchDef rbd = batch.getHeader().getDef();
       final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
       // TODO:  Clean:  DRILL-2933:  That load(...) no longer throws
       // SchemaChangeException, so check/clean catch clause below.
       stats.addLongStat(Metric.BYTES_RECEIVED, batch.getByteCount());
       batch.release();
+      batch = null;
```

--- End diff --

Not sure what you mean, but this is the goal of the current code:
- After a batch is properly set, we need to decrease the ref count by one by the end of the next() method
- If an exception happens before the release call, then the finally block will be able to release the batch, since the reference will be non-null
- Otherwise, the release will be performed and the batch set to null, which disables the release within the finally block ---
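The release-then-null idiom described here can be summarized in isolation. A minimal sketch with a hypothetical Releasable type (not Drill's RawFragmentBatch):

```java
// Sketch of the release pattern: the happy path releases and nulls the local
// reference, so the finally block releases only when an exception skipped it.
final class ReleaseOnceSketch {
  interface Releasable { void release(); }

  static void consume(Releasable batch) {
    try {
      // ... load vectors, record stats, etc. ...
      batch.release();  // normal-path release
      batch = null;     // disables the safety-net release below
    } finally {
      if (batch != null) {
        batch.release(); // reached only when an exception occurred first
      }
    }
  }
}
```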
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184140733 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java --- @@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) { target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name); } -boolean overlimit = target.allocator.forceAllocate(size); +// Release first to handle the case where the current and target allocators were part of the same +// parent / child tree. allocator.releaseBytes(size); +boolean allocationFit = target.allocator.forceAllocate(size); --- End diff -- What if a runtime exception is thrown during forceAllocate(...)? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184112400 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -153,8 +153,10 @@ private RawFragmentBatch getNextBatch() throws IOException { public IterOutcome next() { batchLoader.resetRecordCount(); stats.startProcessing(); + +RawFragmentBatch batch = null; --- End diff -- The release logic is a NOOP if the body is null; the while loop is to guard against an empty batch. ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184138876

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java ---

```diff
@@ -77,4 +83,46 @@ public long getByteCount() {

   public boolean isAckSent() {
     return ackSent.get();
   }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory
+   * accounting (that is, the operator should be charged with the body's Drillbuf memory).
+   *
+   * NOTES -
+   *
+   * This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner
+   * When transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper
+   * DrillBuf reference count accounting
+   * The RPC handling code caches a reference to this RawFragmentBatch object instance; release()
+   * calls should be routed to the previous DrillBuf
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has
+   *         been switched to the target allocator); otherwise this operation is a NOOP (current instance
+   *         returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) {
+    if (body == null) {
+      return this; // NOOP
+    }
+
+    if (!body.getLedger().isOwningLedger()
+        || body.getLedger().isOwner(targetAllocator)) {
+
+      return this;
+    }
+
+    int writerIndex = body.writerIndex();
+    TransferResult transferResult = body.transferOwnership(targetAllocator);
+
+    // Set the index and increment reference count
+    transferResult.buffer.writerIndex(writerIndex);
+
+    // Clear the current Drillbuffer since caller will perform release() on the new one
+    body.release();
+
+    return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false);
```

--- End diff --

That was my original code, and it failed miserably:
- The RPC code has references to both the DrillBuf and the RawFragmentBatch
- This means we need to ensure that a release call is routed to the right DrillBuf (otherwise, the reference count logic stops working)
- Creating a new RawFragmentBatch instance essentially provided just that (proper reference count accounting) ---
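The reference-count routing can be modeled with a toy ledger. A hypothetical sketch (none of these types are Drill's) of why the transfer hands back a new batch instance: the RPC layer releases the buffer view it cached, the operator releases the view inside the new batch, and the underlying memory is freed only after both:

```java
// Toy ledger model of routing release() calls to the right buffer view.
final class RefCountSketch {

  static final class Ledger {
    int refCount;
    void free() { /* return the underlying memory */ }
  }

  // A buffer "view": each view claims one reference on the shared ledger.
  static final class Buf {
    final Ledger ledger;
    Buf(Ledger ledger) { this.ledger = ledger; ledger.refCount++; }
    void release() { if (--ledger.refCount == 0) { ledger.free(); } }
  }

  static final class Batch {
    final Buf body;
    Batch(Buf body) { this.body = body; }
  }

  // Mint a new view over the same ledger and wrap it in a NEW batch: the RPC
  // layer's cached release() still targets the old view, the operator's
  // release() targets the new one, and each decrements the ledger once.
  static Batch transferBodyOwnership(Batch rpcCached) {
    Buf transferred = new Buf(rpcCached.body.ledger);
    return new Batch(transferred);
  }
}
```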
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r184114148 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java --- @@ -153,8 +153,10 @@ private RawFragmentBatch getNextBatch() throws IOException { public IterOutcome next() { batchLoader.resetRecordCount(); stats.startProcessing(); + +RawFragmentBatch batch = null; try{ - RawFragmentBatch batch; + --- End diff -- Can you clarify your ask? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1237#discussion_r183898352

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---

```diff
@@ -182,13 +184,18 @@ public IterOutcome next() {
         return IterOutcome.OUT_OF_MEMORY;
       }

+      // Transfer the ownership of this raw-batch to this operator for proper memory statistics reporting
+      batch = batch.transferBodyOwnership(oContext.getAllocator());
```

--- End diff --

@HanumathRao,
- This will not affect the outcome, since the unordered-receiver is a child of the context allocator (the minor fragment)
- Doing the check earlier seemed cleaner, as there is no point in changing ownership when there is already an out-of-memory condition
- I think the code was written this way because, implicitly, the network handlers are receiving the batch and then performing a change of ownership (from RPC to context allocator); this step could lead to an out-of-memory condition at the fragment level
- If your point is about reporting, that is, attributing the OOM condition to the child operator, then I guess I don't have a problem with that. @parthchandra, what do you think? ---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1237

DRILL-6348: Fixed code so that Unordered Receiver reports its memory …

**Problem Description**
- The Unordered Receiver doesn't report any memory usage because the RPC infrastructure assigned ownership of the allocated memory to the minor fragment container

**Proposed Fix**
- Modified the code to change the received DrillBuf memory ownership from the parent fragment to the operator (as discussed with @parthchandra)
- Made sure that the buffer accounting logic is preserved

@parthchandra, can you please review this pull request? Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6348

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1237

commit 9ab501a3fc3ebbda44c1478ad9db6711192b8ca1
Author: Salim Achouche <sachouche2@...>
Date: 2018-04-23T01:02:35Z

    DRILL-6348: Fixed code so that Unordered Receiver reports its memory usage ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060

@parthchandra, @vrozov, I have made the following modifications:
- Renamed the newly added files with the prefix "VL" to "VarLen", as suggested by @parthchandra
- After talking offline with @vrozov about his objections to MemoryUtils, and also based on previous feedback from @parthchandra, I have a) deleted this utility, b) exposed the needed functionality under the class DrillBuf (using Netty APIs), and c) used the same configuration to control the checks.

@parthchandra, @vrozov, please review this [document](https://docs.google.com/document/d/1BSNem_ItP-Vxlr6auSP_iwwOLM9rwWZYxGwCsXi-IE8/edit?usp=sharing) and the associated JMH code [here](https://github.com/sachouche/drill-jmh). This should provide you with performance benchmarks regarding the bulk approach that I have used in this PR (please focus on test-3, which is Parquet-specific); I will be adding more fine-grained Memory Access tests (requested by @vrozov). ---
[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1168

> @sachouche have you had a chance to test the generated JDBC driver with Spotfire/SQuirreL?

No I didn't.

Regards, Salim ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r179296101 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/MemoryUtils.java --- @@ -0,0 +1,199 @@ +/** --- End diff -- Vlad, can you please explain what you meant? are you only referring for specific members or is it a general comment? ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r179295602 --- Diff: exec/java-exec/pom.xml --- @@ -836,6 +836,14 @@ org.apache.maven.plugins maven-surefire-plugin + +org.apache.maven.plugins +maven-compiler-plugin +2.3.2 + +-XDignore.symbol.file --- End diff -- Will do! ---
[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1144 I have added a comment within this PR associated JIRA: [DRILL-6202](https://issues.apache.org/jira/browse/DRILL-6202) ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060

Parth,

- I have attached, within DRILL-5846, two profiles with the latest Apache code and this PR (bounds checks are off):
  o Used one thread in each run
  o I observe a ~3x performance difference when the new logic is turned on
  o The difference is 4x if I include the implicit column optimization (which is not part of this PR)
  o The impact of the new optimizations can be felt when there are many variable-length columns
- The rationale for getting this PR approved:
  o The optimizations that I have included are local to the flat Parquet reader (encapsulated)
  o The logic is backward compatible and turned off by default
  o I have added the new batch sizing functionality on top of this PR (columnar processing pattern)
  o The result of DRILL-6301 would only result in a local refactoring step
  o Not being able to add the new code results in a substantial maintenance overhead ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060 @parthchandra and @vrozov can you please let me know whether you are ok with the changes. Thanks! ---
[GitHub] drill issue #1170: DRILL-6223: Fixed several Drillbit failures due to schema...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1170 @parthchandra and @paul-rogers, I have added a comment within the Jira [DRILL-6223](https://issues.apache.org/jira/browse/DRILL-6223); please let me know what you think. Thanks! ---
[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1168 I successfully ran the test-suite and some advanced tests which loaded the jdbc artifacts. Is this enough to validate the fix (and future exclusions to keep the jar size in-check)? if not, what other tests(s) do we need to run? Thanks! ---
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060

I have updated this pull request with the following changes:
- Excluded the implicit column optimizations from this pull request (they will be included as part of another Drill JIRA)
- Tuned a threshold (16 bytes vs 1024) on when to use "putByte() vs copyMemory()", based on @vrozov's findings when benchmarking using JMH
- There were lingering questions on the use of byte arrays for intermediary processing; after a discussion with @vrozov, we agreed to tackle such an analysis within a new JIRA ([DRILL-6301](https://issues.apache.org/jira/browse/DRILL-6301?filter=-1)) so that Drill customers can immediately reap the benefits of the current enhancements
- Moved the newly added MemoryUtils class to the Parquet flat reader project (made it package private); this is temporary until DRILL-6301 is resolved. ---
[GitHub] drill pull request #1192: DRILL-6299: Fixed a filter pushed down issue when ...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1192

DRILL-6299: Fixed a filter pushed down issue when a column doesn't ha…

This bug happens when the isNull predicate is applied on a column without statistics. @arina-ielchiieva can you please review this pull request? Thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6299

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1192.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1192

commit bb742f61673d0b64c34bfab9de01ffb2968b472c
Author: Salim Achouche <sachouche2@...>
Date: 2018-03-28T19:08:25Z

    DRILL-6299: Fixed a filter pushed down issue when a column doesn't have stats ---
[GitHub] drill issue #1170: DRILL-6223: Fixed several Drillbit failures due to schema...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1170 @parthchandra can you also please review this PR? Thanks! ---
[GitHub] drill issue #1170: DRILL-6223: Fixed several Drillbit failures due to schema...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1170 @amansinha100 can you please review this pull request? Thanks! ---
[GitHub] drill pull request #1170: DRILL-6223: Fixed several Drillbit failures due to...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1170

DRILL-6223: Fixed several Drillbit failures due to schema changes

Fixed several issues due to schema changes:

1) Changes in complex data types
- A Drill query fails when selecting all columns from a complex nested data file (Parquet) set
- There are differences in schema among the files: the Parquet files exhibit differences both at the first level and within nested data types
- A select * will not cause an exception, but using a limit clause will
- Note also that this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one)

2) Dangling columns (both simple and complex)
This situation can be easily reproduced for:
- Select STAR queries which involve input data with different schemas
- LIMIT and/or PROJECT operators are used
- The data will be read from more than one minor fragment
- This is because individual readers have logic to handle such use-cases, but downstream operators do not
- So if reader-1 sends one batch with F1, F2, and F3, and reader-2 sends a batch with F2 and F3, then the LIMIT and PROJECT operators will fail to clean up the dangling column F1, which will cause failures when a downstream operator's copy logic attempts to copy the stale column F1
- This pull request adds logic to detect and eliminate dangling columns

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6223

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1170.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1170

commit d986b6c7588c107bb7e49d2fc8eb3f25a60e1214
Author: Salim Achouche <sachouche2@...>
Date: 2018-02-21T02:17:14Z

    DRILL-6223: Fixed several Drillbit failures due to schema changes ---
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1168#discussion_r174889287 --- Diff: exec/jdbc-all/pom.xml --- @@ -473,6 +473,8 @@ org/yaml/** hello/** webapps/** + **/org/apache/calcite/avatica/metrics/** + **/org/apache/calcite/avatica/org/** --- End diff -- done! ---
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1168#discussion_r174870229 --- Diff: exec/jdbc-all/pom.xml --- @@ -473,6 +473,8 @@ org/yaml/** hello/** webapps/** + **/org/apache/calcite/avatica/metrics/** + **/org/apache/calcite/avatica/org/** --- End diff -- The Apache jar size has shrunk by ~1MB from over 34MB to 33393541 bytes; I guess we can decrease the limit from 35MB to 34MB. ---
[GitHub] drill issue #1168: DRILL-6246: Reduced the size of the jdbc-all jar file
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1168 @sohami can you please review this pull request? Thanks! ---
[GitHub] drill pull request #1168: DRILL-6246: Reduced the size of the jdbc-all jar f...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1168

DRILL-6246: Reduced the size of the jdbc-all jar file

- The jdbc-all client jar has been growing in size from one version to the next (~20MB in version 1.10, ~27MB in version 1.12, and ~34MB in version 1.13)
- Note that the exact jar size depends on the maven profile used during compilation (e.g., "mapr", which has more excludes)
- Originally, I tried to exclude the following calcite/avatica packages (responsible for the recent size increase): metrics, proto, org/apache/, com/fasterxml, com/google, but some tests failed as the calcite code was referencing some of them
- In this pull request, I have excluded the following packages: calcite/avatica/org and calcite/avatica/metrics

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sachouche/drill DRILL-6246

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1168.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1168

commit 6b6188a6c1677595c479f67fc2e33af091c36818
Author: Salim Achouche <sachouche2@...>
Date: 2018-03-14T02:16:06Z

    DRILL-6246: Reduced the size of the jdbc-all jar file ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r166448594

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---

```diff
@@ -77,7 +77,7 @@ private ExecConstants() {
   public static final String SPILL_DIRS = "drill.exec.spill.directories";

   public static final String OUTPUT_BATCH_SIZE = "drill.exec.memory.operator.output_batch_size";
-  public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new RangeLongValidator(OUTPUT_BATCH_SIZE, 1024, 512 * 1024 * 1024);
+  public static final LongValidator OUTPUT_BATCH_SIZE_VALIDATOR = new RangeLongValidator(OUTPUT_BATCH_SIZE, 1, 512 * 1024 * 1024);
```

--- End diff --

Theoretically, 2GB-1 should be the largest value, as our value vectors use integers as offset values. ---
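The 2GB-1 ceiling follows directly from value vectors addressing their buffers with Java ints; a one-line check:

```java
// Value-vector offsets are Java ints, so the largest addressable byte index
// in a single buffer is Integer.MAX_VALUE = 2^31 - 1.
public class OffsetLimit {
  public static void main(String[] args) {
    long theoreticalMaxBytes = Integer.MAX_VALUE;               // 2,147,483,647
    System.out.println(theoreticalMaxBytes == (1L << 31) - 1);  // true: 2GB - 1
  }
}
```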
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r166450263 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -102,20 +105,78 @@ private final List comparators; private final JoinRelType joinType; private JoinWorker worker; + private final long outputBatchSize; private static final String LEFT_INPUT = "LEFT INPUT"; private static final String RIGHT_INPUT = "RIGHT INPUT"; + private class MergeJoinMemoryManager extends AbstractRecordBatchMemoryManager { +private int leftRowWidth; +private int rightRowWidth; + +/** + * mergejoin operates on one record at a time from the left and right batches + * using RecordIterator abstraction. We have a callback mechanism to get notified + * when new batch is loaded in record iterator. + * This can get called in the middle of current output batch we are building. + * when this gets called, adjust number of output rows for the current batch and + * update the value to be used for subsequent batches. + */ +@Override +public void update(int inputIndex) { + switch(inputIndex) { +case 0: + final RecordBatchSizer leftSizer = new RecordBatchSizer(left); + leftRowWidth = leftSizer.netRowWidth(); + break; +case 1: + final RecordBatchSizer rightSizer = new RecordBatchSizer(right); + rightRowWidth = rightSizer.netRowWidth(); +default: + break; + } + + final int newOutgoingRowWidth = leftRowWidth + rightRowWidth; + + // If outgoing row width is 0, just return. This is possible for empty batches or + // when first set of batches come with OK_NEW_SCHEMA and no data. + if (newOutgoingRowWidth == 0) { +return; + } + + // update the value to be used for next batch(es) + setOutputRowCount(Math.min(ValueVector.MAX_ROW_COUNT, + Math.max(RecordBatchSizer.safeDivide(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR, newOutgoingRowWidth), MIN_NUM_ROWS))); --- End diff -- Our goal is to eventually use Paul's framework for controlling batch sizes. I feel the BatchRecordMemoryManager terminology is a bit ambitious (kind of misleading). ---
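The clamped row-count computation in the hunk above can be made concrete with a worked example; the batch size, fragmentation factor, and row width below are illustrative numbers, not values taken from the PR:

```java
// Worked example of:
//   min(MAX_ROW_COUNT, max(outputBatchSize / fragFactor / newOutgoingRowWidth, MIN_NUM_ROWS))
public class OutputRowCountExample {
  public static void main(String[] args) {
    final int maxRowCount = 64 * 1024;          // 64K row cap per batch
    final int minNumRows = 1;                   // illustrative floor
    final long outputBatchSize = 16L << 20;     // 16 MB target output batch
    final int fragmentationFactor = 2;          // assumed worst-case factor
    final int outgoingRowWidth = 200;           // assumed left + right row width, in bytes

    long rows = Math.min(maxRowCount,
        Math.max(outputBatchSize / fragmentationFactor / outgoingRowWidth, minNumRows));
    // 16 MB / 2 / 200 B = 41,943 rows: under the 64K cap, above the floor.
    // Very wide rows drive the count toward the floor; tiny rows hit the cap.
    System.out.println(rows);
  }
}
```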
[GitHub] drill issue #1011: Drill 1170: Drill-on-YARN
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1011 +1 I have reviewed the code and overall looks good. My main feedback is that the current implementation doesn't currently support secure clusters (at least didn't see any logic associated with that). Yarn applications have issues staying up for a long time because of ticket renewal limitations. We might want to create another enhancement JIRA to support such use-cases. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r165698800

--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---

```diff
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {

   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) {
-    udle.getBytes(index + offset, dst, dstIndex, length);
+    final int BULK_COPY_THR = 1024;
```

--- End diff --

Vlad,
- I had a chat with @bitblender, and he explained that Java was invoking a stub (not a function call) to perform copyMemory; he agreed copyMemory will be slower for small buffers, and the task was to determine the cutoff point
- My tests (I will send you my test) indicate that a length of 1024 bytes is where copyMemory starts performing on par with getByte()

NOTE - I am using JRE 1.8; static buffers initialized once; payload 1MB (1,048,576 bytes) and loop count of 102400; macOS High Sierra; 1 thread, 4GB MX, MS ---
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
GitHub user sachouche reopened a pull request: https://github.com/apache/drill/pull/1106 DRILL-6129: Fixed query failure due to nested column data type change Problem Description - - The Drillbit was able to successfully send batches containing different metadata (for nested columns) - This was the case when one or multiple scanners were involved - The issue happened within the client where value vectors are cached across batches - The load(...) API is responsible for updating values vectors when a new batch arrives - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created - There is a bug with the current implementation where only first level columns are compared Fix - - The fix is to improve the schema diff logic by including nested columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1106 commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4 Author: Salim Achouche <sachouche2@...> Date: 2018-02-01T02:59:58Z DRILL-6129: Fixed query failure due to nested column data type change ---
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
Github user sachouche closed the pull request at: https://github.com/apache/drill/pull/1106 ---
[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1106 Thank you Aman and Paul for reviewing this JIRA. I have created a new JIRA [DRILL-6131](https://issues.apache.org/jira/browse/DRILL-6131) to track the new SchemaComparator utility for sharing schema comparison logic. ---
[GitHub] drill issue #1106: DRILL-6129: Fixed query failure due to nested column data...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1106

Thanks @paul-rogers for the information; I went through the PR and noticed the following:
- BatchSchema invokes MaterializedField.isEquivalent()
- With my fix, both methods consider nested columns, but they have several differences:

1) RecordBatchLoader requires sameness, as this information is used to reuse the value vectors; if the old and new batches are deemed the same, then the value vectors are reloaded using the load(...) API. The metadata had better be the same or a runtime exception will occur
2) The RecordBatchLoader isSame(...) API compares two different Java objects: SerializedField (obtained from protobufs) and the already materialized value vectors' MaterializedField
3) The RecordBatchLoader isSame(...) API tolerates unordered fields (within the same level), but MaterializedField.isEquivalent() does not
4) MaterializedField.isEquivalent() ignores hidden columns such as "$bits" and "$offsets", but RecordBatchLoader isSame(...) does not

I think moving forward, the best way to prevent bugs with regard to schema changes is to maintain a document that establishes all the rules. This will allow QA to refine their tests and catch current limitations. ---
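To make difference 3) concrete — order-insensitive matching of nested fields — here is an illustrative sketch with a hypothetical Field type (it assumes unique field names per level and omits hidden-column filtering):

```java
import java.util.List;

// Order-insensitive, nested schema comparison sketch; Field is hypothetical.
final class SchemaCompareSketch {

  static final class Field {
    final String name;
    final String type;
    final List<Field> children;
    Field(String name, String type, List<Field> children) {
      this.name = name; this.type = type; this.children = children;
    }
  }

  // Two levels are "the same" when every field finds a match by name and
  // type, regardless of position, and the match holds recursively.
  static boolean sameUnordered(List<Field> a, List<Field> b) {
    if (a.size() != b.size()) {
      return false;
    }
    for (Field fa : a) {
      boolean matched = false;
      for (Field fb : b) {
        if (fa.name.equals(fb.name) && fa.type.equals(fb.type)
            && sameUnordered(fa.children, fb.children)) {
          matched = true;
          break;
        }
      }
      if (!matched) {
        return false;
      }
    }
    return true;
  }
}
```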
[GitHub] drill pull request #1106: DRILL-6129: Fixed query failure due to nested colu...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1106 DRILL-6129: Fixed query failure due to nested column data type change Problem Description - - The Drillbit was able to successfully send batches containing different metadata (for nested columns) - This was the case when one or multiple scanners were involved - The issue happened within the client where value vectors are cached across batches - The load(...) API is responsible for updating values vectors when a new batch arrives - The RecordBatchLoader class is used to detect schema changes ; if this is the case, then previous value vectors are discarded and new ones created - There is a bug with the current implementation where only first level columns are compared Fix - - The fix is to improve the schema diff logic by including nested columns You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6129 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1106.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1106 commit 9ffb41f509cd2531e7f3cdf89a66605ec0fdf7a4 Author: Salim Achouche <sachouche2@...> Date: 2018-02-01T02:59:58Z DRILL-6129: Fixed query failure due to nested column data type change ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164532290

--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---

```diff
@@ -703,7 +703,18 @@ protected void _setLong(int index, long value) {

   @Override
   public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) {
-    udle.getBytes(index + offset, dst, dstIndex, length);
+    final int BULK_COPY_THR = 1024;
```

--- End diff --

You are right, I'll put more information on this optimization:
- During code profiling, I noticed that getBytes() doesn't perform well when called with a small length (lower than 1K)
- Its throughput improves as the length increases

Analysis:
- Java exposes two intrinsics for writing to direct memory: putByte and copyMemory
- The JVM is able to inline memory access (no function call) for putByte
- copyMemory is a bulk API, and internally it invokes libc memcpy (which requires a function call)
- The rationale is that we are willing to incur a function call if the associated processing is larger than the overhead; this is almost never the case for small memory accesses. ---
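The threshold logic being discussed can be sketched with a plain ByteBuffer standing in for DrillBuf; the 1024-byte cutoff is the empirically measured crossover mentioned above:

```java
import java.nio.ByteBuffer;

// Sketch of a length-based copy strategy: per-byte below the cutoff (the JIT
// inlines the byte accesses), one bulk native copy above it.
final class CopyThresholdSketch {
  private static final int BULK_COPY_THRESHOLD = 1024;

  static void copyBytes(ByteBuffer src, int srcIndex, byte[] dst, int dstIndex, int length) {
    if (length < BULK_COPY_THRESHOLD) {
      for (int i = 0; i < length; i++) {
        dst[dstIndex + i] = src.get(srcIndex + i); // inlined per-byte path
      }
    } else {
      ByteBuffer view = src.duplicate();           // don't disturb src's position
      view.position(srcIndex);
      view.get(dst, dstIndex, length);             // single bulk copy (memcpy-like)
    }
  }
}
```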
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164525752 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, Can you please clarify your point? Are you suggesting that every memory access has to be via DrillBuf even if the data is loaded in Java heap and the developer is performing an optimization. If this is your goal, then I would think the danger is that the DrillBuf class will be cluttered and contracts hard to understand: - Notice that DrillBuf performs expensive sanity checks (boundary and reference count checks); this is due to Drill memory optimizations for minimizing the overall memory usage. - When the code is processing already load data, these checks are not applicable anymore as this data is private. - I could have easily moved the new APIs there but I strongly felt this was inappropriate.. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r164509658 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Vlad, - The reader has loaded data from direct memory into CPU level-1 cache through DrillBuf - The data is stored (4k at a time) in a Java byte[] - The only reason for using MemoryUtil is to perform 64bit processing (I have previously indicated that many other processing intensive applications such as Snappy and Hadoop Comparison use the same pattern) - I solely added this logic after noticing the Hotspot was generating 8bit instructions Again, the intent is not to access data from DrillBuf (the data has been already loaded into CPU cache); this is mainly an optimization step. ---
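The 8-bytes-at-a-time pattern described here can be sketched without Unsafe, using a little-endian ByteBuffer view over the heap array; this is illustrative only (the PR's MemoryUtils achieves the same effect via sun.misc.Unsafe):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Wide (64-bit) processing of heap byte arrays: compare 8 bytes per step and
// fall back to byte-wise scanning only for the tail or the differing word.
final class WideAccessSketch {

  // First index at which two equal-length arrays differ, or -1 if equal.
  static int firstDifference(byte[] a, byte[] b) {
    ByteBuffer wa = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
    ByteBuffer wb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
    int i = 0;
    for (; i + Long.BYTES <= a.length; i += Long.BYTES) {
      if (wa.getLong(i) != wb.getLong(i)) {
        break; // some byte within this 8-byte word differs
      }
    }
    for (; i < a.length; i++) {
      if (a[i] != b[i]) {
        return i;
      }
    }
    return -1;
  }
}
```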
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828863 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -68,96 +85,441 @@ private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + private final Mutator mutator = new MutatorImpl(); + private final Accessor accessor= new AccessorImpl(); + + <#if type.major == "VarLen" && minor.class == "VarChar"> + private final Mutator dupMutator = new DupValsOnlyMutator(); + /** Accessor instance for duplicate values vector */ + private final Accessor dupAccessor = new DupValsOnlyAccessor(); + /** Optimization for cases where all values are identical */ + private boolean duplicateValuesOnly; + /** logical number of values */ + private int logicalNumValues; + /** logical value capacity */ + private int logicalValueCapacity; + /** Mutator instance for duplicate values vector */ + + /** true if this vector holds the same value albeit repeated */ + public boolean isDuplicateValsOnly() { +return duplicateValuesOnly; + } - private final Mutator mutator = new Mutator(); - private final Accessor accessor = new Accessor(); + /** + * Sets this vector duplicate values mode; the {@link #clear()} method wil also be called as a side effect + * of this operation + */ + public void setDuplicateValsOnly(boolean valsOnly) { +clear(); +duplicateValuesOnly = valsOnly; + } - public ${className}(MaterializedField field, BufferAllocator allocator) { -super(field, allocator); + /** {@inheritDoc} */ + @Override + public int getValueCapacity(){ +if (!isDuplicateValsOnly()) { +return Math.min(bits.getValueCapacity(), values.getValueCapacity()); + } +return logicalValueCapacity; } + /** {@inheritDoc} */ @Override - public FieldReader getReader(){ -return reader; + public void close() { +bits.close(); +values.close(); +super.close(); } + /** {@inheritDoc} */ @Override - public int getValueCapacity(){ -return Math.min(bits.getValueCapacity(), values.getValueCapacity()); + public void clear() { +bits.clear(); +values.clear(); +super.clear(); +if (isDuplicateValsOnly()) { + logicalNumValues = 0; + logicalValueCapacity = 0; + duplicateValuesOnly = false; +} } + /** {@inheritDoc} */ @Override - public DrillBuf[] getBuffers(boolean clear) { -final DrillBuf[] buffers = ObjectArrays.concat(bits.getBuffers(false), values.getBuffers(false), DrillBuf.class); -if (clear) { - for (final DrillBuf buffer:buffers) { -buffer.retain(1); + public int getBufferSizeFor(final int valueCount) { +assert valueCount >= 0; + +if (valueCount == 0) { + return 0; +} +if (!isDuplicateValsOnly()) { + return values.getBufferSizeFor(valueCount) + bits.getBufferSizeFor(valueCount); +} +return values.getBufferSizeFor(1) + bits.getBufferSizeFor(1); + } + + /** {@inheritDoc} */ + @Override + public ${valuesName} getValuesVector() { +if (!isDuplicateValsOnly()) { + return values; +} +throw new UnsupportedOperationException(); + } + + /** {@inheritDoc} */ + @Override + public void setInitialCapacity(int numRecords) { +assert numRecords >= 0; +if (!isDuplicateValsOnly()) { + bits.setInitialCapacity(numRecords); + values.setInitialCapacity(numRecords); +} else { + bits.setInitialCapacity(numRecords > 0 ? 1 : 0); + values.setInitialCapacity(numRecords > 0 ? 
1 : 0); + logicalValueCapacity = numRecords; +} + } + + /** {@inheritDoc} */ + @Override + public Accessor getAccessor() { +if (!isDuplicateValsOnly()) { + return accessor; +} +return dupAccessor; + } + + /** {@inheritDoc} */ + @Override + public Mutator getMutator() { +if (!isDuplicateValsOnly()) { + return mutator; +} +return dupMutator; + } + + public void copyFrom(int fromIndex, int thisIndex, Nullable${minor.class}Vector from) { +final Accessor fromAccessor = from.getAccessor(); +if (isDuplicateValsOnly()) { + // We currently don't have a use-case where a target dup-vector is provisioned by a non-dup source vector + assert isD
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828312 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -51,6 +57,17 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); + /** + * Optimization to set contiguous values nullable state in a bulk manner; cannot define this array + * within the Mutator class as Java doesn't allow static initialization within a non static inner class. + */ + private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10; + private static final byte[] DEFINED_VALUES_ARRAY = new byte[DEFINED_VALUES_ARRAY_LEN]; --- End diff -- I'd rather have it within the subclass that defines the associated functionality (that is, nullable data types); the wasted memory is also negligible. ---
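For illustration, the bulk use of such an array might look like this sketch (assuming DEFINED_VALUES_ARRAY is pre-filled with 1s and writing through the bits vector's buffer; setBytes is Netty's ByteBuf range-copy API):

  // Mark `count` contiguous values as defined (non-null) with one buffer
  // write per 1 KB chunk instead of one write per value.
  void setDefined(UInt1Vector bits, int startIdx, int count) {
    int idx = startIdx;
    for (int remaining = count; remaining > 0; ) {
      int chunk = Math.min(remaining, DEFINED_VALUES_ARRAY_LEN);
      bits.getBuffer().setBytes(idx, DEFINED_VALUES_ARRAY, 0, chunk);
      idx += chunk;
      remaining -= chunk;
    }
  }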
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828788 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -68,96 +85,441 @@ private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + private final Mutator mutator = new MutatorImpl(); + private final Accessor accessor= new AccessorImpl(); + + <#if type.major == "VarLen" && minor.class == "VarChar"> + private final Mutator dupMutator = new DupValsOnlyMutator(); + /** Accessor instance for duplicate values vector */ + private final Accessor dupAccessor = new DupValsOnlyAccessor(); + /** Optimization for cases where all values are identical */ + private boolean duplicateValuesOnly; + /** logical number of values */ + private int logicalNumValues; + /** logical value capacity */ + private int logicalValueCapacity; + /** Mutator instance for duplicate values vector */ + + /** true if this vector holds the same value albeit repeated */ + public boolean isDuplicateValsOnly() { --- End diff -- Tried that (it was my first attempt), but the main issue is that the Drill factory creates vectors based on the column metadata alone. This design has the advantage of enabling / rolling back optimizations transparently to the consumer. I also made sure there is no performance penalty (or at least a minimal one). ---
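To see why the optimization stays transparent to consumers, a fragment sketch of the reader-side toggle (call sites illustrative; the type and methods are from the quoted template):

  // The reader detects that an entire run carries one repeated value and
  // flips the vector into dup mode, storing the value once.
  vector.setDuplicateValsOnly(true);
  vector.getMutator().setSafe(0, value, 0, value.length);
  // getAccessor()/getMutator() now return the DupValsOnly implementations,
  // so downstream consumers keep using the same interfaces unchanged.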
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r162828174 --- Diff: exec/vector/src/main/codegen/templates/NullableValueVectors.java --- @@ -51,6 +57,17 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); + /** + * Optimization to set contiguous values nullable state in a bulk manner; cannot define this array + * within the Mutator class as Java doesn't allow static initialization within a non static inner class. + */ + private static final int DEFINED_VALUES_ARRAY_LEN = 1 << 10; --- End diff -- 1 << 10 is 1,024 (2^10); not sure how you arrived at 10,000,000,000. ---
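As a quick check in plain Java:

  int len = 1 << 10;        // shift the bit pattern of 1 left by 10 positions
  System.out.println(len);  // prints 1024, i.e. 2 to the power of 10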
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162124359 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. 
+ for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); -final int txtLength = end - start; +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + if (secondInByte == secondPattByte) { +return 1; + } +} + } return 0; } + } + /** Handles patterns with length three */ + private final class MatcherT
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162120115 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); --- End diff -- removed. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162123690 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; --- End diff -- Good idea; done this for all the matchers. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162119757 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ --- End diff -- cut-and-paste issue :-) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162121115 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; --- End diff -- renamed all such variables to XXXPatternBytes ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r162124388 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. 
+ for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); -final int txtLength = end - start; +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + if (secondInByte == secondPattByte) { +return 1; + } +} + } return 0; } + } + /** Handles patterns with length three */ + private final class MatcherT
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060 @paul-rogers with regard to the design aspects that you brought up: ** Corrections about the proposed Design ** Your analysis somehow assumes that the Vector is the one driving the loading operation. This is not the case: a) it is the reader that invokes the bulk save API, and b) the Vector save method runs a loop requesting the rest of the data. The reader can stop the loading phase at any time; currently the reader uses the batch size to make this decision. The plan is to enhance this logic by having feedback from the Vector save method about memory usage (e.g., save the current row set on the condition that memory usage doesn't exceed X bytes). I believe I raised this design decision early on, and the agreement was to implement the memory batch restrictions in a later Drill release. ** Agile Development ** Through the years I have come to appreciate the value of agile development. One needs to prioritize design activities; it is completely valid to start with a design (which is not guaranteed to be perfect) and then refine it over time, as long as the underlying implementation doesn't spill into other modules (encapsulation). This has the merit of showcasing incremental improvements and of letting developers gain new insights as their understanding deepens. ---
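A minimal sketch of that control flow (hypothetical reader-side names; the memory-usage feedback is the planned enhancement, not shipped code):

  // The reader drives loading; the vector only consumes what it is handed.
  while (reader.hasMoreData() && rowCount < targetBatchSize) {
    VLBulkInput input = reader.nextBulkEntry();  // reader-side decision point
    mutator.setSafe(input);                      // bulk save API from this patch
    rowCount += input.getRecordCount();          // hypothetical count accessor
    // Planned enhancement: consult a memory-usage callback here and stop
    // early once the batch would exceed its byte budget.
  }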
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161065237 --- Diff: protocol/src/main/protobuf/UserBitShared.proto --- @@ -148,6 +148,8 @@ message SerializedField { optional int32 value_count = 4; optional int32 var_byte_length = 5; optional int32 buffer_length = 7; + optional bool is_dup = 8; + optional int32 logical_value_count = 9; --- End diff -- Just had an offline conversation with Paul and Kunal. The agreement is that we should introduce a versioning mechanism at the protocol level so that clients can advertise their version identifier; this way, the server can use this information to turn features on / off based on the client's capabilities. The task is only to create the preliminary versioning infrastructure; more sophistication can be added later on. ---
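A sketch of how the server side might gate on that capability (handshake plumbing omitted; the builder setters follow protobuf's generated Java naming for the two new fields above, while the version constant and count variable are illustrative):

  // Only emit the new SerializedField metadata to clients that understand it.
  if (clientVersion >= DUP_VECTOR_MIN_VERSION) {
    fieldBuilder.setIsDup(vector.isDuplicateValsOnly());
    fieldBuilder.setLogicalValueCount(logicalValueCount);
  }
  // Older clients receive the pre-existing wire format unchanged.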
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161040070 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -309,7 +314,7 @@ public void setInitialCapacity(final int valueCount) { if (size > MAX_ALLOCATION_SIZE) { throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); } -allocationSizeInBytes = (int) size; +allocationSizeInBytes = (int)size; --- End diff -- Fixed. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161039122 --- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java --- @@ -874,6 +880,46 @@ public void setSafe(int index, BigDecimal value) { set(index, value); } +/** + * Copies the bulk input into this value vector and extends its capacity if necessary. + * @param input bulk input + */ +public void setSafe(VLBulkInput input) { + setSafe(input, null); +} + +/** + * Copies the bulk input into this value vector and extends its capacity if necessary. The callback + * mechanism allows decoration as caller is invoked for each bulk entry. + * + * @param input bulk input + * @param callback a bulk input callback object (optional) + */ +public void setSafe(VLBulkInput input, VLBulkInput.BulkInputCallback callback) { --- End diff -- This code is not Parquet-specific. Instead, it can be triggered by any reader that wants to load data in a bulk fashion. Vectors currently expose Mutator APIs for loading single values; I see no good reason not to pass bulk values instead of a single value at a time, especially since the one-at-a-time pattern blocks code optimization. Look at the ByteBuffer APIs: they allow you to pass single byte values but also byte arrays. ---
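The ByteBuffer analogy, concretely (standard JDK API; runnable as-is in jshell):

  import java.nio.ByteBuffer;

  ByteBuffer buf = ByteBuffer.allocate(1024);
  buf.put((byte) 1);                // single-value API: one call per byte
  byte[] chunk = new byte[512];
  buf.put(chunk, 0, chunk.length);  // bulk API: one call and, on heap buffers,
                                    // a single internal System.arraycopy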
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161036145 --- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/util/MemoryUtils.java --- @@ -0,0 +1,186 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.util; + +import java.lang.reflect.Field; +import java.nio.ByteOrder; + +import sun.misc.Unsafe; + +/** Exposes advanced Memory Access APIs for Little-Endian / Unaligned platforms */ +@SuppressWarnings("restriction") +public final class MemoryUtils { + + // Ensure this is a little-endian hardware */ + static { +if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); +} + } + + /** Java's unsafe object */ + private static Unsafe UNSAFE; + + static { +try { + Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe"); + theUnsafe.setAccessible(true); + UNSAFE = (Unsafe) theUnsafe.get(null); +} +catch (Exception e) { + throw new RuntimeException(e); +} + } + + /** Byte arrays offset */ + private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class); + + /** Number of bytes in a long */ + public static final int LONG_NUM_BYTES = 8; + /** Number of bytes in an int */ + public static final int INT_NUM_BYTES = 4; + /** Number of bytes in a short */ + public static final int SHORT_NUM_BYTES = 2; + +// +// APIs +// + + /** + * @param data source byte array + * @param index index within the byte array + * @return short value starting at data+index + */ + public static short getShort(byte[] data, int index) { --- End diff -- Please refer to my reply to Parth. I personally think the memory package should be our main abstraction (of course, along with guidelines). Putting all memory access APIs within the DrillBuf class will only clutter that class. I'd rather have a centralized memory utility class for advanced processing that does not involve a DrillBuf. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161045748 --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java --- @@ -386,7 +391,7 @@ public void reAlloc() { throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); } -logger.trace("Reallocating VarChar, new size {}", newAllocationSize); +logger.trace("Reallocating VarChar, new size {}",newAllocationSize); --- End diff -- Fixed. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161030963 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java --- @@ -449,7 +451,7 @@ public void close() throws Exception { // been created. Gracefully handle that case. if (options != null) { - options.close(); -} +options.close(); } } +} --- End diff -- The issue was with the close() method, which was not properly indented. Fixed it. ---
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161032779 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/NullableColumnReader.java --- @@ -165,17 +181,133 @@ + "Run Length: {} \t Null Run Length: {} \t readCount: {} \t writeCount: {} \t " + "recordsReadInThisIteration: {} \t valuesReadInCurrentPass: {} \t " + "totalValuesRead: {} \t readStartInBytes: {} \t readLength: {} \t pageReader.byteLength: {} \t " - + "definitionLevelsRead: {} \t pageReader.currentPageCount: {}", + + "currPageValuesProcessed: {} \t pageReader.currentPageCount: {}", recordsToReadInThisPass, runLength, nullRunLength, readCount, writeCount, recordsReadInThisIteration, valuesReadInCurrentPass, totalValuesRead, readStartInBytes, readLength, pageReader.byteLength, - definitionLevelsRead, pageReader.currentPageCount); + currPageValuesProcessed, pageReader.currentPageCount); } valueVec.getMutator().setValueCount(valuesReadInCurrentPass); } + private final void processPagesBulk(long recordsToReadInThisPass) throws IOException { +readStartInBytes = 0; +readLength = 0; +readLengthInBits = 0; +recordsReadInThisIteration = 0; +vectorData = castedBaseVector.getBuffer(); + +// values need to be spaced out where nulls appear in the column +// leaving blank space for nulls allows for random access to values +// to optimize copying data out of the buffered disk stream, runs of defined values +// are located and copied together, rather than copying individual values + +int valueCount = 0; +final int maxValuesToProcess = Math.min((int) recordsToReadInThisPass, valueVec.getValueCapacity()); + +// To handle the case where the page has been already loaded +if (pageReader.definitionLevels != null && currPageValuesProcessed == 0) { + definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount); +} + +while (valueCount < maxValuesToProcess) { + + // read a page if needed + if (!pageReader.hasPage() || (currPageValuesProcessed == pageReader.currentPageCount)) { +if (!pageReader.next()) { + break; +} + +//New page. Reset the definition level. +currPageValuesProcessed= 0; +recordsReadInThisIteration = 0; +readStartInBytes = 0; + +// Update the Definition Level reader +definitionLevelWrapper.set(pageReader.definitionLevels, pageReader.currentPageCount); + } + + definitionLevelWrapper.readFirstIntegerIfNeeded(); + + int numNullValues = 0; + int numNonNullValues= 0; + final int remaining = maxValuesToProcess - valueCount; + int currBatchSz = Math.min(remaining, (pageReader.currentPageCount - currPageValuesProcessed)); + assert currBatchSz > 0; + + // Let's skip the next run of nulls if any ... + int idx; + for (idx = 0; idx < currBatchSz; ++idx) { +if (definitionLevelWrapper.readCurrInteger() == 1) { + break; // non-value encountered +} +definitionLevelWrapper.nextIntegerIfNotEOF(); + } + numNullValues += idx; + + // Write the nulls if any --- End diff -- This is the original logic for processing fixed-precision columns (only more efficient); the intent is to figure out how many null values there are before processing the non-null values. We could have avoided this for-loop if we controlled the Parquet ValuesReader API; I was tempted to do that but decided to prioritize my tasks. The good news is that I have recorded all such improvement opportunities and will try to tackle them at some point. ---
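The run detection boils down to scanning definition levels for alternating runs of 0s (null) and 1s (defined) so that each run can be written in bulk; a standalone simplification of the quoted loop:

  // Given definition levels for a batch (0 = null, 1 = defined), count the
  // leading null run so it can be skipped over in one step.
  static int countLeadingNulls(int[] defLevels, int offset, int batchSize) {
    int idx = 0;
    while (idx < batchSize && defLevels[offset + idx] == 0) {
      idx++;  // extend the null run
    }
    return idx;  // nulls occupy [offset, offset + idx)
  }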
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r161028252 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java --- @@ -471,8 +519,8 @@ public void close() throws Exception { container.clear(); mutator.clear(); if (currentReader != null) { - currentReader.close(); -} +currentReader.close(); + } --- End diff -- done. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160759375 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. 
- for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160763490 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. 
- for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160753323 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. 
- for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160765882 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/expr/fn/impl/TestSqlPatterns.java --- @@ -446,6 +446,61 @@ public void testSqlPatternComplex() { assertEquals(1, sqlPatternComplex.match(0, byteBuffer.limit(), drillBuf)); // should match } + @Test + public void testSqlPatternContainsMUltipleMatchers() { --- End diff -- Paul, the test suite already has many other tests for SQL matching. But I agree, they might now hit only a few of these matchers. I will add more tests. ---
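A sketch of the kind of added coverage being discussed (illustrative patterns; the real tests build DrillBuf inputs the way the surrounding suite does):

  // One contains-pattern per matcher implementation: lengths 0, 1, 2, 3,
  // 4-9 (MatcherN), and 10+ (BoyerMooreMatcher).
  String[] patterns = { "", "a", "ab", "abc", "abcde", "abcdefghij" };
  for (String pattern : patterns) {
    SqlPatternContainsMatcher matcher = new SqlPatternContainsMatcher(pattern);
    // assert matcher.match(...) == 1 for an input containing the pattern,
    // and == 0 for an input that does not.
  }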
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160755097 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,283 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 1) { + matcherFcn = new Matcher1(); +} else if (patternLength == 2) { + matcherFcn = new Matcher2(); +} else if (patternLength == 3) { + matcherFcn = new Matcher3(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class Matcher1 extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern string - return 1; +private Matcher1() { + super(); } -final int txtLength = end - start; +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } return 0; } + } + /** Handles patterns with length two */ + private final class Matcher2 extends MatcherFcn { -final int outerEnd = txtLength - patternLength; +private Matcher2() { + super(); +} -outer: -for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) { +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; // simplePattern string has meta characters i.e % and _ and escape characters removed. // so, we can just directly compare. 
- for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) { -if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) { - continue outer; + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); + +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); + + if (secondInByte == secondPattByte) { +return 1; + } } } + return 0; +} + } + + /** Handles patterns with length three */ + private final class Matcher3
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160758370
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; the archived message is truncated and the review comment did not survive) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160764681
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; the archived message is truncated and the review comment did not survive) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r160754694
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+    } else if (patternLength == 3) {
--- End diff --
Good point Paul! ---
[GitHub] drill pull request #1001: JIRA DRILL-5879: Like operator performance improve...
Github user sachouche closed the pull request at: https://github.com/apache/drill/pull/1001 ---
[GitHub] drill issue #1001: JIRA DRILL-5879: Like operator performance improvements
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1001 Created another pull request, #1072, to merge my changes with Padma's. ---
[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1087 Thank you, Arina, for catching this; I created the commit for QA before my vacation so that they could verify the fix. At that time, I didn't have an Apache JIRA. I have now updated the comment to reflect the JIRA id DRILL-6079, which is also the name of this remote branch. ---
[GitHub] drill issue #1087: Attempt to fix memory leak in Parquet
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1087 @parthchandra can you please review this fix? ---
[GitHub] drill pull request #1087: Attempt to fix memory leak in Parquet
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1087 Attempt to fix memory leak in Parquet
**Problem Description**
This is an extremely rare leak which I was able to emulate by putting a sleep in the AsyncPageReader right after reading the page and before enqueuing the result. This is how the issue could manifest itself in a real-life scenario:
- The AsyncPageReader reads a page into a buffer but has not yet enqueued the result (the thread got preempted)
- The Parquet Scan thread blocks waiting on the task (the Future object is dequeued)
- A cancel is received and the Scan thread is interrupted
- Future.get() exits via the interrupt (the already-dequeued Future object is lost)
- The Scan thread executes the release logic
- The Scan thread is not able to interrupt the AsyncPageReader thread since the Future object is lost
- The AsyncPageReader thread resumes and enqueues the DrillBuf in the result queue
- This results in a leak since the buffer is never released
**Fix Description**
- The fix is straightforward: we peek the Future object during the blocking get() call
- This way, an exception (such as an interrupt) will leave the Future object in the task queue
- The cleanup logic can then guarantee the DrillBuf object is released by either the AsyncPageReader or the Parquet Scan thread
You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-6079 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1087 commit 52030d1d9cc3b8992a10ade8c7126d66e785043a Author: Salim Achouche <sachouche2@...> Date: 2017-12-22T19:50:56Z Attempt to fix memory leak in Parquet ---
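A minimal sketch of the peek-then-remove idea described above, under hypothetical names (TaskQueue is not Drill's actual class; the real fix lives in the async Parquet reader): because get() blocks on the head Future without dequeuing it first, an interrupt leaves the Future visible to the cleanup path, which can still cancel it and release its buffer.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical sketch of the peek-then-remove pattern described above. */
final class TaskQueue<T> {
  private final Queue<Future<T>> tasks = new ConcurrentLinkedQueue<>();

  void submit(Future<T> task) {
    tasks.add(task);
  }

  /** Blocks on the head task WITHOUT dequeuing it first; on interrupt the
   *  Future stays in the queue so the cleanup path can still find it. */
  T takeResult() throws InterruptedException, ExecutionException {
    Future<T> head = tasks.peek();   // peek, do not remove yet
    if (head == null) {
      throw new IllegalStateException("no pending task");
    }
    T result = head.get();           // may throw InterruptedException
    tasks.remove(head);              // remove only after a successful get()
    return result;
  }

  /** Cleanup path: every unconsumed task is still visible here. */
  void close() {
    for (Future<T> task : tasks) {
      task.cancel(true);             // interrupt the producer if still running
    }
    tasks.clear();
  }
}
```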
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1060#discussion_r160514755
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/VLAbstractEntryReader.java ---
@@ -0,0 +1,215 @@
+/***
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ **/
+package org.apache.drill.exec.store.parquet.columnreaders;
+
+import java.nio.ByteBuffer;
+
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.ColumnPrecisionInfo;
+import org.apache.drill.exec.store.parquet.columnreaders.VLColumnBulkInput.PageDataInfo;
+import org.apache.drill.exec.util.MemoryUtils;
+
+/** Abstract class for sub-classes implementing several algorithms for loading a Bulk Entry */
+abstract class VLAbstractEntryReader {
+
+  /** byte buffer used for buffering page data */
+  protected final ByteBuffer buffer;
+  /** Page Data Information */
+  protected final PageDataInfo pageInfo;
+  /** expected precision type: fixed or variable length */
+  protected final ColumnPrecisionInfo columnPrecInfo;
+  /** Bulk entry */
+  protected final VLColumnBulkEntry entry;
+
+  /**
+   * CTOR.
+   * @param _buffer byte buffer for data buffering (within CPU cache)
+   * @param _pageInfo page being processed information
+   * @param _columnPrecInfo column precision information
+   * @param _entry reusable bulk entry object
+   */
+  VLAbstractEntryReader(ByteBuffer _buffer,
+      PageDataInfo _pageInfo,
+      ColumnPrecisionInfo _columnPrecInfo,
+      VLColumnBulkEntry _entry) {
+
+    this.buffer = _buffer;
+    this.pageInfo = _pageInfo;
+    this.columnPrecInfo = _columnPrecInfo;
+    this.entry = _entry;
+  }
+
+  /**
+   * @param valsToReadWithinPage maximum values to read within the current page
+   * @return a bulk entry object
+   */
+  abstract VLColumnBulkEntry getEntry(int valsToReadWithinPage);
+
+  /**
+   * Indicates whether to use bulk processing
+   */
+  protected final boolean bulkProcess() {
+    return columnPrecInfo.bulkProcess;
+  }
+
+  /**
+   * Loads new data into the buffer if empty or the force flag is set.
+   *
+   * @param force flag to force loading new data into the buffer
+   */
+  protected final boolean load(boolean force) {
+
+    if (!force && buffer.hasRemaining()) {
+      return true; // NOOP
+    }
+
+    // We're here either because the buffer is empty or we want to force a new load operation.
+    // In the case of force, there might be unprocessed data (still in the buffer) which is fine
+    // since the caller updates the page data buffer's offset only for the data it has consumed; this
+    // means unread data will be loaded again but this time will be positioned in the beginning of the
+    // buffer. This can happen only for the last entry in the buffer when either its length or value
+    // is incomplete.
+    buffer.clear();
+
+    int remaining = remainingPageData();
+    int toCopy = remaining > buffer.capacity() ? buffer.capacity() : remaining;
+
+    if (toCopy == 0) {
+      return false;
+    }
+
+    pageInfo.pageData.getBytes(pageInfo.pageDataOff, buffer.array(), buffer.position(), toCopy);
--- End diff --
@parthchandra I know this is counterintuitive, but this path makes processing faster, and I have VTune data to prove it; let me explain:
- The logic is to fetch small chunks of data from main memory into the L1 cache (using 4k buffers) for processing
- This logic doesn't introduce any extra memory-bandwidth penalty since, in all cases, data has to move from main memory to the CPU and back to main memory (if modified)
- Note that the CPU
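The staging pattern described in the (truncated) comment above can be sketched as follows. This is a minimal illustration assuming a source that can copy a byte range into a heap array (as DrillBuf.getBytes does); ChunkedConsumer, Source, and processChunk are hypothetical names, not Drill's classes.

```java
import java.nio.ByteBuffer;

/** Minimal sketch of staging direct memory through a small heap buffer so the
 *  hot loop works on data that fits in the L1 cache. Hypothetical names. */
final class ChunkedConsumer {
  private static final int CHUNK_SIZE = 4096;              // roughly L1-sized
  private final ByteBuffer buffer = ByteBuffer.allocate(CHUNK_SIZE);

  interface Source {
    int length();
    void copyTo(int srcOffset, byte[] dst, int dstOffset, int len);
  }

  void consume(Source src) {
    int offset = 0;
    while (offset < src.length()) {
      int toCopy = Math.min(CHUNK_SIZE, src.length() - offset);
      src.copyTo(offset, buffer.array(), 0, toCopy);       // one bulk copy
      processChunk(buffer.array(), toCopy);                // cache-resident work
      offset += toCopy;
    }
  }

  private void processChunk(byte[] chunk, int len) {
    // placeholder for the actual per-byte / per-word processing
  }
}
```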
[GitHub] drill issue #1060: DRILL-5846: Improve parquet performance for Flat Data Typ...
Github user sachouche commented on the issue: https://github.com/apache/drill/pull/1060
Before I reply to the provided comments, I first want to thank both Parth and Paul for taking the time to review this pull request.
@parthchandra Regarding the High Level Comments
- FS Access Manager
The attached document listed several possible improvements in terms of CPU, Partitioning (changes to the query plan), and the IO layer. I didn't want to club all the changes into a single pull request; instead, I started with the pull request with the highest priority (in terms of performance improvement), which is the CPU related enhancements.
- Drill Memory Package
I understand where you're coming from; I have analyzed this task and came up with a solution which I felt was the cleanest. Let me take you through my thinking process and the alternative solutions that I considered:
a) Why did I use these low-level APIs?
The Bulk Parquet reader uses the Drill Buffer APIs to copy data chunks to the CPU cache; this means that memory sanity checks are always performed. During code profiling I noticed the HotSpot-generated code was using too many instructions (byte-based assembly); I researched ways to improve this and found some Hadoop and compression Java code which utilized 64-bit access instructions when manipulating byte arrays (e.g., [Snappy](https://github.com/airlift/aircompressor/blob/master/src/main/java/io/airlift/compress/snappy/SnappyRawCompressor.java)). This makes a huge difference.
b) Safety
As indicated previously, native data is always accessed through the Drill Buffer APIs (that is, accessed data is sanitized). During bulk processing, the new utility has additional checks which are triggered when assertions are enabled. This means the test-suite and all QA tests (which run with assertions on) will be able to catch any faulty memory access; note also that the checks are simpler since they only pertain to byte-array boundary checks.
c) Abstraction
I agree that the Drill code base should follow a common code path when accessing native memory (to minimize the chances of introducing bugs), though I felt the Drill Memory package was the main abstraction since it is centralized and thus easier to test.
d) Alternatives
I considered using Netty's io.netty.util.internal.PlatformDependent0 to access the Java intrinsics APIs, though this class has package scope and the public wrapper class io.netty.util.internal.PlatformDependent doesn't expose this low-level functionality. Thus the workaround would have been to create a MemoryUtils class under the io.netty.util.internal package. I didn't think this was a cleaner approach than just using Java's intrinsic APIs directly (note also that Java 9 will expose such intrinsics under a public JDK package instead of sun.misc). Please let me know if you would rather take the io.netty.util.internal route instead.
- Changes to the Vector Code
I had a few discussions with Paul about the need to reconcile DRILL-5657; the agreement was to proceed with the current implementation and then apply refactoring (in future releases), as my changes are backward compatible (new public APIs added).
- Arrow Integration
I will create a task to analyze the impact of the Arrow Integration on the Parquet Reader.
- UserBitShared.proto
This class was updated to deal with the implicit columns optimization. I added two fields to the NullableVarChar class to capture the logical number of rows and the fact that this feature was active. This allowed me to use different Accessor and Mutator implementations.
You and I had a discussion about this, and the agreement was:
a) The new optimization is configurable (the current default is off), though we might change the default at some point
b) Detect C++ based clients and turn off this optimization (that is, the implicit columns enhancements); I didn't implement this task yet as I forgot the name of the property that you indicated (this should be quite fast to implement, though)
- Testing
Yes, QA helped me run the test-suite with the two new options enabled. The new changes didn't modify any of the encoding/compression logic, but I'll be happy to add any test(s) you deem necessary.
- Code Style
Thank you for the pointers; I'll stick to your feedback in future changes to ease the review process. ---
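As an aside on the 64-bit access point in the comment above, here is an illustrative word-at-a-time comparison over heap byte arrays. It shows the general technique only; it is not Drill's MemoryUtils API, and it uses plain ByteBuffer reads rather than intrinsics.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: compares byte ranges 8 bytes at a time. */
final class WordCompare {
  static boolean rangeEquals(byte[] a, int aOff, byte[] b, int bOff, int len) {
    ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.nativeOrder());
    ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.nativeOrder());
    int i = 0;
    for (; i + Long.BYTES <= len; i += Long.BYTES) {  // 8 bytes per compare
      if (ba.getLong(aOff + i) != bb.getLong(bOff + i)) {
        return false;
      }
    }
    for (; i < len; i++) {                            // remaining tail bytes
      if (a[aOff + i] != b[bOff + i]) {
        return false;
      }
    }
    return true;
  }
}
```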
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158086368
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; the archived message is truncated and the review comment did not survive) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158085723
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; the archived message is truncated and the review comment did not survive) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158084806
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; the archived message is truncated and the review comment did not survive) ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158080247
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+          final byte secondInByte = drillBuf.getByte(start + idx + 1);
--- End diff --
same answer. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158080071
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+      final int lengthToProcess = end - start - 1;
--- End diff --
This is not a problem: the for-loop condition (idx = 0; idx < lengthToProcess; ...) prevents any processing from taking place unless the input is long enough. The key point here is that the input length is always given by the "end - start" formula. ---
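A tiny self-contained check of the bound discussed above (illustrative only): for the two-byte matcher, an input shorter than two bytes yields lengthToProcess <= 0, so the loop body never runs.

```java
public class BoundCheck {
  public static void main(String[] args) {
    int start = 5, end = 6;                 // a one-byte input slice
    int lengthToProcess = end - start - 1;  // == 0 for the two-byte matcher
    for (int idx = 0; idx < lengthToProcess; idx++) {
      throw new AssertionError("never reached");
    }
    System.out.println("loop body skipped as expected");
  }
}
```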
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158074965
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+        if (firstPattByte != inputByte) {
--- End diff --
These optimizations are useless with modern compilers. This would have been useful if the inputByte value were the same; then we would save a few instructions. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158074289
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+      assert patternByteBuffer.hasArray();
--- End diff --
HeapByteBuffer will always create an internal backing array; an empty string will simply have a zero-length byte array. ---
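This JDK behavior is easy to confirm: a heap ByteBuffer created via allocate() always exposes a backing array, even at capacity zero (illustrative check only):

```java
import java.nio.ByteBuffer;

public class HeapBufferArrayDemo {
  public static void main(String[] args) {
    ByteBuffer empty = ByteBuffer.allocate(0);   // heap buffer for ""
    System.out.println(empty.hasArray());        // prints: true
    System.out.println(empty.array().length);    // prints: 0
  }
}
```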
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user sachouche commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158072762
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---
@@ -19,44 +19,283 @@ (same hunk as quoted above; comment anchored at the line below)
+    } else if (patternLength == 3) {
--- End diff --
will do. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1072 DRILL-5879: Improved SQL Pattern Contains Performance
**BACKGROUND**
- The goal of JIRA [DRILL-5879](https://issues.apache.org/jira/browse/DRILL-5879) is to improve the "Contains" pattern performance
- [DRILL-5899](https://issues.apache.org/jira/browse/DRILL-5899) (sub-task) was created subsequently to avoid the ASCII / Unicode pre-processing overhead
- This pull request addresses the algorithmic part of this functionality
**ANALYSIS**
- Contains has O(n*m) complexity
- There are two ways to optimize the associated runtime:
1) Minimize the number of instructions, improve pipelining, and avoid CPU stalls
2) Use smarter algorithms (compared to the naive one); for example, the Boyer-Moore algorithm (which is implemented by several popular open-source tools such as grep)
**IMPLEMENTATION**
Our approach combines both suggestions listed in the analysis:
- Created five matchers that are chosen based on the pattern length
- The first three are based on suggestion 1) and target patterns with length [1..4[
- The fourth one has a runtime similar to the current implementation and targets patterns with length [4..10[
- We use the Boyer-Moore algorithm for patterns with a length larger than 9 bytes
NOTE - the JDK doesn't use this algorithm for two main reasons:
- Two extra arrays are necessary, with a size relative to the supported character set and the pattern length (this would be particularly costly for Unicode as it would require 64k entries)
- Each Contains (or indexOf) invocation would incur memory and initialization overhead
Drill doesn't have this issue since a) the initialization overhead is amortized because the pattern will be matched against many input values, and b) our Contains logic is centered around bytes, so the memory overhead is around 256 integers per fragment
**PERFORMANCE IMPROVEMENTS**
- We observe at least a 25% improvement per Contains operation for matchers with a pattern length lower than 4; 100% for negative cases (as the code never accesses a secondary for-loop)
- It was noticed that the Boyer-Moore algorithm performs poorly for small patterns, as lookup accesses erase the associated optimizations; the algorithm performs extremely well as the pattern length increases since, unlike the naive implementation, it is able to use the lookup table to jump beyond the next character (the matching phase has already gained some insight)
- One popular introduction to this algorithm is the following use-case - Input: - Pattern: aax
You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill DRILL-5879 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1072 commit 50ac5140420057692cac8a33f181eb16580a28e6 Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-12-13T22:24:45Z DRILL-5879: Improved SQL Pattern Contains Performance ---
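As background for the NOTE above, a bad-character-only (Horspool) variant of Boyer-Moore over raw bytes looks roughly like the sketch below. This is illustrative only, not the code in this PR; it assumes a non-empty pattern and shows the 256-entry skip table that makes long-pattern matching fast.

```java
import java.util.Arrays;

/** Illustrative Boyer-Moore-Horspool over bytes (bad-character rule only). */
final class BmhMatcher {
  private final byte[] pattern;
  private final int[] skip = new int[256];     // one entry per byte value

  BmhMatcher(byte[] pattern) {                 // assumes pattern.length >= 1
    this.pattern = pattern;
    Arrays.fill(skip, pattern.length);
    for (int i = 0; i < pattern.length - 1; i++) {
      skip[pattern[i] & 0xFF] = pattern.length - 1 - i;  // distance to pattern end
    }
  }

  /** @return index of the first match in input, or -1 if none */
  int indexOf(byte[] input) {
    final int m = pattern.length;
    int pos = 0;
    while (pos + m <= input.length) {
      int j = m - 1;
      while (j >= 0 && input[pos + j] == pattern[j]) {
        j--;                                   // compare right-to-left
      }
      if (j < 0) {
        return pos;                            // full match
      }
      pos += skip[input[pos + m - 1] & 0xFF];  // jump using the lookup table
    }
    return -1;
  }
}
```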
[GitHub] drill pull request #1060: DRILL-5846: Improve parquet performance for Flat D...
GitHub user sachouche opened a pull request: https://github.com/apache/drill/pull/1060 DRILL-5846: Improve parquet performance for Flat Data Types
Performance improvements for the Parquet Scanner (Flat Data Types). There are two flags to control this performance enhancement (both disabled by default):
Option I -
Config Name: store.parquet.flat.reader.bulk
Config Type: boolean
Description: Enables bulk processing to minimize memory checks and improve JVM HotSpot optimizations
Option II -
Config Name: scan.optimized.implicit.columns
Config Type: boolean
Description: Memory optimization when storing duplicate values (implicit columns have duplicate values within a batch). Code profiling indicated this step represented one third of Parquet processing (when implicit columns are processed).
You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachouche/drill drill-parquet-improv Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1060.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1060 commit 0a3a8b053be85c570ff24237d4737f37668383bd Author: Salim Achouche <sachouc...@gmail.com> Date: 2017-12-04T01:53:08Z DRILL-5846: Improve parquet performance for Flat Data Types ---
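A hedged usage sketch for the two options above, set for a session through Drill's JDBC driver. The option names are taken from this PR description (the final names may differ), and the embedded zk=local URL assumes the drill-jdbc driver is on the classpath; adjust for a real cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableBulkReader {
  public static void main(String[] args) throws Exception {
    // Embedded-mode connection; a cluster would use jdbc:drill:zk=<hosts>
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement()) {
      stmt.execute("ALTER SESSION SET `store.parquet.flat.reader.bulk` = true");
      stmt.execute("ALTER SESSION SET `scan.optimized.implicit.columns` = true");
    }
  }
}
```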