[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184596895
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java ---
@@ -77,4 +83,46 @@ public long getByteCount() {
   public boolean isAckSent() {
     return ackSent.get();
   }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory
+   * accounting (that is, the operator should be charged with the body's DrillBuf memory).
+   *
+   * NOTES -
+   * <ul>
+   * <li>This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner</li>
+   * <li>When transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper
+   * DrillBuf reference count accounting</li>
+   * <li>The RPC handling code caches a reference to this RawFragmentBatch object instance; release()
+   * calls should be routed to the previous DrillBuf</li>
+   * </ul>
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has
+   *         been switched to the target allocator); otherwise this operation is a NOOP (current instance
+   *         returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) {
+    if (body == null) {
+      return this; // NOOP
+    }
+
+    if (!body.getLedger().isOwningLedger()
+        || body.getLedger().isOwner(targetAllocator)) {
+      return this;
+    }
+
+    int writerIndex = body.writerIndex();
+    TransferResult transferResult = body.transferOwnership(targetAllocator);
+
+    // Set the index and increment reference count
+    transferResult.buffer.writerIndex(writerIndex);
+
+    // Clear the current DrillBuf since the caller will perform release() on the new one
+    body.release();
+
+    return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false);
--- End diff --

This actually raises the question of why `newRawFragmentBatch` is released in 
`IncomingBuffers.batchArrived()` instead of releasing `transferredBuffer` after 
`RawFragmentBatch` is constructed in `newRawFragmentBatch`.


---


[jira] [Created] (DRILL-6362) typeof() lies about types

2018-04-26 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6362:
--

 Summary: typeof() lies about types
 Key: DRILL-6362
 URL: https://issues.apache.org/jira/browse/DRILL-6362
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Paul Rogers


Drill provides a {{typeof()}} function that returns the type of a column. But, 
it seems to make up types. Consider the following input file:

{noformat}
{a: true}
{a: false}
{a: null}
{noformat}

Consider the following two queries:
{noformat}
SELECT a FROM `json/boolean.json`;
++
|   a|
++
| true   |
| false  |
| null   |
++
> SELECT typeof(a) FROM `json/boolean.json`;
+-+
| EXPR$0  |
+-+
| BIT |
| BIT |
| NULL|
+-+
{noformat}

Notice that the values are reported as BIT, but I believe the actual type is 
UInt1 (the BIT vector is, I believe, deprecated). Also, the function reports 
NULL instead of the actual type for the null value.

Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
to muddle the type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6361) Provide a dataTypeOf() or modeOf() function

2018-04-26 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6361:
--

 Summary: Provide a dataTypeOf() or modeOf() function
 Key: DRILL-6361
 URL: https://issues.apache.org/jira/browse/DRILL-6361
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Paul Rogers


Drill provides a {{typeof()}} function to return the type of a column. The 
returned string, however, has only the base data type. A Drill data type (a 
"major type") also includes a cardinality (a "mode"). For example, {{Optional 
Int}} or {{Required VarChar}}.

This type information is useful for handling data conversions. For example, if 
I could tell that a column value was a {{Nullable Int}}, I could guess that it 
is one Drill invented, and I could merge it, by hand, with the type from 
another file that had actual values.

The two options are equivalent. Either provide a {{modeOf()}} to just return 
the cardinality, or a {{dataTypeOf()}} that returns both. (Perhaps {{modeOf()}} 
would be more useful.)

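A sketch of how such a function might look, patterned on the simple-UDF style 
Drill uses for {{typeof()}}; this assumes {{FieldReader.getType()}} exposes the 
major type's mode, and the names and registration details here are illustrative, 
not an actual implementation:

{code}
@FunctionTemplate(name = "modeof", scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)
public static class ModeOf implements DrillSimpleFunc {
  @Param FieldReader input;
  @Output VarCharHolder out;
  @Inject DrillBuf buffer;

  public void setup() { }

  public void eval() {
    // DataMode is one of REQUIRED, OPTIONAL (nullable), or REPEATED
    byte[] mode = input.getType().getMode().name().getBytes();
    buffer = buffer.reallocIfNeeded(mode.length);
    buffer.setBytes(0, mode);
    out.buffer = buffer;
    out.start = 0;
    out.end = mode.length;
  }
}
{code}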


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6360) Document the typeof() function

2018-04-26 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6360:
--

 Summary: Document the typeof() function
 Key: DRILL-6360
 URL: https://issues.apache.org/jira/browse/DRILL-6360
 Project: Apache Drill
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.13.0
Reporter: Paul Rogers
Assignee: Bridget Bevens


Drill has a {{typeof()}} function that returns the data type (but not mode) of 
a column. It was discussed on the dev list recently. However, a search of the 
Drill web site, and a scan by hand, failed to turn up documentation about the 
function.

As a general suggestion, it would be great to have an alphabetical list of all 
functions so we don't have to hunt all over the site to find which functions 
are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-26 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184590500
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
    */
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * Actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

@ppadma, not much more to add. If the code requires you to do estimates based 
on no information, we won't get very good estimates. But if we know we have 0 
rows, then that itself is a good estimate of the size we'll need.

If there is a way to improve the estimate, I'm guessing you'll find it as 
work proceeds and you see more examples and test cases. 


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184590436
  
--- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java ---
@@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) {
       target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name);
     }
 
-    boolean overlimit = target.allocator.forceAllocate(size);
+    // Release first to handle the case where the current and target allocators were part of the same
+    // parent / child tree.
     allocator.releaseBytes(size);
+    boolean allocationFit = target.allocator.forceAllocate(size);
--- End diff --

In this case, changing the order of `forceAllocate()` and `releaseBytes()` 
is incorrect, as ownership is not changed but the old owner no longer accounts 
for that memory.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184589826
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

OK, but note that additions to the current code can also delete the `batch = 
null` assignment, and that `batch` can be used after the `release()` call. For 
example, with the current implementation, `batch.getHeader()` is perfectly 
valid after the batch was released.


---


[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...

2018-04-26 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1237
  
IMO, it is better not to report memory usage at all than to report a 
wrong number. When incoming batches are accumulated in a queue, they should 
be reported as owned by the receiver. Taking ownership just before passing a 
batch to the next operator does not sound right to me.

I don't think it is necessary to create a new fragment child allocator. 
The receiver allocator should be used instead of the fragment allocator when an 
incoming batch is placed into a queue.

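For concreteness, a minimal sketch of the enqueue-time transfer being suggested, 
reusing the `transferBodyOwnership()` method from this PR (`receiverAllocator` 
and `queue` are assumed names, not existing fields):

```
// Charge the receiver as soon as a batch is queued, so queued-but-unconsumed
// batches show up in the receiver's memory accounting.
public void enqueue(RawFragmentBatch batch) {
  queue.add(batch.transferBodyOwnership(receiverAllocator));
}
```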

---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184586222
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

No, there is no need to handle `IOException` twice. There are no other 
methods that throw `IOException`. `SchemaChangeException` is actually never 
thrown. Please see the *TODO* comment and DRILL-2933.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184585959
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Please explain why it can't be released. Note that the release is done for 
the batch that will *not* be returned from the method.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184585775
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Yes, it does matter. If something goes wrong during `startWait()`, there 
should be no call to `stopWait()`.

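A minimal sketch of the ordering being described, with `startWait()` outside 
the try block so that a failure inside it can never trigger a spurious 
`stopWait()`:

```
stats.startWait();
try {
  batch = getNextBatch();
  // ... skip empty batches ...
} finally {
  stats.stopWait(); // runs only if startWait() succeeded
}
```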

---


[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...

2018-04-26 Thread sachouche
Github user sachouche commented on the issue:

https://github.com/apache/drill/pull/1237
  
@vrozov, your observation is valid; we need more JIRAs to fix the reporting 
problem.
**Current Fix**
- Until now, the UnorderedReceiver didn't account for any consumed memory
- This fix taxes the operator only when it consumes buffers

**Potential Enhancements**
Solution I
- Create a new fragment child allocator which will own the received (not 
yet consumed) batches (see the sketch below)
- Improve the UI to report this allocator's size
- This solution is simple and complementary to the current work

Solution II
- Have the receiver operator drain the batch queue
- Essentially, the receiver will have a private queue from which to consume 
batches
- I personally don't like this solution, as it prevents us from improving 
the Drill network protocol
- Draining the batches for the sake of reporting will make it harder for 
the network layer to prefetch batches in an optimal manner; the queue size 
should be an indicator of how many pending batches there are. 

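As a sketch of Solution I, assuming Drill's 
`BufferAllocator.newChildAllocator(name, initReservation, maxAllocation)` API 
(the name and limits below are illustrative):

```
// Child allocator that owns received-but-not-yet-consumed batches; the UI
// could then report its allocated bytes separately from the operator's.
BufferAllocator receiverAllocator = fragmentContext.getAllocator()
    .newChildAllocator("unordered-receiver-queue", 0, Long.MAX_VALUE);
```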

---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184572864
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java
 ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws 
IOException {
 }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

Why is that? This method is used once, and doing so would make us duplicate 
exception-handling code. I would appreciate it if you provided reason(s) when 
you make suggestions, to avoid the back and forth.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184575686
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
+          && (!first || batch.getHeader().getDef().getFieldCount() == 0)) {
+        batch = getNextBatch();
+      }
+    } finally {
+      stats.stopWait();
+    }
+    return batch;
+  }
+
   @Override
   public IterOutcome next() {
     batchLoader.resetRecordCount();
     stats.startProcessing();
-    try{
-      RawFragmentBatch batch;
-      try {
-        stats.startWait();
-        batch = getNextBatch();
 
-        // skip over empty batches. we do this since these are basically control messages.
-        while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
-            && (!first || batch.getHeader().getDef().getFieldCount() == 0)) {
-          batch = getNextBatch();
-        }
-      } finally {
-        stats.stopWait();
-      }
+    RawFragmentBatch batch = null;
+    try {
 
+      batch = getNextNotEmptyBatch();
       first = false;
 
       if (batch == null) {
--- End diff --

Sure.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184574622
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

This pattern ensures that after release the batch cannot be accessed 
anymore (e.g., by future additions to the current code).


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184574405
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Sure. 

FYI - We cannot release the batch within this method.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread sachouche
Github user sachouche commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184573646
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Sure, but does it really matter or is it your personal preference?


---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread jiang-wu
Github user jiang-wu commented on the issue:

https://github.com/apache/drill/pull/1184
  
@parthchandra Just to clarify the JDBC comment.  What do you mean by 
"Json representation"? Do you instead mean the "Local[Date|Time]" class 
representation?  There is no JSON being returned from the JDBC layer.  It 
uses Java collections Map or List objects.  Inside the Map|List, the change 
in this pull request properly uses objects of different classes, 
Local[Date|Time|DateTime], to represent the various date/time/timestamp values.

So far so good.  Now, it is possible that in the future we may want to further 
translate the Local[Date|Time|DateTime] objects inside the Map|List to 
java.sql.[Date|Time|Timestamp] upon access.  But to do that inside the 
SqlAccessor, you would need to deep copy the Map|List and build another version 
with the date|time translated into java.sql.date|time.  That would seem like a 
lot of work for little gain. 

I would say let's hold off on that for now.  A few databases seem to be 
moving toward using a non-timezone-based representation in JDBC if the database 
does not support timezones: 
https://jdbc.postgresql.org/documentation/head/8-date-time.html  It would make 
sense to consider changing the class used after deciding what to do with 
Drill's handling of timezones. 


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184554299
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

Consider handling `IOException` inside `getNextNotEmptyBatch()`. 


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184558425
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Consider inverting the condition, something like 
```
  while (true) {
    RawFragmentBatch batch = getNextBatch();
    if (batch == null) {
      break;
    }
    RecordBatchDef recordBatchDef = batch.getHeader().getDef();
    if (recordBatchDef.getRecordCount() > 0 || (first && recordBatchDef.getFieldCount() > 0)) {
      return batch;
    }
    batch.release();
  }
```


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184554436
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }
 
+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Move outside of try/finally.


---


[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1237#discussion_r184559429
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

As far as I can see, setting `batch` to `null` has no impact here.


---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1184
  
Putting aside the discussion on date/time/timezone for the moment, 
@jiang-wu let's say getObject returns to you an object that implements 
java.sql.{Struct|Array}. You now use the Struct|Array apis to get the attribute 
you are interested in. If the attribute is of type date|time the object 
returned for that attribute should now correspond to java.sql.{Date|Time} 
instead of the Json representation. Will that not address your requirement?



---


[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1144
  
Looks good. +1


---


[GitHub] drill pull request #1234: DRILL-5927: Fixed memory leak in TestBsonRecordRea...

2018-04-26 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/1234#discussion_r184549758
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/bson/TestBsonRecordReader.java ---
@@ -45,21 +47,24 @@
 import org.bson.BsonTimestamp;
 import org.bson.BsonWriter;
 import org.bson.types.ObjectId;
-import org.junit.AfterClass;
-import org.junit.BeforeClass;
+import org.junit.After;
+import org.junit.Before;
 import org.junit.Test;
 
-public class TestBsonRecordReader extends BaseTestQuery {
-  private static VectorContainerWriter writer;
-  private static TestOutputMutator mutator;
-  private static BsonRecordReader bsonReader;
+public class TestBsonRecordReader {
+  private BufferAllocator allocator;
+  private VectorContainerWriter writer;
+  private TestOutputMutator mutator;
+  private BufferManager bufferManager;
+  private BsonRecordReader bsonReader;
 
-  @BeforeClass
-  public static void setUp() {
-    BufferAllocator bufferAllocator = getDrillbitContext().getAllocator();
-    mutator = new TestOutputMutator(bufferAllocator);
+  @Before
+  public void setUp() {
+    allocator = new RootAllocator(Long.MAX_VALUE);
--- End diff --

Any reason why you had to change to RootAllocator? Should not really be 
necessary.


---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1184
  
@parthchandra, the point about the birthday is that it is one of those 
dates that is implied relative to where you are. You celebrate it the same day 
regardless of where you are in the world. Same with an order date. So, a key 
problem is that dates are not relative to UTC; they are just dates. 
They become relative to UTC only when a time and timezone are supplied.

As @jiang-wu explained, storing in UTC is fine when times are absolute 
(date + time + tz). The problem is "2018-04-15" or even "2018-04-15 3 PM" is 
not an absolute: it is local and cannot be stored as UTC unless we know the TZ. 
Guessing that the TZ is that of the server really does not help, and actually 
produces wrong results when client and server timezones differ.

That's why the data structures need to support the data model @jiang-wu 
suggested:

* Date without TZ
* Time without a TZ
* Date/time without TZ, and
* Timestamp implied in UTC.

And, yes, it is because people abused the Java `Date` class that the Joda 
time classes were invented. We just need to have Drill types that parallel the 
Joda types. Granted, this is more than this fix can tackle, but the point 
stands. 

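As a sketch, the four types above map naturally onto java.time (the java.time 
analogues of the Joda classes; this illustrates the proposed model, not 
existing Drill code):

```
LocalDate date = LocalDate.parse("2018-04-15");                  // date without TZ
LocalTime time = LocalTime.of(15, 0);                            // time without TZ
LocalDateTime dateTime = date.atTime(time);                      // date/time without TZ
Instant timestamp = dateTime.atZone(ZoneOffset.UTC).toInstant(); // timestamp in UTC
```
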
Agreed that the issue of how to handle JDBC/ODBC needs to be resolved. Can 
we make up synthetic column names? Implicitly flatten the results so that 
"context.date" will pick out the "date" element within "context". This will 
allow JDBC to provide metadata and a reasonable type for each column, at the 
cost of potentially creating a very wide row (if you have deeply nested maps.) 
The "auto-flatten" option seems cleaner than any object-based format we make up.

A related solution is to fix `sqlline` so that it has a formatter other 
than `toString()` as @jiang-wu suggested. We register a format class and 
`sqlline` uses that to format, say, a Drill Map. That way we don't have to use 
a JSON object so that its `toString()` produces nice output. 


---


[GitHub] drill issue #1232: DRILL-6094: Decimal data type enhancements

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1232
  
+1 for the C++ part. Looks really good. 


---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread jiang-wu
Github user jiang-wu commented on the issue:

https://github.com/apache/drill/pull/1184
  
Actually, the JDBC representation is not the hard problem here.  I ran into most 
of the problems dealing with the timezones surrounding date|time|timestamp. 
java.sql.Struct and Array are interfaces and not actual classes, so the 
current JDBC behavior of returning Map|List objects for complex values is fine.  
You can just declare JsonStringHashMap implements Struct, and JsonStringArrayList 
implements Array.  

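A minimal sketch of that idea, with illustrative names and method bodies (not 
Drill's actual client classes):

```
import java.sql.Struct;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a LinkedHashMap-backed complex value that also satisfies
// the JDBC Struct interface, so getObject() callers can down-cast to Struct.
public class ComplexValueMap extends LinkedHashMap<String, Object> implements Struct {
  @Override
  public String getSQLTypeName() {
    return "MAP"; // assumed type name; Drill may report something else
  }

  @Override
  public Object[] getAttributes() {
    return values().toArray();
  }

  @Override
  public Object[] getAttributes(Map<String, Class<?>> typeMap) {
    return getAttributes(); // custom type mapping not supported in this sketch
  }
}
```
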
Now the harder issue.  The semantics of "date" and "time" are trickier 
compared to "timestamp".  Timestamp is understood to be an instant in time 
(the java/joda Instant class).  A timestamp is a single point in time and is not 
impacted by time zones.

Date and time can have two uses:

1) Logical date and time, which is not a fixed point or range in time.  E.g., 
2018-01-01 is New Year's Day, and this day happens over a different 24-hour 
window depending on which time zone you are in.  So this is a logical date, and 
we don't all celebrate the start of New Year's Day at the same time.

2) Date and time with offset/timezone -- this refers to a specific point or 
range in time.  This type of date/time is absolute.  E.g., the NYSE opens at 
"7:30 am EST".  Regardless of the timezone you are in, this time is the same for 
everyone.

The joda/java8 time packages have proper handling for 1) logical date and time 
(using the local date/time classes); 2) absolute date and time (using the offset 
date/time classes); and 3) timestamp (the Instant class).  Various methods exist 
for converting between these based on the heuristics you want to apply. 

I feel that for Drill, we need to first decide what behavior we want to 
support, then go from there.



---


[GitHub] drill issue #1227: DRILL-6236: batch sizing for hash join

2018-04-26 Thread ilooner
Github user ilooner commented on the issue:

https://github.com/apache/drill/pull/1227
  
@ppadma Please fix the Travis failure.


---


[GitHub] drill issue #1232: DRILL-6094: Decimal data type enhancements

2018-04-26 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1232
  
I would recommend we wait till we have all testing complete, since this is 
a big feature.


---


[GitHub] drill issue #1214: DRILL-6331: Revisit Hive Drill native parquet implementat...

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1214
  
Looks Good. Thanks for making the changes Arina.
+1


---


[GitHub] drill pull request #1225: DRILL-6272: Refactor dynamic UDFs and function ini...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1225#discussion_r184411700
  
--- Diff: exec/java-exec/pom.xml ---
@@ -593,6 +593,48 @@
       <artifactId>netty-tcnative</artifactId>
       <classifier>${netty.tcnative.classifier}</classifier>
     </dependency>
+    <dependency>
+      <groupId>org.apache.maven</groupId>
+      <artifactId>maven-embedder</artifactId>
+      <version>3.3.9</version>
--- End diff --

Consider using the latest release available.


---


[GitHub] drill pull request #1225: DRILL-6272: Refactor dynamic UDFs and function ini...

2018-04-26 Thread vrozov
Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1225#discussion_r184245358
  
--- Diff: pom.xml ---
@@ -798,7 +798,7 @@
       <dependency>
         <groupId>com.googlecode.jmockit</groupId>
         <artifactId>jmockit</artifactId>
-        <version>1.3</version>
+        <version>1.7</version>
--- End diff --

Can it be done as a precursor PR? Version 1.7 is quite old too. Can it be 
upgraded to the latest (org.jmockit:jmockit:1.39)?


---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1184
  
>  But, if April 15 is your birthday, it is your birthday in all timezones. 
We don't say your birthday (or order date, or newspaper issue date or...) is 
one day in, say London and another day in Los Angeles.

If it is your birthday in California, it may already be the day after your 
birthday in Japan. :)

IMO, representing dates, times, and timestamps as UTC is not the problem. 
It is, in fact, perfectly correct (since UTC is the timezone). Converting a 
date|time|timestamp without a timezone to/from UTC is the problem. The problem 
is made worse by the java.util and JDBC APIs. java.time gets it right though.

However, as Jiang-wu points out, that still does not address the mismatch 
between the Joda/Java8 representation and JDBC. It also does not address his 
original problem, the issue of how to represent a complex type in JDBC; just 
return an Object, it says, which is no help at all. It is even worse for ODBC, 
which (last I checked) did not even have an API to return an Object type (which 
is why in ODBC we return a JSON string representation). 
For Jiang-wu's use case, since the string representation is not enough, we 
might look at returning a java.sql.Struct [1] type for Maps and java.sql.Array 
[2] types.

[1] https://docs.oracle.com/javase/7/docs/api/java/sql/Struct.html
[2] https://docs.oracle.com/javase/7/docs/api/java/sql/Array.html






---


Re: Display column data type without code

2018-04-26 Thread Paul Rogers
Thanks!

As it turns out, typeof() is not documented on the Apache Drill website, which is 
why I was confused. Maybe we should document it.

I just tried typeof(). It turns out it only returns the type, not the nullability 
or repeated mode. Is there a separate function to get the "mode" information?

- Paul

 

On Wednesday, April 25, 2018, 10:09:34 PM PDT, Aman Sinha 
 wrote:  
 
 You can do it through SQL using the typeof() function.  Since there is no
global schema, Drill evaluates this for each row.

0: jdbc:drill:drillbit=10.10.101.41> select n_name, typeof(n_name) as
name_type, n_nationkey, typeof(n_nationkey) as nationkey_type from
cp.`tpch/nation.parquet` limit 2;

+------------+------------+--------------+-----------------+
|   n_name   | name_type  | n_nationkey  | nationkey_type  |
+------------+------------+--------------+-----------------+
| ALGERIA    | VARCHAR    | 0            | INT             |
| ARGENTINA  | VARCHAR    | 1            | INT             |
+------------+------------+--------------+-----------------+


On Wed, Apr 25, 2018 at 9:22 PM, Paul Rogers 
wrote:

> Hi All,
> Anyone know if there is a non-code way to display the data types of
> columns returned from a Drill query? Sqlline appears to only show the
> column names and values. The same is true of the Drill web console.
> The EXPLAIN PLAN FOR ... command shows the query plan, but not type (which
> are only known at run time.) Is there a statement, system table or some
> other trick to display column types in, say, Sqlline?
> In the past, I've gotten the types by using unit test style code. But,
> that is not too handy for use as an example for non-developers...
> Thanks,
> - Paul
>
>
  

[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-26 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184483395
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
*/
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

@paul-rogers yes, this PR is trying to do the second thing, i.e., make a best 
guess with no knowledge at all about lengths. Also, the right side can produce no 
rows for reasons other than an empty array; for example, we might be filtering 
everything out after unnest.  
Please let me know what you think about the approach.


---


Re: Display column data type without code

2018-04-26 Thread Rob Wu
Hi Paul,

You could also use DESCRIBE (https://drill.apache.org/docs/describe/).

0: jdbc:drill:drillbit=localhost:31010> describe
`hive.default`.`integer_table`
. . . . . . . . . . . . . . . . . . . > ;
+--------------+--------------------+--------------+
| COLUMN_NAME  | DATA_TYPE          | IS_NULLABLE  |
+--------------+--------------------+--------------+
| keycolumn    | CHARACTER VARYING  | YES          |
| column1      | INTEGER            | YES          |
+--------------+--------------------+--------------+

Best regards,

Rob

On Wed, Apr 25, 2018 at 10:12 PM, Abhishek Girish 
wrote:

> Hey Paul,
>
> You could use the typeof() function for this purpose. It takes a single
> parameter - the column name.
>
> For example:
> > select typeof(c_current_cdemo_sk) from customer limit 1;
> +-+
> | EXPR$0  |
> +-+
> | BIGINT  |
> +-+
> 1 row selected (0.472 seconds)
>
>
> On Wed, Apr 25, 2018 at 9:23 PM Paul Rogers 
> wrote:
>
> > Hi All,
> > Anyone know if there is a non-code way to display the data types of
> > columns returned from a Drill query? Sqlline appears to only show the
> > column names and values. The same is true of the Drill web console.
> > The EXPLAIN PLAN FOR ... command shows the query plan, but not type
> (which
> > are only known at run time.) Is there a statement, system table or some
> > other trick to display column types in, say, Sqlline?
> > In the past, I've gotten the types by using unit test style code. But,
> > that is not too handy for use as an example for non-developers...
> > Thanks,
> > - Paul
> >
> >
>


[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable

2018-04-26 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1238
  
There is not enough info available to debug and/or troubleshoot DRILL-5908, 
and instead of trying to find bugs in a homegrown solution, I prefer to replace 
it with Java's out-of-the-box functionality and, at the same time, provide the 
ability to log enough information to do an RCA for DRILL-5908.

IMO, there are no unreasonable requests on a PR/JIRA 😄. 


---


[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable

2018-04-26 Thread parthchandra
Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/1238
  
Fair enough. But that *still* does not give me a clue about the problem(s) 
you were trying to fix, or how the refactoring helps. Is the cause of the 
problem in TimedRunnable? 
Not an unreasonable request, is it?



---


[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable

2018-04-26 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1238
  
I did not change how tasks (`Runnable` or `Callable`) behave and did not 
look into converting `Callable`/`Runnable` to a `ForkJoinTask`. Whether existing 
tasks can be scheduled recursively or not depends on the nature of those tasks 
and is outside the scope of this PR. I'd suggest filing a JIRA, if one does not 
exist already, to look into `Fork/Join` (this is what I would expect from the 
developer who put in the "TODO").


---


[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable

2018-04-26 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1238
  
The step is necessary to do an RCA for DRILL-5908. There are way too many 
issues with the current implementation to list them in the JIRA or PR, and the 
major issue is the usage of homegrown solutions where Java (or other 3rd-party 
libraries) already provides the required functionality out of the box. There is 
no need to use `Runnable` instead of `Callable` and provide custom `Callable` 
functionality. It is not necessary to wait on a `CountDownLatch` when 
`ExecutorService` provides the ability to invoke all tasks and return results 
when all tasks complete or a timeout expires.

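For reference, a minimal sketch of that out-of-the-box pattern (names are 
illustrative): `invokeAll()` blocks until every task completes or the timeout 
expires, cancelling the stragglers, so no homegrown latch is required.

```
ExecutorService executor = Executors.newFixedThreadPool(parallelism);
List<Future<Result>> futures = executor.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
for (Future<Result> future : futures) {
  if (future.isCancelled()) {
    // this task did not finish before the timeout
  }
}
```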

---


[jira] [Resolved] (DRILL-4156) Parquet group converter cannot cope with ENUM original type values

2018-04-26 Thread Arina Ielchiieva (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva resolved DRILL-4156.
-
Resolution: Fixed

Fixed in DRILL-5971.

> Parquet group converter cannot cope with ENUM original type values 
> ---
>
> Key: DRILL-4156
> URL: https://issues.apache.org/jira/browse/DRILL-4156
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Chris Jansen
>Priority: Major
>  Labels: easyfix, newbie, patch
> Fix For: 1.13.0
>
> Attachments: parquet_enum.patch, parquet_enum_v2.patch
>
>
> When importing parquet data written using a Thrift schema, Drill cannot cope 
> with ENUMs. This was fixed elsewhere in DRILL-1775 but not in the 
> DrillParquetGroupConverter class ([see mailing list 
> thread|https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAK55V=+VjPAehMiJ8U=crh182k4zvhgfxzfwznaa9edwzwg...@mail.gmail.com%3E]).
> It appears that the ParquetToDrillTypeConverter class was updated to use a 
> varbinary as its default behaviour, but the DrillParquetGroupConverter class 
> was not updated to do the same. 
> {code}
> Caused by: java.lang.UnsupportedOperationException: Unsupported type ENUM
>   at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.getConverterForType(DrillParquetGroupConverter.java:249)
>   at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:154)
>   at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
>   at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
>   at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
>   at org.apache.drill.exec.store.parquet2.DrillParquetRecordMaterializer.<init>(DrillParquetRecordMaterializer.java:40)
>   at org.apache.drill.exec.store.parquet2.DrillParquetReader.setup(DrillParquetReader.java:267)
>   ... 14 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread jiang-wu
Github user jiang-wu commented on the issue:

https://github.com/apache/drill/pull/1184
  
Yes.  There are at least two issues.  

One is about how Drill represents Date, Time, and Timestamp internally using a 
UTC-based instant representation and fudges the timezone in order to make the 
non-time-zone fields look right, which Paul outlined nicely above.  To 
really fix this, one would need to first define how Drill wishes to handle 
Date, Time, and Timestamp: e.g., no time zone at all, time zone aware but not 
preserving, time zone aware and preserving, etc.  One can look at how databases 
handle time zones to get some inspiration.  This part is too ambitious for me 
to fix here.

The second is to obtain a complex object value from the JDBC interface.  Drill 
doesn't make a JSON object in order to send the data to a JDBC interface.  It 
looks like the JDBC interface is simply an alternative accessor to vector data 
being transmitted from the server side to the client side.  Once the vector 
data arrives on the client side, the JDBC layer builds a Map (or List) object 
by reading the values from the vectors.  The issue I found was that inside this 
process, date|time|timestamp values from their respective vector classes were 
all represented using the same Java class, hence losing their type information 
altogether when the Java object is placed inside the Map|List.

So a fix for this part is simple (in concept): just use different Java 
types.  Except, when making this change, the first issue popped up, as one has 
to figure out how to make Date and Time out of UTC instant values, which are 
the fudged values from the existing logic.

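For illustration, the "fudge" described above looks roughly like this in Joda 
terms (mirroring the `FixedValueVectors` code quoted earlier; `millis` stands 
in for the stored vector value):

```
// Stored value is interpreted as UTC, then the instant is shifted so the
// wall-clock fields "look right" in the client's default zone.
org.joda.time.DateTime stored =
    new org.joda.time.DateTime(millis, org.joda.time.DateTimeZone.UTC);
org.joda.time.DateTime fudged =
    stored.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
```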

---


[GitHub] drill issue #1214: DRILL-6331: Revisit Hive Drill native parquet implementat...

2018-04-26 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1214
  
@parthchandra added two new commits:
1. Reverted the custom Stopwatch implementation and used logger checks instead (a sketch of the pattern follows below).
2. Allowed creating several non-tracking file systems but only one tracking fs 
per operator context (details in the prior discussion).

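The logger-check pattern in (1) is roughly this (a sketch using Guava's 
Stopwatch and slf4j, not the exact committed code):

```
// Pay the timing cost only when the log level makes it visible.
Stopwatch timer = logger.isDebugEnabled() ? Stopwatch.createStarted() : null;
// ... perform the timed work ...
if (timer != null) {
  logger.debug("took {} ms", timer.elapsed(TimeUnit.MILLISECONDS));
}
```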

---


[GitHub] drill pull request #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] cl...

2018-04-26 Thread jiang-wu
Github user jiang-wu commented on a diff in the pull request:

https://github.com/apache/drill/pull/1184#discussion_r184424638
  
--- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java ---
@@ -509,15 +509,15 @@ public long getTwoAsLong(int index) {
     public ${friendlyType} getObject(int index) {
       org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC);
       date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
-      return date;
+      return new java.sql.Date(date.getMillis());
--- End diff --

The string representations of LocalDateTime and Timestamp are not 
exactly the same, but that is potentially fixable, since we can alter the way 
the values are displayed via formatters.  Though JDBC is not just for getting 
string representations; it is in the programmatic use cases, where we are 
getting the value objects, that one would see the disconnect in the data types. 
 I will try this out with a use case I have with programmatic JDBC access and see 
what the impacts are of different types for the same expected value.


---


[GitHub] drill pull request #1214: DRILL-6331: Revisit Hive Drill native parquet impl...

2018-04-26 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/1214#discussion_r184401600
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/BaseOperatorContext.java ---
@@ -158,25 +159,26 @@ public void close() {
     } catch (RuntimeException e) {
       ex = ex == null ? e : ex;
     }
-    try {
-      if (fs != null) {
+
+    for (DrillFileSystem fs : fileSystems) {
+      try {
         fs.close();
-        fs = null;
-      }
-    } catch (IOException e) {
+      } catch (IOException e) {
         throw UserException.resourceError(e)
-          .addContext("Failed to close the Drill file system for " + getName())
-          .build(logger);
+            .addContext("Failed to close the Drill file system for " + getName())
+            .build(logger);
+      }
     }
+
     if (ex != null) {
       throw ex;
     }
   }
 
   @Override
   public DrillFileSystem newFileSystem(Configuration conf) throws IOException {
-    Preconditions.checkState(fs == null, "Tried to create a second FileSystem. Can only be called once per OperatorContext");
-    fs = new DrillFileSystem(conf, getStats());
+    DrillFileSystem fs = new DrillFileSystem(conf, getStats());
--- End diff --

When the `AbstractParquetScanBatchCreator.getBatch` method is called, it 
receives one operator context, which used to allow creating only one file 
system. It also receives an `AbstractParquetRowGroupScan` which contains several 
row groups. Row groups may belong to different files. For Drill parquet files, 
we create only one fs and use it to create readers for each row group. 
That's why it was fine when the operator context allowed creating only one fs. 
But we needed to adjust this for Hive files. For Hive we need to create an fs 
for each file (since the config for each file system is different and created 
using the projection pusher), which is why I had to change the operator context 
to allow more than one file system. I have also introduced 
`AbstractDrillFileSystemManager`, which controls the number of file systems 
created. `ParquetDrillFileSystemManager` creates only one (as was done before). 
`HiveDrillNativeParquetDrillFileSystemManager` creates an fs for each file, so 
when two row groups belong to the same file, they will get the same fs.

But I agree that for a tracking fs (i.e., when 
store.parquet.reader.pagereader.async is set to false) this will create a mess 
in the calculations. So I suggest the following fix: for Hive we'll always 
create a non-tracking fs; for Drill it will depend on the 
store.parquet.reader.pagereader.async option. Also, I'll add checks in the 
operator context to disallow creating more than one tracking fs, or creating a 
tracking fs at all when non-tracking one(s) are already created (a sketch of 
these checks follows below).

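A minimal sketch of those checks (illustrative names, not the final code), 
assuming a `DrillFileSystem` constructed with null stats does no tracking:

```
public DrillFileSystem newFileSystem(Configuration conf) throws IOException {
  // Disallow a second tracking fs, and disallow a tracking fs once any
  // non-tracking fs exists (its stats would be incomplete).
  Preconditions.checkState(!createdTrackingFs, "Only one tracking file system per operator context");
  Preconditions.checkState(fileSystems.isEmpty(), "A non-tracking file system already exists");
  DrillFileSystem fs = new DrillFileSystem(conf, getStats()); // tracking
  createdTrackingFs = true;
  fileSystems.add(fs);
  return fs;
}

public DrillFileSystem newNonTrackingFileSystem(Configuration conf) throws IOException {
  DrillFileSystem fs = new DrillFileSystem(conf, null); // null stats => no tracking
  fileSystems.add(fs);
  return fs;
}
```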

---


[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...

2018-04-26 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/1184
  
One additional note. We noted that JDBC does not support the idea of a 
nested tuple (a Drill "map".) JDBC does support columns that return a Java 
object. To bridge the gap, Drill returns a Map column as a Java object.

But, why a JSON object? The answer seems to lie with the `sqlline` program. 
If we query a one-line JSON file with a nested object in `sqlline`, we get the 
following display:

```
SELECT * FROM `json/nested.json`;
+-+--+
| custId  |   name   |
+-+--+
| 101 | {"first":"John","last":"Smith"}  |
+-+--+
```

So, it seems likely that the value of a Map object was translated to a 
JSON object so that when `sqlline` calls `toString()` on it, it ends up 
formatted nicely as above.

Because of this, it may be hard to change the kind of objects returned from 
JDBC for a Map column.


---