[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184596895

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RawFragmentBatch.java ---
@@ -77,4 +83,46 @@ public long getByteCount() {

   public boolean isAckSent() { return ackSent.get(); }
+
+  /**
+   * Transfer ownership of this DrillBuf to the target allocator. This is done for better memory
+   * accounting (that is, the operator should be charged with the body's DrillBuf memory).
+   *
+   * NOTES -
+   * This operation is a NOOP when a) the current allocator (associated with the DrillBuf) is not the
+   * owning allocator or b) the target allocator is already the owner.
+   * When a transfer happens, a new RawFragmentBatch instance is allocated; this is done for proper
+   * DrillBuf reference count accounting.
+   * The RPC handling code caches a reference to this RawFragmentBatch object instance; release()
+   * calls should be routed to the previous DrillBuf.
+   *
+   * @param targetAllocator target allocator
+   * @return a new {@link RawFragmentBatch} object instance on success (where the buffer ownership has
+   *         been switched to the target allocator); otherwise this operation is a NOOP (the current
+   *         instance is returned)
+   */
+  public RawFragmentBatch transferBodyOwnership(BufferAllocator targetAllocator) {
+    if (body == null) {
+      return this; // NOOP
+    }
+
+    if (!body.getLedger().isOwningLedger()
+        || body.getLedger().isOwner(targetAllocator)) {
+      return this;
+    }
+
+    int writerIndex = body.writerIndex();
+    TransferResult transferResult = body.transferOwnership(targetAllocator);
+
+    // Set the index and increment the reference count
+    transferResult.buffer.writerIndex(writerIndex);
+
+    // Release the current DrillBuf since the caller will perform release() on the new one
+    body.release();
+
+    return new RawFragmentBatch(getHeader(), transferResult.buffer, getSender(), false);
--- End diff --

This actually raises the question of why `newRawFragmentBatch` is released in `IncomingBuffers.batchArrived()` instead of releasing `transferredBuffer` after `RawFragmentBatch` is constructed in `newRawFragmentBatch`.

---
[jira] [Created] (DRILL-6362) typeof() lies about types
Paul Rogers created DRILL-6362:
--
Summary: typeof() lies about types
Key: DRILL-6362
URL: https://issues.apache.org/jira/browse/DRILL-6362
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Paul Rogers

Drill provides a {{typeof()}} function that returns the type of a column. But, it seems to make up types. Consider the following input file:

{noformat}
{a: true}
{a: false}
{a: null}
{noformat}

Consider the following two queries:

{noformat}
SELECT a FROM `json/boolean.json`;
+--------+
|   a    |
+--------+
| true   |
| false  |
| null   |
+--------+

> SELECT typeof(a) FROM `json/boolean.json`;
+---------+
| EXPR$0  |
+---------+
| BIT     |
| BIT     |
| NULL    |
+---------+
{noformat}

Notice that the values are reported as BIT. But, I believe the actual type is UInt1 (the bit vector is, I believe, deprecated). Then, the function reports NULL instead of the actual type for the null value. Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} to muddle the type.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (DRILL-6361) Provide a dataTypeOf() or modeOf() function
Paul Rogers created DRILL-6361:
--
Summary: Provide a dataTypeOf() or modeOf() function
Key: DRILL-6361
URL: https://issues.apache.org/jira/browse/DRILL-6361
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Paul Rogers

Drill provides a {{typeof()}} function to return the type of a column. The returned string, however, has only the base data type. A Drill data type (a "major type") also includes a cardinality (a "mode"): for example, {{Optional Int}} or {{Required VarChar}}. This type information is useful for handling data conversions. For example, if I could tell that a column value was a {{Nullable Int}}, I could guess that it is one Drill invented, and I could merge it, by hand, with the type from another file that had actual values. The two options are equivalent: either provide a {{modeOf()}} that returns just the cardinality, or a {{dataTypeOf()}} that returns both. (Perhaps {{modeOf()}} would be the more useful of the two.)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (DRILL-6360) Document the typeof() function
Paul Rogers created DRILL-6360:
--
Summary: Document the typeof() function
Key: DRILL-6360
URL: https://issues.apache.org/jira/browse/DRILL-6360
Project: Apache Drill
Issue Type: Improvement
Components: Documentation
Affects Versions: 1.13.0
Reporter: Paul Rogers
Assignee: Bridget Bevens

Drill has a {{typeof()}} function that returns the data type (but not the mode) of a column. It was discussed on the dev list recently. However, a search of the Drill web site, and a scan by hand, failed to turn up documentation for the function. As a general suggestion, it would be great to have an alphabetical list of all functions so we don't have to hunt all over the site to find which functions are available.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/1228#discussion_r184590500

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
    */
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

@ppadma, not much more to add. If the code requires you to make estimates based on no information, we won't get very good estimates. But, if we know we have 0 rows, then that itself is a good estimate of the size we'll need. If there is a way to improve the estimate, I'm guessing you'll find it as work proceeds and you see more examples and test cases.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184590436

--- Diff: exec/memory/base/src/main/java/org/apache/drill/exec/memory/AllocationManager.java ---
@@ -253,10 +261,12 @@ public boolean transferBalance(final BufferLedger target) {
       target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name);
     }

-    boolean overlimit = target.allocator.forceAllocate(size);
+    // Release first to handle the case where the current and target allocators were part of the same
+    // parent / child tree.
     allocator.releaseBytes(size);
+    boolean allocationFit = target.allocator.forceAllocate(size);
--- End diff --

In this case, changing the order of `forceAllocate()` and `releaseBytes()` is incorrect, as ownership is not changed, but the old owner no longer accounts for that memory.

---
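The ordering question in the exchange above can be illustrated with a toy model. This is a hypothetical sketch, not Drill's actual `AllocationManager`: `ToyAllocator` below stands in for a single parent budget that both sides of a transfer draw on, and the method names (`forceAllocate`, `releaseBytes`) merely mirror those in the diff.

```java
// Hypothetical sketch (not Drill's AllocationManager): one parent budget
// shared by the source and target of an ownership transfer. If the target
// allocates before the source releases, the bytes are momentarily
// double-counted and can spuriously exceed the limit.
final class ToyAllocator {
    private final long limit;
    private long used;

    ToyAllocator(long limit) { this.limit = limit; }

    /** Charges the bytes unconditionally; returns whether they fit the limit. */
    boolean forceAllocate(long size) {
        used += size;
        return used <= limit;
    }

    void releaseBytes(long size) { used -= size; }

    long used() { return used; }
}

public class TransferOrderDemo {
    public static void main(String[] args) {
        ToyAllocator parent = new ToyAllocator(100);
        parent.forceAllocate(100);          // all 100 bytes charged to the source side

        // Allocate-then-release: the parent briefly sees 200 bytes -> "over limit".
        boolean fitAllocateFirst = parent.forceAllocate(100);
        parent.releaseBytes(100);

        // Release-then-allocate: the parent never exceeds its budget.
        parent.releaseBytes(100);
        boolean fitReleaseFirst = parent.forceAllocate(100);

        System.out.println(fitAllocateFirst + " " + fitReleaseFirst); // false true
    }
}
```

The sketch only shows why the PR's comment argues for releasing first within a shared tree; it does not model the ownership-unchanged case vrozov raises, where the reordering leaves the old owner under-accounted.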
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184589826

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

OK, but note that additions to the current code can also delete the `batch = null` assignment, and that `batch` can be used after the `release()` call. For example, with the current implementation, `batch.getHeader()` is perfectly valid after the batch was released.

---
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user vrozov commented on the issue:
https://github.com/apache/drill/pull/1237

IMO, it is better not to report memory usage at all than to report a wrong number. In case incoming batches are accumulated in a queue, they should be reported as owned by the receiver. Taking ownership just before passing a batch to the next operator does not sound right to me. I don't think it is necessary to create a new fragment child allocator; the receiver allocator should be used instead of the fragment allocator when an incoming batch is placed into a queue.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184586222

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

No, there is no need to handle `IOException` twice. There are no other methods that throw `IOException`. `SchemaChangeException` is actually never thrown. Please see the *TODO* comment and DRILL-2933.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184585959

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Please explain why it can't be released. Note that the release is done for the batch that will *not* be returned from the method.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184585775

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Yes, it does matter. If something goes wrong during `startWait()`, there should be no call to `stopWait()`.

---
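The pairing rule in the comment above can be sketched with a stand-in stats object. This is illustrative only: `ToyStats` is not Drill's `OperatorStats`; it merely enforces that `stopWait()` is legal only after a successful `startWait()`.

```java
// Illustrative stand-in for the stats object in the review discussion:
// stopWait() may only run after a successful startWait().
final class ToyStats {
    private boolean waiting;

    void startWait() {
        if (waiting) { throw new IllegalStateException("already waiting"); }
        waiting = true;
    }

    void stopWait() {
        if (!waiting) { throw new IllegalStateException("not waiting"); }
        waiting = false;
    }
}

public class WaitPairingDemo {
    // The shape being requested: startWait() before the try block, so a
    // failure inside startWait() never triggers the stopWait() in finally.
    static void timedCall(ToyStats stats, Runnable work) {
        stats.startWait();      // if this throws, the finally below is never entered
        try {
            work.run();
        } finally {
            stats.stopWait();   // runs exactly once per successful startWait()
        }
    }
}
```

If `startWait()` were instead the first statement inside the try, an exception thrown by it would still run `stopWait()` in the finally block against a timer that never started.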
[GitHub] drill issue #1237: DRILL-6348: Fixed code so that Unordered Receiver reports...
Github user sachouche commented on the issue:
https://github.com/apache/drill/pull/1237

@vrozov, your observation is valid; we need more JIRAs to fix the reporting problem.

**Current Fix**
- At this time, the UnorderedReceiver didn't account for any consumed memory
- This fix taxes the operator only when it consumes buffers

**Potential Enhancements**

Solution I
- Create a new fragment child Allocator which will own the received (not yet consumed) batches
- Improve the UI to report this allocator's size
- This solution is simple and complementary to the current work

Solution II
- Have the receiver operator drain the batch queue
- Essentially, the receiver will have a private queue from which to consume batches
- I personally don't like this solution, as it prevents us from improving the Drill network protocol
- Draining the batches for the sake of reporting will make it harder for the network layer to prefetch batches in an optimal manner; the queue size should be an indicator of how many pending batches there are

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184572864

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

Why is that? This method is used once, and doing so will make us duplicate exception handling code. I would appreciate it if you provided reason(s) when you make suggestions, to avoid the back and forth.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184575686

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
+          && (!first || batch.getHeader().getDef().getFieldCount() == 0)) {
+        batch = getNextBatch();
+      }
+    } finally {
+      stats.stopWait();
+    }
+    return batch;
+  }
+
   @Override
   public IterOutcome next() {
     batchLoader.resetRecordCount();
     stats.startProcessing();
-    try{
-      RawFragmentBatch batch;
-      try {
-        stats.startWait();
-        batch = getNextBatch();
-        // skip over empty batches. we do this since these are basically control messages.
-        while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
-            && (!first || batch.getHeader().getDef().getFieldCount() == 0)) {
-          batch = getNextBatch();
-        }
-      } finally {
-        stats.stopWait();
-      }
+    RawFragmentBatch batch = null;
+    try {
+      batch = getNextNotEmptyBatch();
       first = false;

       if (batch == null) {
--- End diff --

Sure.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184574622

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

This pattern ensures that after release the batch cannot be accessed anymore (e.g., additions to the current code).

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184574405

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Sure. FYI - We cannot release the batch within this method.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user sachouche commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184573646

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Sure, but does it really matter, or is it your personal preference?

---
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user jiang-wu commented on the issue:
https://github.com/apache/drill/pull/1184

@parthchandra Just to clarify on the JDBC comment: what do you mean by "Json representation"? Do you instead mean the "Local[Date|Time]" class representation? There is no "Json" being returned from the JDBC layer. It uses Java collection Map or List objects. Inside the Map|List, the change in this pull request properly uses objects of different classes: Local[Date|Time|DateTime] to represent the various date/time/timestamp values. So far so good.

Now, it is possible that in the future we may want to further translate the Local[Date|Time|DateTime] objects inside the Map|List to java.sql.[Date|Time|Timestamp] upon access. But to do that inside the SqlAccessor, you would need to deep copy the Map|List and build another version with the date|time translated into java.sql.[Date|Time]. That would seem like a lot of work for little gain, so I would say let's hold off on that for now. A few databases seem to be moving toward using non-timezone-based representations in JDBC if the database does not support timezones: https://jdbc.postgresql.org/documentation/head/8-date-time.html. It would make sense to consider changing the class used after deciding what to do with Drill's handling of timezones.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184554299

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
--- End diff --

Consider handling `IOException` inside `getNextNotEmptyBatch()`.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184558425

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
+      batch = getNextBatch();
+
+      // skip over empty batches. we do this since these are basically control messages.
+      while (batch != null && batch.getHeader().getDef().getRecordCount() == 0
--- End diff --

Consider reverting the condition, something like

```
while (true) {
  RawFragmentBatch batch = getNextBatch();
  if (batch == null) {
    break;
  }
  RecordBatchDef recordBatchDef = batch.getHeader().getDef();
  if (recordBatchDef.getRecordCount() > 0 || (first && recordBatchDef.getFieldCount() > 0)) {
    return batch;
  }
  batch.release();
}
```

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184554436

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -149,25 +149,32 @@ private RawFragmentBatch getNextBatch() throws IOException {
     }
   }

+  private RawFragmentBatch getNextNotEmptyBatch() throws IOException {
+    RawFragmentBatch batch;
+    try {
+      stats.startWait();
--- End diff --

Move outside of try/finally.

---
[GitHub] drill pull request #1237: DRILL-6348: Fixed code so that Unordered Receiver ...
Github user vrozov commented on a diff in the pull request:
https://github.com/apache/drill/pull/1237#discussion_r184559429

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java ---
@@ -201,6 +208,11 @@ public IterOutcome next() {
       context.getExecutorState().fail(ex);
       return IterOutcome.STOP;
     } finally {
+
+      if (batch != null) {
+        batch.release();
+        batch = null;
--- End diff --

As far as I can see setting `batch` to `null` has no impact here.

---
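Both positions in this sub-thread can be shown with a toy stand-in. This is a hypothetical sketch, not Drill's `RawFragmentBatch`: for a local variable that goes out of scope right after the finally block, the null-out is a no-op for correctness, and its only value is defensive, since (as noted above) accessor methods remain callable on a released batch.

```java
// Hypothetical stand-in for RawFragmentBatch: release() drops a reference
// count but, as the review notes, accessors still "work" afterwards.
final class ToyBatch {
    private int refCount = 1;

    void release() { refCount--; }

    /** Stands in for calls like getHeader() that remain callable after release. */
    String header() { return "header"; }

    int refCount() { return refCount; }
}

public class ReleaseNullingDemo {
    public static void main(String[] args) {
        ToyBatch batch = new ToyBatch();
        try {
            // ... consume the batch ...
        } finally {
            if (batch != null) {
                batch.release();
                batch = null;   // defensive only: later code additions fail fast
                                // with an NPE instead of touching a released batch
            }
        }
        // Without the null-out, the line below would silently run on a released
        // batch; with it, the mistake surfaces immediately as a NullPointerException.
        // System.out.println(batch.header());
    }
}
```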
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1184 Putting aside the discussion on date/time/timezone for the moment, @jiang-wu let's say getObject returns to you an object that implements java.sql.{Struct|Array}. You now use the Struct|Array apis to get the attribute you are interested in. If the attribute is of type date|time the object returned for that attribute should now correspond to java.sql.{Date|Time} instead of the Json representation. Will that not address your requirement? ---
[GitHub] drill issue #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsException ...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1144 Looks good. +1 ---
[GitHub] drill pull request #1234: DRILL-5927: Fixed memory leak in TestBsonRecordRea...
Github user parthchandra commented on a diff in the pull request:
https://github.com/apache/drill/pull/1234#discussion_r184549758

--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/store/bson/TestBsonRecordReader.java ---
@@ -45,21 +47,24 @@
 import org.bson.BsonTimestamp;
 import org.bson.BsonWriter;
 import org.bson.types.ObjectId;
-import org.junit.AfterClass;
-import org.junit.BeforeClass;
+import org.junit.After;
+import org.junit.Before;
 import org.junit.Test;

-public class TestBsonRecordReader extends BaseTestQuery {
-  private static VectorContainerWriter writer;
-  private static TestOutputMutator mutator;
-  private static BsonRecordReader bsonReader;
+public class TestBsonRecordReader {
+  private BufferAllocator allocator;
+  private VectorContainerWriter writer;
+  private TestOutputMutator mutator;
+  private BufferManager bufferManager;
+  private BsonRecordReader bsonReader;

-  @BeforeClass
-  public static void setUp() {
-    BufferAllocator bufferAllocator = getDrillbitContext().getAllocator();
-    mutator = new TestOutputMutator(bufferAllocator);
+  @Before
+  public void setUp() {
+    allocator = new RootAllocator(Long.MAX_VALUE);
--- End diff --

Any reason why you had to change to RootAllocator? Should not really be necessary.

---
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/1184

@parthchandra, the point about the birthday is that it is one of those dates that is implied relative to where you are. You celebrate it the same day regardless of where you are in the world. Same with an order date. So, a key problem is that, for dates, they are not relative to UTC; they are just dates. They become relative to UTC only when a time and timezone are supplied.

As @jiang-wu explained, storing in UTC is fine when times are absolute (date + time + tz). The problem is that "2018-04-15" or even "2018-04-15 3 PM" is not an absolute: it is local and cannot be stored as UTC unless we know the TZ. Guessing that the TZ is that of the server really does not help, and actually produces wrong results when the client and server timezones differ.

That's why the data structures need to support the data model @jiang-wu suggested:
* Date without TZ
* Time without a TZ
* Date/time without TZ, and
* Timestamp implied in UTC.

And, yes, it is because people abused the Java `Date` class that the Joda time classes were invented. We just need to have Drill types that parallel the Joda types. Granted, this is more than this fix can tackle, but the point stands.

Agreed that the issue of how to handle JDBC/ODBC needs to be resolved. Can we make up synthetic column names? Implicitly flatten the results so that "context.date" will pick out the "date" element within "context"? This will allow JDBC to provide metadata and a reasonable type for each column, at the cost of potentially creating a very wide row (if you have deeply nested maps). The "auto-flatten" option seems cleaner than any object-based format we make up.

A related solution is to fix `sqlline` so that it has a formatter other than `toString()`, as @jiang-wu suggested. We register a format class and `sqlline` uses that to format, say, a Drill Map. That way we don't have to use a JSON object so that its `toString()` produces nice output.

---
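The four-type data model proposed above maps directly onto java.time (the modern parallel to the Joda types mentioned). The sketch below is illustrative only, not Drill's type system; the birthday example shows why a date cannot simply be stored as a UTC instant.

```java
// Sketch mapping the proposed Drill date/time semantics onto java.time.
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneId;

public class DateModelDemo {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2018, 4, 15);             // date without TZ
        LocalTime time = LocalTime.of(15, 0);                   // time without TZ
        LocalDateTime dateTime = LocalDateTime.of(date, time);  // date/time without TZ
        Instant timestamp = Instant.parse("2018-04-15T15:00:00Z"); // timestamp, implied UTC

        // The same instant is a different *local* date in different zones,
        // which is why a birthday cannot be represented as an Instant:
        LocalDate inLondon = timestamp.atZone(ZoneId.of("Europe/London")).toLocalDate();
        LocalDate inTokyo  = timestamp.atZone(ZoneId.of("Asia/Tokyo")).toLocalDate();
        System.out.println(inLondon + " " + inTokyo); // 2018-04-15 2018-04-16
    }
}
```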
[GitHub] drill issue #1232: DRILL-6094: Decimal data type enhancements
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1232 +1 for the C++ part. Looks really good. ---
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user jiang-wu commented on the issue:
https://github.com/apache/drill/pull/1184

Actually, the JDBC representation is not the hard problem here. I ran into most of the problems dealing with the timezones surrounding date|time|timestamp.

java.sql.Struct and Array are interfaces and not actual classes, so the current JDBC behavior of returning Map|List objects for complex values is fine. You can just declare JsonStringHashMap implements Struct, and JsonStringArrayList implements Array.

Now the harder issue. The semantics of "date" and "time" are trickier compared to "timestamp". A timestamp is understood to be an instant in time (the java/joda Instant class). A timestamp is a single point in time and is not impacted by time zones. Date and time can have two uses:

1) Logical date and time, which is not a fixed point or range in time. E.g. 2018-01-01 is New Year's Day, and this day happens over a different 24-hour window depending on which time zone you are in. So this is a logical date, and we don't all celebrate the start of the New Year at the same time.

2) Date and time with offset/timezone -- this refers to a specific point or range in time. This type of date/time is absolute. E.g. NYSE opens at "7:30 am EST". Regardless of the timezone you are in, this time is the same for everyone.

The joda/java8 time packages have proper handling for 1) logical date and time (using the local date/time classes); 2) absolute date and time (using the offset date/time classes); and 3) timestamps (the Instant class). Various methods exist for you to apply conversions between these based on the heuristics you want to apply.

I feel that for Drill, we first need to decide what behavior we want to support. Then go from there.

---
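The two uses described above, and the conversion between them, can be sketched with java.time. This is illustrative only: only by attaching a zone does a logical date/time become an absolute point in time, from which a zone-independent instant follows.

```java
// Sketch of logical vs. absolute date/time using java.time.
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class LogicalVsAbsoluteDemo {
    public static void main(String[] args) {
        // 1) Logical: "Jan 1, 2018, 9:30 am" -- not yet a point in time.
        LocalDateTime logical = LocalDateTime.of(2018, 1, 1, 9, 30);

        // 2) Absolute: the same wall-clock reading pinned to a zone.
        ZonedDateTime absolute = logical.atZone(ZoneId.of("America/New_York"));

        // 3) Timestamp: the zone-independent instant that reading denotes.
        Instant instant = absolute.toInstant();
        System.out.println(instant); // 2018-01-01T14:30:00Z (EST = UTC-5)
    }
}
```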
[GitHub] drill issue #1227: DRILL-6236: batch sizing for hash join
Github user ilooner commented on the issue: https://github.com/apache/drill/pull/1227 @ppadma Please fix travis failure ---
[GitHub] drill issue #1232: DRILL-6094: Decimal data type enhancements
Github user kkhatua commented on the issue: https://github.com/apache/drill/pull/1232 I would recommend we wait till we have all testing complete, since this is a big feature. ---
[GitHub] drill issue #1214: DRILL-6331: Revisit Hive Drill native parquet implementat...
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1214 Looks Good. Thanks for making the changes Arina. +1 ---
[GitHub] drill pull request #1225: DRILL-6272: Refactor dynamic UDFs and function ini...
Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1225#discussion_r184411700 --- Diff: exec/java-exec/pom.xml --- @@ -593,6 +593,48 @@ netty-tcnative ${netty.tcnative.classifier} + + org.apache.maven + maven-embedder + 3.3.9 --- End diff -- Consider using the latest release available. ---
[GitHub] drill pull request #1225: DRILL-6272: Refactor dynamic UDFs and function ini...
Github user vrozov commented on a diff in the pull request: https://github.com/apache/drill/pull/1225#discussion_r184245358 --- Diff: pom.xml --- @@ -798,7 +798,7 @@ com.googlecode.jmockit jmockit - 1.3 + 1.7 --- End diff -- Can it be done as a precursor PR? 1.7 version is quite old too. Can it be upgraded to the latest (org.jmockit:jmockit:1.39)? ---
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user parthchandra commented on the issue:
https://github.com/apache/drill/pull/1184

> But, if April 15 is your birthday, it is your birthday in all timezones. We don't say your birthday (or order date, or newspaper issue date or...) is one day in, say London and another day in Los Angeles.

If it is your birthday in California, it may already be the day after your birthday in Japan. :)

IMO, representing dates, times, and timestamps as UTC is not the problem. It is, in fact, perfectly correct (since UTC is a timezone). Converting a date|time|timestamp without a timezone to/from UTC is the problem. The problem is made worse by the java.util and JDBC APIs. java.time gets it right, though. However, as Jiang-wu points out, that still does not address the mismatch between the Joda/Java8 representation and JDBC. It also does not address his original problem, the issue of how to represent a complex type in JDBC; just return an Object, it says, which is no help at all. It is even worse for ODBC, which (last I checked) did not even have an API to return an Object type (which is why in ODBC we return a JSON string representation). For Jiang-wu's use case, since the string representation is not enough, we might look at returning a java.sql.Struct [1] type for Maps and java.sql.Array [2] types.

[1] https://docs.oracle.com/javase/7/docs/api/java/sql/Struct.html
[2] https://docs.oracle.com/javase/7/docs/api/java/sql/Array.html

---
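The java.sql.Struct suggestion above can be sketched as a minimal wrapper over a map value. This is a hedged illustration, not Drill's JDBC implementation: the class name `MapStruct` and its fields are hypothetical, and only the three methods the `java.sql.Struct` interface requires are implemented.

```java
import java.sql.Struct;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical wrapper exposing a map-valued column via java.sql.Struct. */
public class MapStruct implements Struct {
    private final String typeName;
    private final Object[] attributes;

    public MapStruct(String typeName, Map<String, Object> value) {
        this.typeName = typeName;
        // Attribute order follows the map's iteration order.
        this.attributes = value.values().toArray();
    }

    @Override
    public String getSQLTypeName() { return typeName; }

    @Override
    public Object[] getAttributes() { return attributes.clone(); }

    @Override
    public Object[] getAttributes(Map<String, Class<?>> typeMap) {
        // Custom type mapping is out of scope for this sketch.
        return getAttributes();
    }
}
```

A client would then navigate the complex value through the standard `getSQLTypeName()`/`getAttributes()` API instead of a JSON string.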
Re: Display column data type without code
Thanks! As it turns out, typeof() is not documented on the Apache Drill website, which is why I was confused. Maybe we should document it.

Just tried typeof(). Turns out it only returns the type, not the nullability or repeated mode. Is there a separate function to get the "mode" information?

- Paul

On Wednesday, April 25, 2018, 10:09:34 PM PDT, Aman Sinha wrote:

You can do it through SQL using the typeof() function. Since there is no global schema, Drill evaluates this for each row.

0: jdbc:drill:drillbit=10.10.101.41> select n_name, typeof(n_name) as name_type, n_nationkey, typeof(n_nationkey) as nationkey_type from cp.`tpch/nation.parquet` limit 2;
+------------+------------+--------------+-----------------+
|   n_name   | name_type  | n_nationkey  | nationkey_type  |
+------------+------------+--------------+-----------------+
| ALGERIA    | VARCHAR    | 0            | INT             |
| ARGENTINA  | VARCHAR    | 1            | INT             |
+------------+------------+--------------+-----------------+

On Wed, Apr 25, 2018 at 9:22 PM, Paul Rogers wrote:

> Hi All,
> Anyone know if there is a non-code way to display the data types of
> columns returned from a Drill query? Sqlline appears to only show the
> column names and values. The same is true of the Drill web console.
> The EXPLAIN PLAN FOR ... command shows the query plan, but not types (which
> are only known at run time.) Is there a statement, system table, or some
> other trick to display column types in, say, Sqlline?
> In the past, I've gotten the types by using unit-test-style code. But
> that is not too handy for use as an example for non-developers...
> Thanks,
> - Paul
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request:
https://github.com/apache/drill/pull/1228#discussion_r184483395

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
    */
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

@paul-rogers yes, this PR is trying to do the second thing, i.e. make the best guess with no knowledge at all about lengths. Also, the right side can produce no rows for reasons other than an empty array, e.g. we might be filtering everything out after unnest. Please let me know what you think about the approach.

---
Re: Display column data type without code
Hi Paul, You could also use DESCRIBE (https://drill.apache.org/docs/describe/).

0: jdbc:drill:drillbit=localhost:31010> describe `hive.default`.`integer_table`;
+--------------+--------------------+--------------+
| COLUMN_NAME  | DATA_TYPE          | IS_NULLABLE  |
+--------------+--------------------+--------------+
| keycolumn    | CHARACTER VARYING  | YES          |
| column1      | INTEGER            | YES          |
+--------------+--------------------+--------------+

Best regards, Rob

On Wed, Apr 25, 2018 at 10:12 PM, Abhishek Girish wrote:
> Hey Paul,
>
> You could use the typeof() function for this purpose. It takes a single parameter - the column name.
>
> For example:
>
> select typeof(c_current_cdemo_sk) from customer limit 1;
> +---------+
> | EXPR$0  |
> +---------+
> | BIGINT  |
> +---------+
> 1 row selected (0.472 seconds)
>
> On Wed, Apr 25, 2018 at 9:23 PM Paul Rogers wrote:
> > Hi All,
> > Anyone know if there is a non-code way to display the data types of columns returned from a Drill query? Sqlline appears to show only the column names and values. The same is true of the Drill web console.
> > The EXPLAIN PLAN FOR ... command shows the query plan, but not the types (which are only known at run time). Is there a statement, system table, or some other trick to display column types in, say, Sqlline?
> > In the past, I've gotten the types by using unit-test-style code. But that is not too handy for use as an example for non-developers...
> > Thanks,
> > - Paul
[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1238 There is not enough info available to debug and/or troubleshoot DRILL-5908, and rather than trying to find bugs in a homegrown solution, I prefer to replace it with Java's out-of-the-box functionality and at the same time provide the ability to log enough information to do RCA for DRILL-5908. IMO, there are no unreasonable requests on the PR/JIRA 😀. ---
[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable
Github user parthchandra commented on the issue: https://github.com/apache/drill/pull/1238 Fair enough. But that *still* does not give me a clue about the problem(s) you were trying to fix, or how the refactoring helps. Is the cause of the problem in TimedRunnable? Not an unreasonable request, is it? ---
[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1238 I did not change how tasks (`Runnable` or `Callable`) behave and did not look into converting `Callable`/`Runnable` to a `ForkJoinTask`. Whether existing tasks can be scheduled recursively or not depends on the nature of those tasks and is not in the scope of this PR. I'd suggest filing a JIRA, if one does not exist already, to look into `Fork/Join` (this is what I would expect from the developer who put the "TODO"). ---
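For context on what a `Fork/Join` follow-up might look like (an illustrative sketch only, not part of this PR; the `SumTask` class is invented for the example): a `RecursiveTask` can fork subtasks from inside a running task, which is awkward to do safely with plain `Callable`s on a bounded `ExecutorService`.

```java
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
  private final long[] values;
  private final int lo, hi;

  public SumTask(long[] values, int lo, int hi) {
    this.values = values; this.lo = lo; this.hi = hi;
  }

  @Override
  protected Long compute() {
    if (hi - lo <= 4) {               // small slice: sum directly
      long sum = 0;
      for (int i = lo; i < hi; i++) sum += values[i];
      return sum;
    }
    int mid = (lo + hi) >>> 1;
    SumTask left = new SumTask(values, lo, mid);
    left.fork();                      // schedule the left half recursively
    long right = new SumTask(values, mid, hi).compute();
    return right + left.join();       // join the forked half
  }
}
```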
[GitHub] drill issue #1238: DRILL-6281: Refactor TimedRunnable
Github user vrozov commented on the issue: https://github.com/apache/drill/pull/1238 The step is necessary to do RCA for DRILL-5908. There are way too many issues with the current implementation to list them all in the JIRA or PR, and the major issue is the use of homegrown solutions where Java (or another 3rd-party library) already provides the required functionality out of the box. There is no need to use `Runnable` instead of `Callable` and provide custom `Callable` functionality. It is not necessary to wait on a `CountDownLatch` when `ExecutorService` provides the ability to invoke all tasks and return results when all tasks complete or a timeout expires. ---
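The out-of-the-box alternative described above can be sketched as follows (a minimal standalone example, not the PR's actual code; class and method names are assumed): `ExecutorService.invokeAll` with a timeout replaces the manual `CountDownLatch` bookkeeping.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class InvokeAllSketch {
  // Submit all tasks and block until every task completes or the timeout
  // expires. invokeAll itself cancels tasks still running at the deadline,
  // so no manual latch or counter bookkeeping is needed.
  public static <T> List<Future<T>> runAll(List<Callable<T>> tasks, long timeoutMillis)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      return pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      pool.shutdown();
    }
  }
}
```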
[jira] [Resolved] (DRILL-4156) Parquet group converter cannot cope with ENUM original type values
[ https://issues.apache.org/jira/browse/DRILL-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva resolved DRILL-4156. - Resolution: Fixed Fixed in DRILL-5971. > Parquet group converter cannot cope with ENUM original type values > --- > > Key: DRILL-4156 > URL: https://issues.apache.org/jira/browse/DRILL-4156 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.3.0, 1.4.0 >Reporter: Chris Jansen >Priority: Major > Labels: easyfix, newbie, patch > Fix For: 1.13.0 > > Attachments: parquet_enum.patch, parquet_enum_v2.patch > > > When importing parquet data written using a Thrift schema, Drill cannot cope > with ENUMs. This was fixed elsewhere in DRILL-1775 but not in the > DrillParquetGroupConverter class ([see mailing list > thread|https://mail-archives.apache.org/mod_mbox/drill-user/201508.mbox/%3CCAK55V=+VjPAehMiJ8U=crh182k4zvhgfxzfwznaa9edwzwg...@mail.gmail.com%3E]). > It appears that the ParquetToDrillTypeConverter class was updated to use a > varbinary as its default behaviour, but the DrillParquetGroupConverter class > was not updated to do the same.
> {code}
> Caused by: java.lang.UnsupportedOperationException: Unsupported type ENUM
> at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.getConverterForType(DrillParquetGroupConverter.java:249)
> at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:154)
> at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
> at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
> at org.apache.drill.exec.store.parquet2.DrillParquetGroupConverter.<init>(DrillParquetGroupConverter.java:147)
> at org.apache.drill.exec.store.parquet2.DrillParquetRecordMaterializer.<init>(DrillParquetRecordMaterializer.java:40)
> at org.apache.drill.exec.store.parquet2.DrillParquetReader.setup(DrillParquetReader.java:267)
> ... 14 more
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user jiang-wu commented on the issue: https://github.com/apache/drill/pull/1184 Yes. There are at least two issues. One is how Drill represents Date, Time, and Timestamp internally using a UTC-based instant representation and fudges the time zone in order to make the non-time-zone fields look right, as Paul outlined nicely above. To really fix this, one would first need to define how Drill wishes to handle Date, Time, and Timestamp: e.g. no time zone at all; time zone aware but not preserving; time zone aware and preserving; etc. One can look at how databases handle time zones for inspiration. This part is too ambitious for me to fix here. The second is obtaining a complex object value through the JDBC interface. Drill doesn't make a JSON object in order to send the data to a JDBC interface. It looks like the JDBC interface is simply an alternative accessor to the vector data transmitted from the server side to the client side. Once the vector data arrives on the client side, the JDBC layer builds a Map (or List) object by reading the values from the vectors. The issue I found was that inside this process, date|time|timestamp values from their respective vector classes were all represented using the same Java class, hence losing their type information altogether when the Java object is placed inside the Map|List. So a fix for this part is simple (in concept): just use different Java types. Except, when making this change, the first issue popped up, as one has to figure out how to make Date and Time out of UTC instant values, which are the fudged values from the existing logic. ---
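A minimal sketch of the "make a JDBC type out of the fudged UTC instant values" step (a hypothetical helper using `java.time` rather than Drill's Joda-based code; the storage convention is as described above and the method name is invented): the millis are reinterpreted as UTC wall-clock fields before building the JDBC type, so no zone shift is applied to the stored value.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class DrillTimestampSketch {
  // Assumption: the vector stores the wall-clock value as epoch millis
  // *as if* it were a UTC instant. Read the fields back at UTC, then build
  // the java.sql.Timestamp from those local fields, leaving them unshifted.
  public static Timestamp timestampFromDrillMillis(long utcStyleMillis) {
    LocalDateTime wallClock =
        LocalDateTime.ofInstant(Instant.ofEpochMilli(utcStyleMillis), ZoneOffset.UTC);
    return Timestamp.valueOf(wallClock);
  }
}
```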
[GitHub] drill issue #1214: DRILL-6331: Revisit Hive Drill native parquet implementat...
Github user arina-ielchiieva commented on the issue: https://github.com/apache/drill/pull/1214 @parthchandra added two new commits: 1. reverted custom Stopwatch implementation and used logger checks instead. 2. allowed to create several non-tracking fs but only one tracking per operator context (details in prior discussion). ---
[GitHub] drill pull request #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] cl...
Github user jiang-wu commented on a diff in the pull request: https://github.com/apache/drill/pull/1184#discussion_r184424638 --- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java --- @@ -509,15 +509,15 @@ public long getTwoAsLong(int index) { public ${friendlyType} getObject(int index) { org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); - return date; + return new java.sql.Date(date.getMillis()); --- End diff -- The string representations of LocalDateTime and Timestamp are not exactly the same, but that is potentially fixable, since we can alter the way the values are displayed via formatters. JDBC, though, is not just for getting string representations. It is in the programmatic use cases, where we are getting the value objects, that one would see the disconnect in the data types. Will try this out with a use case I have with programmatic JDBC access and see what the impacts are on different types for the same expected value. ---
[GitHub] drill pull request #1214: DRILL-6331: Revisit Hive Drill native parquet impl...
Github user arina-ielchiieva commented on a diff in the pull request: https://github.com/apache/drill/pull/1214#discussion_r184401600 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/BaseOperatorContext.java --- @@ -158,25 +159,26 @@ public void close() { } catch (RuntimeException e) { ex = ex == null ? e : ex; } -try { - if (fs != null) { + +for (DrillFileSystem fs : fileSystems) { + try { fs.close(); -fs = null; - } -} catch (IOException e) { + } catch (IOException e) { throw UserException.resourceError(e) -.addContext("Failed to close the Drill file system for " + getName()) -.build(logger); + .addContext("Failed to close the Drill file system for " + getName()) + .build(logger); + } } + if (ex != null) { throw ex; } } @Override public DrillFileSystem newFileSystem(Configuration conf) throws IOException { -Preconditions.checkState(fs == null, "Tried to create a second FileSystem. Can only be called once per OperatorContext"); -fs = new DrillFileSystem(conf, getStats()); +DrillFileSystem fs = new DrillFileSystem(conf, getStats()); --- End diff -- When the `AbstractParquetScanBatchCreator.getBatch` method is called, it receives one operator context, which used to allow creating only one file system. It also receives an `AbstractParquetRowGroupScan`, which contains several row groups. Row groups may belong to different files. For Drill parquet files, we create only one fs and use it to create readers for each row group. That's why it was fine when the operator context allowed creating only one fs. But we needed to adjust it for Hive files: for Hive we need to create an fs for each file (since the config for each file system is different and is created using the projection pusher), which is why I had to change the operator context to allow more than one file system. I have also introduced `AbstractDrillFileSystemManager`, which controls the number of file systems created. `ParquetDrillFileSystemManager` creates only one (as was done before).
`HiveDrillNativeParquetDrillFileSystemManager` creates an fs for each file, so when two row groups belong to the same file, they get the same fs. But I agree that for a tracking fs (i.e. when store.parquet.reader.pagereader.async is set to false) this would make a mess of the calculations. So I suggest the following fix: for Hive we'll always create a non-tracking fs; for Drill it depends on the store.parquet.reader.pagereader.async option. I'll also add checks in the operator context to disallow creating more than one tracking fs, and to disallow creating a tracking fs at all when one or more non-tracking fs have already been created. ---
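The per-file caching policy described above can be sketched generically (a hypothetical class for illustration, not Drill's actual `HiveDrillNativeParquetDrillFileSystemManager`): each distinct file path gets its own instance, and row groups that share a file share it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class FsManagerSketch<FS> {
  // Cache one instance per file path; the factory stands in for
  // "create a DrillFileSystem from that file's configuration".
  private final Map<String, FS> perFile = new HashMap<>();
  private final Supplier<FS> factory;

  public FsManagerSketch(Supplier<FS> factory) {
    this.factory = factory;
  }

  // Hive-style policy: row groups from the same file reuse the same fs.
  public FS forFile(String path) {
    return perFile.computeIfAbsent(path, p -> factory.get());
  }
}
```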
[GitHub] drill issue #1184: DRILL-6242 - Use java.sql.[Date|Time|Timestamp] classes t...
Github user paul-rogers commented on the issue: https://github.com/apache/drill/pull/1184 One additional note. We noted that JDBC does not support the idea of a nested tuple (a Drill "map"). JDBC does support columns that return a Java object. To bridge the gap, Drill returns a Map column as a Java object. But why a JSON object? The answer seems to lie with the `sqlline` program. If we query a one-line JSON file with a nested object in `sqlline`, we get the following display: ``` SELECT * FROM `json/nested.json`;
+---------+----------------------------------+
| custId  | name                             |
+---------+----------------------------------+
| 101     | {"first":"John","last":"Smith"}  |
+---------+----------------------------------+
``` So it seems likely that the value of a Map object was translated to a JSON object so that when `sqlline` calls `toString()` on it, it ends up formatted nicely as above. Because of this, it may be hard to change the kind of objects returned from JDBC for a Map column. ---
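The display above follows directly from `sqlline` calling `toString()` on whatever object `getObject()` returns. A minimal sketch of rendering a map column's string entries in that JSON style (a hypothetical helper, not Drill's actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonMapSketch {
  // Render string-valued map entries as a JSON object, preserving insertion
  // order, so a console client that just calls toString() shows readable output.
  public static String toJson(Map<String, String> map) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> e : map.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      first = false;
      sb.append('"').append(e.getKey()).append("\":\"").append(e.getValue()).append('"');
    }
    return sb.append('}').toString();
  }
}
```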