Re: [DISCUSS] batch ownership

2018-04-29 Thread Paul Rogers
Hi Vlad,
More responses.
> The same approach [as for internal operators] applies to senders and 
> receivers. Senders get batches 
from the upstream operators, taking ownership of those batches, and send 
data to receivers.

Senders receive data from an "upstream" operator, then serialize over the wire. 
As a result, Senders take ownership from the upstream operator, but then must 
transfer ownership to Netty. Here I'll speculate. I believe that we create a 
Netty composite buffer that strings together the buffers that underlie the 
value vectors in the outgoing record batch. (Yes, there are many layers in 
play.)
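
To make this concrete, here is a rough sketch of what I *think* the composition 
might look like, using Netty's CompositeByteBuf. This is my speculation, not 
Drill's actual sender code; the class and method names are Netty 4.1's:

```
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.Channel;
import java.util.List;

// Hypothetical sketch: string the buffers underlying a batch's value
// vectors into one composite buffer and hand it to Netty.
class SenderSketch {
  void sendBatch(Channel channel, List<ByteBuf> vectorBuffers) {
    CompositeByteBuf outgoing = Unpooled.compositeBuffer(vectorBuffers.size());
    for (ByteBuf buf : vectorBuffers) {
      // addComponent(true, ...) advances the writer index so the composite
      // reads as one contiguous region; retain() adds the reference that
      // Netty will release once the write completes.
      outgoing.addComponent(true, buf.retain());
    }
    // Netty releases 'outgoing' (and each retained component) when the
    // write succeeds or fails, which is the point where the sender can
    // give up its allocator-level ownership.
    channel.writeAndFlush(outgoing);
  }
}
```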

Netty does not know about our allocator model. It does, however, have a 
reference count. So, my guess is that the Sender somehow gives up ownership of 
the outgoing buffer in the sense of the Drill allocator, but lets Netty drop 
the reference count once Netty has sent the buffer.

I believe you are quite familiar with Netty, so perhaps you can dig around here 
and explain how this actually works.

> Receivers get data from senders and reconstruct 
record batches.

You are right logically. But, physically there is a difference. Data arrives 
via Netty which allocates buffers for the data. Receivers take these raw 
buffers and turn them into batches. Here things get even more complex (if that 
is possible.) The Receiver creates multiple vectors on top of a single Netty 
buffer. That is, multiple vectors were serialized together and were read 
together. Much of the complexity of Drill's memory model comes from the ability 
to create multiple (logical) DrillBufs on top of a single (physical) Netty 
buffer. This is where we need reference counts (so we know when the last shared 
use goes away), and where we need the UDLE/DrillBuf separation.
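
To illustrate the sharing with plain Netty buffers standing in for the 
UDLE/DrillBuf pair (an analogy, not Drill's actual code): several logical 
slices can sit on one physical buffer, and only the reference count tells us 
when the last user is gone:

```
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

class SharedBufferSketch {
  static void demo() {
    ByteBuf incoming = Unpooled.buffer(256);            // stands in for the big network buffer
    ByteBuf vectorA = incoming.retainedSlice(0, 128);   // logical "vector" 1, bumps refCnt
    ByteBuf vectorB = incoming.retainedSlice(128, 128); // logical "vector" 2, bumps refCnt

    vectorA.release();   // first vector done; the memory must stay alive
    vectorB.release();   // second vector done; still alive
    incoming.release();  // last reference gone; only now is the memory freed
  }
}
```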

So, again, Netty does not play the Drill "ownership" game, it only does 
reference counts. So the Receiver must convert from the Netty reference count 
of the big incoming buffer, to reference counts for each materialized vector, 
and create some kind of entry in Drill's allocator. I'm not sure how this is 
done; it would be great if you could figure this out.

Could this be done differently? Probably. Maybe serialize each buffer by itself 
so that Netty creates separate buffers for each. I'd guess the original authors 
started with this design and moved to the present one, perhaps for performance 
reasons. (Anyone know of the history here?)

> It is the business logic of senders and receivers, and 
they may rely on other libraries (RPC and Netty) or classes to handle 
serialization/de-serialization, buffering, acknowledgment, back-pressure, 
or dealing with the network. From other Drill operators' point of view, 
senders and receivers are operators responsible for passing record 
batches from one drillbit to another.

True. Senders/Receivers should speak Drill operator protocol on one side, Netty 
protocol on the other. They are adapters. Is this not what you see?

> Following your approach it is necessary to modify MergingReceiver as 
well. It also pulls batches from a queue (see 
MergingRecordBatch.getNext()), but instead of almost immediately passing 
them to the next operator as UnorderedReceiver does, MergingReceiver creates a 
new record batch from the batches that it pulls from the queue. To be 
consistent with the proposed changes to UnorderedReceiver, it is necessary to 
change the ownership of the batches that MergingReceiver pulls as well, 
especially since MergingReceiver may keep a reference to the original batch 
much longer than UnorderedReceiver does (while it waits for batches 
from other drillbits).

I personally don't know the details. But, in general, if one operator passes 
data to another, it should play by the Drill ownership rules if it works with 
vectors. If, instead, it works with buffers, then it should probably play by 
the Netty rules.

> I don't see a reason to modify both UnorderedReceiver and 
MergingReceiver; instead, I think, we should modify the allocator used when 
batches are created in the first place before they are added to a queue.

My own suggestion here is that we may want to make use of an old-school 
technique that is still often handy: write up the design. Document the rules 
I've been doing my best to explain above. Add a detailed explanation of how 
Drill interfaces with Netty. Then, think through how we want to handle the 
Drill-operator-to-Netty interface.

Another particularly nasty area is the "Mux" operators. Several folks struggled 
to understand them and didn't get very far. This is not a good state to be in. 
We should really understand how they work. Perhaps understanding the most 
complex case will help shed light on the case under discussion.
Thanks,

- Paul


  

[GitHub] drill pull request #1243: Solve unable to get jquery in the intranet

2018-04-29 Thread mayyamus
Github user mayyamus closed the pull request at:

https://github.com/apache/drill/pull/1243


---


Re: [DISCUSS] batch ownership

2018-04-29 Thread Paul Rogers
Specific answers based on my understanding.

 > I did not mean that a pass-through operator should not take the 
ownership of a batch it processes. My question was whether they do so 
and if they do, when and how.

Yes, operators do take ownership, somewhere in the process of calling next() on 
their inputs. The exact place may vary between operators. In the Sort, for 
example, the code first checks the incoming batch size, spills sorted batches 
if needed to make space, then takes ownership. I'd go so far as to say that, if 
an operator does not take ownership, then it is a bug.

> As far as I can see in the 
ProjectorTemplate code, the transfer is not done in all cases and when 
Projector operates in sv2 mode, there is no transfer of the ownership. 

Template code is code that is copied for each generated operator. In general, 
this code should be minimal. Code that is common to all operator instances 
should not reside in the template. Instead, it should reside in the operator 
(the so-called RecordBatch). There is really no reason to copy the same 
bytecode over and over, taking up space in the code cache.

That said, the code to take ownership is likely to be in the Project operator 
implementation. Look for a place that works with "transfer pairs", they are the 
actual transfer mechanism. A quick glance at the code suggests this is done in 
ProjectRecordBatch.setupNewSchemaFromInput(). (An unfortunate name if we also 
do transfers.)
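
For reference, here is a minimal sketch of the transfer-pair mechanism. The 
method names are from the ValueVector interface as I recall it; please 
double-check against the source:

```
import org.apache.drill.exec.memory.BufferAllocator;
import org.apache.drill.exec.record.TransferPair;
import org.apache.drill.exec.vector.IntVector;

class TransferSketch {
  // Move the buffers (and thus allocator ownership) of an incoming vector
  // into a vector owned by this operator's allocator. No data is copied;
  // only the ledger entries change.
  IntVector takeOwnership(IntVector incoming, BufferAllocator myAllocator) {
    TransferPair pair = incoming.getTransferPair(myAllocator);
    pair.transfer();                    // zero-copy hand-off of the buffers
    return (IntVector) pair.getTo();    // now accounted to myAllocator
  }
}
```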

> Additionally, when there is a transfer, it is done when the processing 
of the batch is almost complete. 

Depends on what you mean by "almost complete." Since Project is 
single-threaded, there is no harm in doing the transfer later rather than 
sooner; the upstream operator won't be called until Project again calls next(). 
Makes sense to do it earlier, but not necessary.

> IMO, such behavior is counterintuitive 
and I would expect that if there is a transfer of the ownership, it is 
part of RecordBatch.next(), meaning that once an operator gets a 
reference to a record batch, it owns it. 

Perhaps. But, the Operator (that is, RecordBatch) protocol is a bit fussy. The 
next() call to RecordBatch tells that RecordBatch to build a batch of data and 
make it available. An operator has no visibility to its parent (its downstream 
operator). The caller must do the transfer as only the caller has visibility to 
its own vector container and that of the upstream (incoming) record batch. Yes, 
this is quite confusing. Nothing beats stepping through several operators to see 
how this works in practice.

Here, I will put in a plug for the revised Operator classes in the "batch 
handling" code. The new classes try to disentangle the many bits of 
functionality combined in Record Batch. Those three are: 1) iterator protocol, 
2) batch management, and 3) operator implementation. I believe we'll all 
understand this code better if we can separate these three concerns.

> At this point, an operator may 
consume content of the record batch and create a completely new record 
batch or it can modify the record batch and pass it to the next 
downstream operator.

Just to be clear, record batches (specifically vectors) are immutable. It is 
not possible to modify a record batch. One can, however, reuse parts of it. A 
Filter can slap on an SV2. A Project can discard some vectors, add others, and 
retain still others. But, in both cases, the operator must produce a new batch 
based on those vectors. Specifically, each operator has its own VectorContainer 
that contains its own vectors. Sharing occurs at the level of the DrillBufs that 
underlie the vectors. (Again, quite confusing, but it makes sense once you 
understand the operator allocators we discussed previously.)

Part of the complexity comes from proper memory management. New vectors are 
allocated in the Project operator's allocator. Retained vectors are transferred 
from the upstream operator's allocator (ledger) to that of the Project 
operator. Discarded vectors are released (perhaps after being shifted into the 
Project operator's allocator.)

OK, again enough for one note. More to come.

Thanks,

- Paul
  

Re: [DISCUSS] batch ownership

2018-04-29 Thread Paul Rogers
Hi Vlad,

Glad to see you are becoming an expert in the mechanics of data batch handling. 
This is a complex area that deserves the care and attention you are investing.

Drill's current behavior reflects the design decisions of Drill's original 
authors. Unfortunately, those authors are no longer available. (If you are out 
there, lurking, now would be a great time to help out Vlad by explaining the 
original design.) Failing that, we have to use our collective knowledge of the 
intended design. Plus, we should explore ways to improve the design, as you 
seem to be doing.

Drill has a complex memory model that works only if each operator ("record 
batch" in Drill's unfortunate terminology) takes ownership of each incoming 
record batch ("vector container" in Drill's terminology.) Recall that each 
operator has an operator-specific memory allocator with its own budget (though, 
at present, the budget numbers are completely artificial and nonsensical.) In 
addition, the minor fragment as a whole has a budget.

For the operator budget to work, the operator must take ownership of incoming 
batches, and give up ownership of outgoing batches. Why? Because doing so is 
the only way to track the memory that each operator uses in its 
operator-specific allocator. While this may not be the ideal design, it is how 
Drill works today.
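
A toy illustration of the accounting (not Drill's real allocator API; just the 
idea): the bytes of a batch are counted against exactly one operator's ledger 
at any moment, so ownership has to move with the batch:

```
// Toy model only: real Drill allocators also enforce limits, track
// child allocators, and handle shared buffers via ledgers.
class ToyAllocator {
  final String operatorName;
  long allocated;   // bytes currently owned by this operator

  ToyAllocator(String operatorName) { this.operatorName = operatorName; }

  // Accounting for a batch of 'bytes' moving downstream: the upstream
  // operator gives up ownership, the downstream operator takes it on.
  static void transferBatch(ToyAllocator from, ToyAllocator to, long bytes) {
    from.allocated -= bytes;
    to.allocated += bytes;
  }
}
```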

If we move fully to the budget-based design, then this level of operator 
control will no longer be necessary, and will be an unnecessary complication. 
Under the budget model, only the minor fragment as a whole needs an allocator; 
each operator plays its part within the overall fragment budget. A planning 
step works out the memory budget for the query, the minor fragments and each 
operator. This is all explained in [1].

Under the budget model, each operator attempts to stay within its budget, 
spilling to disk as needed. The budget model works only if "single batch" 
operators (such as Project, Filter, etc.) are given sufficient memory to hold 
two batches. This, in turn, requires that we control the size of each batch as 
Padma and others are doing.

That said, today exchanges *might* be special. My understanding is that some 
can receive a single batch from the network and feed that single batch to 
multiple slices ("minor fragments") of the same operator. This happens in, say, 
a broadcast exchange.

You mention SV2 mode. In fact, SV2 mode should operate the same as "plain" 
batches: an SV2 is a single indirection vector on a single batch of data. 
Perhaps you meant "SV4 mode." Indeed, SV4 is special since an SV4 sits atop a 
large collection of batches and simulates a batch by picking out a collection 
of rows across the many batches. SV4 is used in the output of an in-memory sort 
(and perhaps other places.) There is no transfer of ownership in SV4 mode 
because the same batches will be used over and over until all data is 
delivered. It is the responsibility of the Sort operator to release the 
collection of batches once it has delivered all results (or the query fails.)
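
For anyone new to selection vectors, here is a sketch of the SV2 indirection, 
using the SelectionVector2 accessors as I remember them: the data batch is 
untouched; only the 2-byte indices say which rows survived.

```
import org.apache.drill.exec.record.selection.SelectionVector2;
import org.apache.drill.exec.vector.IntVector;

class Sv2Sketch {
  // Read "row i of the filtered result" through the SV2 indirection.
  long sumSelected(SelectionVector2 sv2, IntVector values) {
    long sum = 0;
    for (int i = 0; i < sv2.getCount(); i++) {
      int physicalRow = sv2.getIndex(i);  // maps filtered row -> batch row
      sum += values.getAccessor().get(physicalRow);
    }
    return sum;
  }
}
```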


Enough for this response. I'll send additional responses for your other points.

The key concept to keep in mind is that the Drill memory system, as a whole, is 
quite complex. It can certainly be improved (as we are doing with the batch 
handling revisions.) But, we must consider the entire system when considering 
changes to any one part of the system. It is a complex topic; it is great that 
we have someone with your experience exploring our options.

Thanks,
- Paul

[1]  https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades


 

On Sunday, April 29, 2018, 9:26:24 PM PDT, Vlad Rozov  
wrote:  
 
 I did not mean that a pass-through operator should not take the 
ownership of a batch it processes. My question was whether they do so 
and if they do, when and how. As far as I can see in the 
ProjectorTemplate code, the transfer is not done in all cases and when 
Projector operates in sv2 mode, there is no transfer of the ownership. 
Additionally, when there is a transfer, it is done when the processing 
of the batch is almost complete. IMO, such behavior is counterintuitive 
and I would expect that if there is a transfer of the ownership, it is 
part of RecordBatch.next(), meaning that once an operator gets a 
reference to a record batch, it owns it. At this point, an operator may 
consume content of the record batch and create a completely new record 
batch or it can modify the record batch and pass it to the next 
downstream operator.

The behavior above applies to an operator that consumes record batches 
from another operator. An input operator (scan or edge operator) is an 
operator that produces record batches from an external source (parquet 
file, hbase, kafka, etc). IMO, when such operators create record batches, 
they should allocate memory using the operator allocator rather than the 
fragment allocator. If the memory is allocated using the fragment allocator, 
there is 

[DISCUSS] batch ownership

2018-04-29 Thread Vlad Rozov
I did not mean that a pass-through operator should not take the 
ownership of a batch it processes. My question was whether they do so 
and if they do, when and how. As far as I can see in the 
ProjectorTemplate code, the transfer is not done in all cases and when 
Projector operates in sv2 mode, there is no transfer of the ownership. 
Additionally, when there is a transfer, it is done when the processing 
of the batch is almost complete. IMO, such behavior is counterintuitive 
and I would expect that if there is a transfer of the ownership, it is 
part of RecordBatch.next(), meaning that once an operator gets a 
reference to a record batch, it owns it. At this point, an operator may 
consume content of the record batch and create a completely new record 
batch or it can modify the record batch and pass it to the next 
downstream operator.


The behavior above applies to an operator that consumes record batches 
from another operator. An input operator (scan or edge operator) is an 
operator that produces record batches from an external source (parquet 
file, hbase, kafka, etc). IMO, when such operators create record batches, 
they should allocate memory using the operator allocator rather than the 
fragment allocator. If the memory is allocated using the fragment allocator, 
there is no point in changing ownership when batch construction is complete 
and the batch is passed to the next operator.


The same approach applies to senders and receivers. Senders get batches 
from the upstream operators, taking ownership of those batches, and send 
data to receivers. Receivers get data from senders and reconstruct 
record batches. It is the business logic of senders and receivers, and 
they may rely on other libraries (RPC and Netty) or classes to handle 
serialization/de-serialization, buffering, acknowledgment, back-pressure, 
or dealing with the network. From other Drill operators' point of view, 
senders and receivers are operators responsible for passing record 
batches from one drillbit to another.


Following your approach it is necessary to modify MergingReceiver as 
well. It also pulls batches from a queue (see 
MergingRecordBatch.getNext()), but instead of almost immediately passing 
them to the next operator as UnorderedReceiver does, MergingReceiver creates a 
new record batch from the batches that it pulls from the queue. To be 
consistent with the proposed changes to UnorderedReceiver, it is necessary to 
change the ownership of the batches that MergingReceiver pulls as well, 
especially since MergingReceiver may keep a reference to the original batch 
much longer than UnorderedReceiver does (while it waits for batches 
from other drillbits).


I don't see a reason to modify both UnorderedReceiver and 
MergingReceiver; instead, I think, we should modify the allocator used when 
batches are created in the first place before they are added to a queue.


Thank you,

Vlad

On 4/27/18 18:10, salim achouche wrote:

Correction for example II, as Drill uses a single thread per pipeline (a
batch is fully processed before the next one is; only the receipt of batches
can happen concurrently):
- Using batch identifiers for more clarity
- t0: (fragment, opr-1, opr-2) = ([b1], [], [])
- t1: (fragment, opr-1, opr-2) = ([b2], [b1], [])
- t2: (fragment, opr-1, opr-2) = ([b3,b2], [], [b1])
(fragment, opr-1, opr-2) = ([b3], [b2], [])
(fragment, opr-1, opr-2) = ([b3], [], [b2])
(fragment, opr-1, opr-2) = ([], [b3], [])
(fragment, opr-1, opr-2) = ([], [], [b3])

The point remains the same: change of ownership for pass-through operators
remains valid, as it doesn't inflate resource allocation at any given time
snapshot.


On Sat, Apr 28, 2018 at 12:42 AM, salim achouche 
wrote:


Another point, I don't see a functional benefit from avoiding a change of
ownership for pass-through operators. Consider the following use-cases:

Example I -
- A single batch of size 8MB is received at time t0 and then passed
through a set of pass-through operators
- At time t1 it is owned by operator Opr1, at time t2 by operator Opr2, and so
forth
- Assume we report memory usage at times t0 - t2; this is what will be seen
- t0: (fragment, opr-1, opr-2) = (8MB, 0, 0)
- t1: (fragment, opr-1, opr-2) = (0, 8MB, 0)
- t2: (fragment, opr-1, opr-2) = (0, 0, 8MB)

Example II -
- Multiple batches of size 8MB are received at times t0 - t2 and then passed
through a set of pass-through operators
- At time t1 a batch is owned by operator Opr1, at time t2 by operator Opr2,
and so forth
- Assume we report memory usage at times t0 - t2; this is what will be seen
- t0: (fragment, opr-1, opr-2) = (8MB, 0, 0)
- t1: (fragment, opr-1, opr-2) = (8MB, 8MB, 0)
- t2: (fragment, opr-1, opr-2) = (8MB, 8MB, 8MB)


The key thing is that we clarify our reporting metrics so that users do
not draw the wrong conclusions.

Regards,
Salim

On Fri, Apr 27, 2018 at 11:47 PM, salim achouche 
wrote:


Vlad,

- My understanding is that operators need to take ownership of 

[GitHub] drill issue #1236: DRILL-6347: Inconsistent method name "field".

2018-04-29 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1236
  
LGTM. Please squash commits.


---


[GitHub] drill issue #1235: DRILL-6336: Inconsistent method name.

2018-04-29 Thread vrozov
Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1235
  
My take is that "append" is more common for classes with similar 
functionality, see for example `ToStringBuilder`. As there is no added benefit 
of using "print" vs "append", my recommendation is to keep "append" as is and 
see if `DebugStringBuilder` can be replaced with `ToStringBuilder`.


---


[jira] [Created] (DRILL-6371) Use FilterSetOpTransposeRule, DrillProjectSetOpTransposeRule in main logical stage

2018-04-29 Thread Vitalii Diravka (JIRA)
Vitalii Diravka created DRILL-6371:
--

 Summary: Use FilterSetOpTransposeRule, 
DrillProjectSetOpTransposeRule in main logical stage
 Key: DRILL-6371
 URL: https://issues.apache.org/jira/browse/DRILL-6371
 Project: Apache Drill
  Issue Type: Improvement
  Components: Query Planning & Optimization
Affects Versions: 1.13.0
Reporter: Vitalii Diravka
 Fix For: Future


FilterSetOpTransposeRule and DrillProjectSetOpTransposeRule are leveraged in 
DRILL-3855.
They are used in the HepPlanner, but if they are additionally enabled in the 
main logical planning stage for the Volcano planner, more cases will be covered 
by these rules.
For example: 
{code}
WITH year_total_1
     AS (SELECT c.r_regionkey customer_id,
                1 year_total
         FROM   cp.`tpch/region.parquet` c
         UNION ALL
         SELECT c.n_nationkey customer_id,
                1 year_total
         FROM   cp.`tpch/nation.parquet` c),
     year_total_2
     AS (SELECT c.r_regionkey customer_id,
                1 year_total
         FROM   cp.`tpch/region.parquet` c
         UNION ALL
         SELECT c.n_nationkey customer_id,
                1 year_total
         FROM   cp.`tpch/nation.parquet` c)
SELECT count(t_w_firstyear.customer_id) as ct
FROM   year_total_1 t_w_firstyear,
       year_total_2 t_w_secyear
WHERE  t_w_firstyear.year_total = t_w_secyear.year_total
       AND t_w_firstyear.year_total > 0 and t_w_secyear.year_total > 0
{code}

Currently, using them in the Volcano planner can cause infinite loops (CALCITE-1271).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-3130) Project can be pushed below union all / union to improve performance

2018-04-29 Thread Vitalii Diravka (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalii Diravka resolved DRILL-3130.

    Resolution: Done
 Fix Version/s: (was: 1.1.0)
                1.14.0

Resolved in DRILL-3855

> Project can be pushed below union all / union to improve performance
> 
>
> Key: DRILL-3130
> URL: https://issues.apache.org/jira/browse/DRILL-3130
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Reporter: Sean Hsuan-Yi Chu
>Assignee: Vitalii Diravka
>Priority: Major
> Fix For: 1.14.0
>
>
> A query such as 
> {code}
> Select a from 
> (select a, b, c, ..., union all select a, b, c, ...)
> {code}
> will perform Union-All over all the specified columns on the two sides, 
> despite the fact that only one column is asked for at the end. Ideally, we 
> should perform the ProjectPushDown rule for Union & Union-All to avoid 
> generating results that will be discarded at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (DRILL-2746) Filter is not pushed into subquery past UNION ALL

2018-04-29 Thread Vitalii Diravka (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitalii Diravka resolved DRILL-2746.

    Resolution: Done
 Fix Version/s: (was: 1.1.0)
                1.14.0

Resolved in DRILL-3855

> Filter is not pushed into subquery past UNION ALL
> -
>
> Key: DRILL-2746
> URL: https://issues.apache.org/jira/browse/DRILL-2746
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Query Planning & Optimization
>Affects Versions: 0.9.0
>Reporter: Victoria Markman
>Assignee: Vitalii Diravka
>Priority: Major
> Fix For: 1.14.0
>
>
> I expected to see the filter pushed to at least the left side of the UNION 
> ALL; instead it is applied after the UNION ALL
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select * from (select a1, b1, c1 
> from t1 union all select a2, b2, c2 from t2 )  where a1 = 10;
> +++
> |text|json|
> +++
> | 00-00Screen
> 00-01  Project(a1=[$0], b1=[$1], c1=[$2])
> 00-02SelectionVectorRemover
> 00-03  Filter(condition=[=($0, 10)])
> 00-04UnionAll(all=[true])
> 00-06  Project(a1=[$2], b1=[$1], c1=[$0])
> 00-08Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/predicates/t1]], 
> selectionRoot=/drill/testdata/predicates/t1, numFiles=1, columns=[`a1`, `b1`, 
> `c1`]]])
> 00-05  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-07Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=maprfs:/drill/testdata/predicates/t2]], 
> selectionRoot=/drill/testdata/predicates/t2, numFiles=1, columns=[`a2`, `b2`, 
> `c2`]]])
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill pull request #1210: DRILL-6270: Add debug startup option flag for dril...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1210


---


[GitHub] drill pull request #1216: DRILL-6173: Support transitive closure during filt...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1216


---


[GitHub] drill pull request #1230: DRILL-6345: DRILL Query fails on Function LOG10

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1230


---


[GitHub] drill pull request #1196: DRILL-6286: Fixed incorrect reference to shutdown ...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1196


---


[GitHub] drill pull request #1226: DRILL-3855: Enable FilterSetOpTransposeRule, Drill...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1226


---


[GitHub] drill pull request #1222: DRILL-6341: Fixed failing tests for mongodb storag...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1222


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1218


---


[GitHub] drill pull request #1144: DRILL-6202: Deprecate usage of IndexOutOfBoundsExc...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1144


---


[GitHub] drill pull request #1234: DRILL-5927: Fixed memory leak in TestBsonRecordRea...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1234


---


[GitHub] drill pull request #1240: DRILL-6327: Update unary operators to handle IterO...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1240


---


[GitHub] drill pull request #1217: DRILL-6302: Fixed NPE in Drillbit close method

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1217


---


[GitHub] drill pull request #1220: DRILL-6328: Consolidate developer docs in docs fol...

2018-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1220


---


[GitHub] drill issue #1236: DRILL-6347: Inconsistent method name "field".

2018-04-29 Thread BruceKuiLiu
Github user BruceKuiLiu commented on the issue:

https://github.com/apache/drill/pull/1236
  
@vrozov Thanks.



---


[jira] [Created] (DRILL-6370) Mod operator % is documented, but not available

2018-04-29 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6370:
--

 Summary: Mod operator % is documented, but not available
 Key: DRILL-6370
 URL: https://issues.apache.org/jira/browse/DRILL-6370
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.13.0
Reporter: Paul Rogers


The [Operators|http://drill.apache.org/docs/operators/] page in the 
documentation states that the {{%}} operator does modulo division. The first 
issue is that {{%}} is listed in the precedence table, but not the math 
operator table.

Suppose we try to use the operator:

{noformat}
SELECT 10 % 3 FROM (VALUES(1));

Error: PARSE ERROR: Percent remainder '%' is not allowed under the
  current SQL conformance level
{noformat}

It seems that if we list the operator, we should support it. Or, failing that, 
add a note to say that the {{%}} operator is not currently supported.

The workaround is to use the {{mod()}} function:

{noformat}
SELECT mod(10, 3) FROM (VALUES(1));
+-+
| EXPR$0  |
+-+
| 1   |
+-+
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] drill issue #1224: DRILL-6321: Customize Drill's conformance. Allow support ...

2018-04-29 Thread vvysotskyi
Github user vvysotskyi commented on the issue:

https://github.com/apache/drill/pull/1224
  
As I understand from DRILL-1921, cross join was prevented due to the 
`CannotPlanException` exception at the planning stage. 
Can we get the same problem using `APPLY`? If yes, we should discuss the 
possibility of adding some limitations for `APPLY`, for example, denying usage 
for the case when a filter is absent in the query, etc.


---


[GitHub] drill issue #1243: Solve unable to get jquery in the intranet

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1243
  
@mayyamus please create an Apache JIRA for the fix first 
(https://drill.apache.org/docs/apache-drill-contribution-guidelines/). Also 
please note that the part of the code you are changing was done intentionally in 
https://issues.apache.org/jira/browse/DRILL-5699. Is there a way to preserve 
the original intention and fix your issue?


---


[GitHub] drill issue #1233: Updated with links to previous releases

2018-04-29 Thread kkhatua
Github user kkhatua commented on the issue:

https://github.com/apache/drill/pull/1233
  
@arina-ielchiieva  I'll change the PR as suggested by Parth. Since Bridget 
does the merges for _gh-pages_ repo, I'll ask her to close the PR. 


---


[GitHub] drill pull request #1243: Solved unable to get jquery on the intranet

2018-04-29 Thread mayyamus
GitHub user mayyamus opened a pull request:

https://github.com/apache/drill/pull/1243

Solved unable to get jquery on the intranet

When running on an intranet, access is slow because jQuery cannot be fetched. 
Modified the page to access local jQuery resources directly.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mayyamus/drill minor_fix_js_timeout

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1243






---


[GitHub] drill pull request #1222: DRILL-6341: Fixed failing tests for mongodb storag...

2018-04-29 Thread cgivre
Github user cgivre commented on a diff in the pull request:

https://github.com/apache/drill/pull/1222#discussion_r184882305
  
--- Diff: 
contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/MongoTestSuit.java
 ---
@@ -128,42 +130,63 @@ private static void setup() throws Exception {
   createDbAndCollections(DATATYPE_DB, DATATYPE_COLLECTION, "_id");
 }
 
-    private static IMongodConfig crateConfigServerConfig(int configServerPort,
-        boolean flag) throws UnknownHostException, IOException {
-      IMongoCmdOptions cmdOptions = new MongoCmdOptionsBuilder().useNoJournal(false).verbose(false)
-          .build();
+    private static IMongodConfig crateConfigServerConfig(int configServerPort) throws UnknownHostException, IOException {
+      IMongoCmdOptions cmdOptions = new MongoCmdOptionsBuilder()
+        .useNoPrealloc(false)
+        .useSmallFiles(false)
+        .useNoJournal(false)
+        .useStorageEngine(STORAGE_ENGINE)
+        .verbose(false)
+        .build();
+
+      Storage replication = new Storage(null, CONFIG_REPLICA_SET, 0);

       IMongodConfig mongodConfig = new MongodConfigBuilder()
-          .version(Version.Main.PRODUCTION)
+          .version(Version.Main.V3_4)
           .net(new Net(LOCALHOST, configServerPort, Network.localhostIsIPv6()))
-          .configServer(flag).cmdOptions(cmdOptions).build();
+          .replication(replication)
+          .shardServer(false)
+          .configServer(true).cmdOptions(cmdOptions).build();
--- End diff --

Successfully built Mongo storage plugin on my Mac.  LGTM +1


---


[GitHub] drill issue #1126: DRILL-6179: Added pcapng-format support

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1126
  
@Vlad-Storona could you please rebase to the latest master and confirm that 
the PR is ready for review?


---


[GitHub] drill issue #1204: DRILL-6318

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1204
  
@oleg-zinovev could you please rebase to the latest master?


---


[GitHub] drill issue #1224: DRILL-6321: Customize Drill's conformance. Allow support ...

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1224
  
@chunhui-shi could you please address @vrozov's comment?
@vvysotskyi could you please also take a look at the PR?


---


[GitHub] drill issue #1233: Updated with links to previous releases

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1233
  
@kkhatua / @parthchandra  should we close the PR or what other work should 
be done?


---


[GitHub] drill issue #1235: DRILL-6336: Inconsistent method name.

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1235
  
@vrozov so you suggest leaving it as is, correct?
@paul-rogers since you have originally added `DebugStringBuilder`, do you 
agree?


---


[GitHub] drill issue #1236: DRILL-6347: Inconsistent method name "field".

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/1236
  
@BruceKuiLiu could you please address @vrozov's comments?


---


[GitHub] drill pull request #1222: DRILL-6341: Fixed failing tests for mongodb storag...

2018-04-29 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/1222#discussion_r184881467
  
--- Diff: 
contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/MongoTestSuit.java
 ---
@@ -128,42 +130,63 @@ private static void setup() throws Exception {
   createDbAndCollections(DATATYPE_DB, DATATYPE_COLLECTION, "_id");
 }
 
-    private static IMongodConfig crateConfigServerConfig(int configServerPort,
-        boolean flag) throws UnknownHostException, IOException {
-      IMongoCmdOptions cmdOptions = new MongoCmdOptionsBuilder().useNoJournal(false).verbose(false)
-          .build();
+    private static IMongodConfig crateConfigServerConfig(int configServerPort) throws UnknownHostException, IOException {
+      IMongoCmdOptions cmdOptions = new MongoCmdOptionsBuilder()
+        .useNoPrealloc(false)
+        .useSmallFiles(false)
+        .useNoJournal(false)
+        .useStorageEngine(STORAGE_ENGINE)
+        .verbose(false)
+        .build();
+
+      Storage replication = new Storage(null, CONFIG_REPLICA_SET, 0);

       IMongodConfig mongodConfig = new MongodConfigBuilder()
-          .version(Version.Main.PRODUCTION)
+          .version(Version.Main.V3_4)
           .net(new Net(LOCALHOST, configServerPort, Network.localhostIsIPv6()))
-          .configServer(flag).cmdOptions(cmdOptions).build();
+          .replication(replication)
+          .shardServer(false)
+          .configServer(true).cmdOptions(cmdOptions).build();
--- End diff --

Please move `.cmdOptions(cmdOptions).build();` to new lines.


---


Re: Display column data type without code

2018-04-29 Thread Paul Rogers
Turns out I really needed better type functions in order to explain the nuances 
of Drill types, so I went ahead and created them.

See DRILL-6361, PR #1242 [1]. Examples shown in the PR. Reviewers very much 
appreciated.

Thanks,
- Paul

[1] https://github.com/apache/drill/pull/1242

 

On Saturday, April 28, 2018, 5:58:47 PM PDT, Charles Givre 
 wrote:  
 
 I’d like to weigh in here: this would be EXTREMELY useful. When I was 
trying to write connectors to enable various BI tools to connect to Drill, such 
as SQLPad and Metabase, the inability to get information about how Drill 
interprets the data was really difficult to get around. Just my .02. 

> On Apr 28, 2018, at 18:05, Paul Rogers  wrote:
> 
> Hi Rob,
> 
> Thanks for the suggestion. While this works for Hive (as you showed), it does 
> not work for CSV files:
> 
> DESCRIBE `csvh/cust.csvh`;
> +--------------+------------+--------------+
> | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> +--------------+------------+--------------+
> +--------------+------------+--------------+
> 
> The typeof() function is handy, but does not report the "is nullable" (or 
> repeated) "mode" of a column, and it loses the data type if a value is null. 
> The following CSV file (with headers) uses non-nullable VARCHAR columns:
> 
> SELECT typeof(custId) FROM `csvh/cust.csvh`;
> +----------+
> |  EXPR$0  |
> +----------+
> | VARCHAR  |
> +----------+
> 
> Now, do something similar with JSON which uses a (nullable) VARCHAR:
> 
> SELECT typeof(a) FROM `json/str-null.json`;
> +----------+
> |  EXPR$0  |
> +----------+
> | VARCHAR  |
> | NULL     |
> +----------+
> 
> Finally, use a CSV file without headers, so that all columns are returned in 
> the columns[] array:
> 
> SELECT typeof(columns) FROM `csv/cust.csv`;
> +----------+
> |  EXPR$0  |
> +----------+
> | VARCHAR  |
> +----------+
> 
> We know that the three "VARCHAR" results are different because we know how 
> Drill works internally. But, the output of sqlline does not express that 
> knowledge.
> 
> Sqlline presents all data as strings, which often hides the data type and 
> other details, making it look like things work better than they actually do. 
> You can see this by running a query against two JSON files where a VarChar 
> column is missing from one of them. Drill guesses "nullable Int", Sqlline 
> shows the value as null, and typeof() shows the type as NULL, hiding the fact 
> that there is actually a schema conflict (schema change) lurking in the data 
> that manifests only if, say, you sort the data.
> 
> Bottom line: it seems that, at present, there isn't a good way (short of 
> writing some Java code that uses the native Drill API) to get the 
> actual, detailed type of a column with both data type and cardinality 
> ("mode").
> 
> 
> So, it would be great, when explaining Drill concepts, if there were a clean 
> non-code way to show people the actual structure of the data. (Yep, I know 
> Drill is open source and welcomes contributions, so I'll try to offer a 
> solution when I get time...)
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Thursday, April 26, 2018, 10:08:04 AM PDT, Rob Wu  
>wrote:  
> 
> Hi Paul,
> 
> You could also use DESCRIBE (https://drill.apache.org/docs/describe/).
> 
> 0: jdbc:drill:drillbit=localhost:31010> describe
> `hive.default`.`integer_table`
> . . . . . . . . . . . . . . . . . . . > ;
> +--------------+--------------------+--------------+
> | COLUMN_NAME  |     DATA_TYPE      | IS_NULLABLE  |
> +--------------+--------------------+--------------+
> | keycolumn    | CHARACTER VARYING  | YES          |
> | column1      | INTEGER            | YES          |
> +--------------+--------------------+--------------+
> 
> Best regards,
> 
> Rob
> 
> On Wed, Apr 25, 2018 at 10:12 PM, Abhishek Girish 
> wrote:
> 
>> Hey Paul,
>> 
>> You could use the typeof() function for this purpose. It takes a single
>> parameter - the column name.
>> 
>> For example:
>>> select typeof(c_current_cdemo_sk) from customer limit 1;
>> +---------+
>> | EXPR$0  |
>> +---------+
>> | BIGINT  |
>> +---------+
>> 1 row selected (0.472 seconds)
>> 
>> 
>> On Wed, Apr 25, 2018 at 9:23 PM Paul Rogers 
>> wrote:
>> 
>>> Hi All,
>>> Anyone know if there is a non-code way to display the data types of
>>> columns returned from a Drill query? Sqlline appears to only show the
>>> column names and values. The same is true of the Drill web console.
>>> The EXPLAIN PLAN FOR ... command shows the query plan, but not type
>> (which
>>> are only known at run time.) Is there a statement, system table or some
>>> other trick to display column types in, say, Sqlline?
>>> In the past, I've gotten the types by using unit test style code. But,
>>> that is not to handy for use as an example for non-developers...
>>> Thanks,
>>> - Paul
>>> 
>>> 
>> 
  

[GitHub] drill pull request #1242: DRILL-6361: Revised typeOf() function versions

2018-04-29 Thread paul-rogers
GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/1242

DRILL-6361: Revised typeOf() function versions

Drill provides the `typeof()` function to return the type of a column. 
However, this function has two key limitations:

1. It returns NULL if any column value is NULL. But, Drill has no NULL 
type, so this masks the underlying type. This is especially annoying for 
columns which are all NULL, such as "missing" columns.
2. It does not return the cardinality (AKA "mode") of the column.

This PR introduces two new functions that solve these issues.

### New Functions

`sqlTypeOf()` returns the data type (using the SQL names) whether the 
column is NULL or not. The SQL name is the one that can be used in a CAST 
statement. Thus,

```
sqlTypeOf( CAST(x AS <type> ))
```

returns `<type>` as the type name.

`modeOf()` returns the cardinality (mode) of the column as "NOT NULL", 
"NULLABLE" or "ARRAY". (Suggestions for better terms are welcome.) The Drill 
terms are not used because they are more Parquet-like than SQL-like.

Finally, there is the `drillTypeOf()` function, which works just like 
`sqlTypeOf()` but returns the internal Drill names.

### Example

Here is an example usage that highlights our old friend, "nullable int" for 
a missing column:

```
SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode FROM 
`json/all-null.json`;

+----------+-----------+
|  a_type  |  a_mode   |
+----------+-----------+
| INTEGER  | NULLABLE  |
+----------+-----------+
```

For arrays (repeated) types:

```
SELECT sqlTypeOf(columns) as col_type, modeOf(columns) as col_mode
FROM `csv/cust.csv`;

+--------------------+-----------+
|      col_type      | col_mode  |
+--------------------+-----------+
| CHARACTER VARYING  | ARRAY     |
+--------------------+-----------+
```

For non-null types:

```
SELECT sqlTypeOf(`name`) AS name_type, 
modeOf(`name`) AS name_mode FROM `csvh/cust.csvh`;

+--------------------+------------+
|     name_type      | name_mode  |
+--------------------+------------+
| CHARACTER VARYING  | NOT NULL   |
+--------------------+------------+
```

The result is that the internal Drill type is made very plain to the user 
of `sqlline`.

### UDF Utility Methods

To save some typing, this PR also includes a few helper functions to make 
it easier to write UDFs. These functions were first described in the blog post 
[UDF Background 
Information](https://github.com/paul-rogers/drill/wiki/UDFs-Background-Information)
 and on the 
[Troubleshooting](https://github.com/paul-rogers/drill/wiki/UDF-Troubleshooting) 
page.

In particular, to return a string, the old `typeof()` implementation uses:

```
  byte[] type = typeName.getBytes();
  buf = buf.reallocIfNeeded(type.length);
  buf.setBytes(0, type);
  out.buffer = buf;
  out.start = 0;
  out.end = type.length;
```

While the new functions use:

```
  
org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharOutput(
    typeName, buf, out);
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/paul-rogers/drill DRILL-6361

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1242


commit 7acf6cc77581c15981cf5cc7ac1a2b3780324f40
Author: Paul Rogers 
Date:   2018-04-29T06:04:26Z

DRILL-6361: Revised typeOf() function versions




---