[jira] [Created] (DRILL-5229) Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0

2017-01-26 Thread Rahul Raj (JIRA)
Rahul Raj created DRILL-5229:


 Summary: Upgrade kudu client to org.apache.kudu:kudu-client:1.2.0 
 Key: DRILL-5229
 URL: https://issues.apache.org/jira/browse/DRILL-5229
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Other
Affects Versions: 1.8.0
Reporter: Rahul Raj
 Fix For: 2.0.0


Getting an error -" out-of-order key" for a query select v,count(k) from
kudu.test group by v where k is the primary key. This happens only when the
aggregation is done on primary key. Should drill move to the latest kudu
client to investigate this further?

The current Drill Kudu connector uses org.kududb:kudu-client:0.6.0 from the
Cloudera repository, whereas the latest released library,
org.apache.kudu:kudu-client:1.2.0, is hosted on Maven Central. There are a
few breaking changes in the new library:

   1. TIMESTAMP was renamed to UNIXTIME_MICROS
   2. In KuduRecordReader#setup,
   KuduScannerBuilder#lowerBoundPartitionKeyRaw was renamed to lowerBoundRaw
   and KuduScannerBuilder#exclusiveUpperBoundPartitionKeyRaw was renamed to
   exclusiveUpperBoundRaw. Both methods are deprecated.
   3. In KuduRecordWriterImpl#updateSchema, client.createTable(name,
   kuduSchema) now requires a CreateTableOptions as the third argument (see
   the sketch below)
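A minimal sketch of items 1 and 3 against the org.apache.kudu:kudu-client:1.2.0
API (the master address, table name and column names are made up for
illustration; the exact migration should be confirmed against the 1.2.0 javadoc):

{code}
import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class KuduUpgradeSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

    // Item 1: the TIMESTAMP type is now called UNIXTIME_MICROS.
    ColumnSchema key = new ColumnSchema.ColumnSchemaBuilder("k", Type.INT64).key(true).build();
    ColumnSchema ts = new ColumnSchema.ColumnSchemaBuilder("event_time", Type.UNIXTIME_MICROS).build();
    Schema schema = new Schema(Arrays.asList(key, ts));

    // Item 3: createTable() now takes a CreateTableOptions third argument.
    CreateTableOptions options = new CreateTableOptions()
        .setRangePartitionColumns(Arrays.asList("k"))
        .setNumReplicas(1);
    client.createTable("test", schema, options);

    client.shutdown();
  }
}
{code}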





Re: Data types

2017-01-26 Thread Paul Rogers
Looks like the advice I gave you was a bit off. The method you want is one of these:

this.buffer = fragmentContext.getManagedBuffer();

The above allocates a 256-byte buffer. You can initially allocate a larger one:

this.buffer = fragmentContext.getManagedBuffer(4096);

Or, to reallocate:

   buffer = fragmentContext.replace(buffer, 8192);

Again, I’ve not used these methods myself, but they seem like they might do the trick.
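For context, here is a minimal sketch of how those calls might fit together in a
record reader. It assumes only the FragmentContext methods named above and is not
compiled against any particular Drill version; the class and field names are
illustrative:

import io.netty.buffer.DrillBuf;
import org.apache.drill.exec.ops.FragmentContext;

public class ManagedBufferSketch {
  private final FragmentContext fragmentContext;  // handed to the record reader
  private DrillBuf buffer;

  ManagedBufferSketch(FragmentContext context) {
    this.fragmentContext = context;
    // Start with something larger than the 256-byte default; the fragment's
    // allocator tracks the buffer and releases it when the fragment closes.
    this.buffer = fragmentContext.getManagedBuffer(4096);
  }

  void ensureCapacity(int neededBytes) {
    if (neededBytes > buffer.capacity()) {
      // Swap in a larger managed buffer instead of letting setBytes() overflow.
      buffer = fragmentContext.replace(buffer, neededBytes);
    }
  }
}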

- Paul

> On Jan 26, 2017, at 9:51 PM, Charles Givre  wrote:
> 
> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
> tried your changes and now I’m getting this error:
> 
> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
> 
> Fragment 0:0
> 
> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 
> 
> 
> 
>> On Jan 26, 2017, at 23:08, Paul Rogers  wrote:
>> 
>> Hi Charles,
>> 
>> Very cool plugin!
>> 
>> My knowledge in this area is a bit sketchy… That said, the problem appears 
>> to be that the code does not extend the Drillbuf to ensure it has sufficient 
>> capacity. Try calling this method: reallocIfNeeded, something like this:
>> 
>>  this.buffer.reallocIfNeeded(stringLength);
>>  this.buffer.setBytes(0, bytes, 0, stringLength);
>>  map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>> 
>> Then, comment out the 256 length hack and see if it works.
>> 
>> To avoid memory fragmentation, maybe change your loop as:
>> 
>>   int maxRecords = MAX_RECORDS_PER_BATCH;
>>   int maxWidth = 256;
>>   while(recordCount < maxRecords &&(line = this.reader.readLine()) 
>> != null){
>>   …
>>  if(stringLength > maxWidth) {
>> maxWidth = stringLength;
>> maxRecords = 16 * 1024 * 1024 / maxWidth;
>>  }
>> 
>> The above is not perfect (the last record added might be much larger than 
>> the others, causing the corresponding vector to grow larger than 16 MB, but 
>> the occasional large vector should be OK.)
>> 
>> Thanks,
>> 
>> - Paul
>> 
>> On Jan 26, 2017, at 5:31 PM, Charles Givre 
>> > wrote:
>> 
>> Hi Paul,
>> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
>> correctly.  Just for context, I’m working on a generic log file reader for 
>> drill (https://github.com/cgivre/drill-logfile-plugin 
>> ), and I encountered some 
>> errors when working with fields that were > 256 characters long.  It isn’t a 
>> storage plugin, but it extends the EasyFormatPlugin.
>> 
>> I added some code to truncate the strings to 256 chars, it worked.  Before 
>> this it was throwing errors as shown below:
>> 
>> 
>> 
>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>> 
>> Fragment 0:0
>> 
>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>> 
>> 
>> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, 
>> how do I set the size of each row batch?
>> Thank you for your help.
>> — C
>> 
>> 
>> if (m.find()) {
>>  for( int i = 1; i <= m.groupCount(); i++ )
>>  {
>>  //TODO Add option for date fields
>>  String fieldName  = fieldNames.get(i - 1);
>>  String fieldValue;
>> 
>>  fieldValue = m.group(i);
>> 
>>  if( fieldValue == null){
>>  fieldValue = "";
>>  }
>>  byte[] bytes = fieldValue.getBytes("UTF-8");
>> 
>> //Added this and it worked….
>>  int stringLength = bytes.length;
>>  if( stringLength > 256 ){
>>  stringLength = 256;
>>  }
>> 
>>  this.buffer.setBytes(0, bytes, 0, stringLength);
>>  map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>  }
>> 
>> 
>> 
>> 
>> On Jan 26, 2017, at 20:20, Paul Rogers 
>> > wrote:
>> 
>> Hi Charles,
>> 
>> The Varchar column can hold any length of data. We’ve recently been working 
>> on tests that have columns up to 8K in length.
>> 
>> The one caveat is that, when working with data larger than 256 bytes, you 
>> must be extremely careful in your reader. The out-of-box text reader will 
>> always read 64K rows. This (due to various issues) can cause memory 
>> fragmentation and OOM errors when used with columns greater than 256 bytes 
>> in width.
>> 
>> If you are developing your own storage plugin, then adjust the size of each 
>> row batch so that no single vector is larger than 16 MB in size. Then you 
>> can use any size of column.
>> 
>> Suppose your logs contain text lines up to, say, 1K in size. This means that 
>> each record batch your reader produces must be of size less than 16 MB / 1K 
>> / row = 1600 rows (rather than the usual 64K.)
>> 
>> Once the data is in the Varchar column, the rest of 

Re: Data types

2017-01-26 Thread Charles Givre
Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
tried your changes and now I’m getting this error:

0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
Error: DATA_READ ERROR: Tried to remove unmanaged buffer.

Fragment 0:0

[Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)




> On Jan 26, 2017, at 23:08, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> Very cool plugin!
> 
> My knowledge in this area is a bit sketchy… That said, the problem appears to 
> be that the code does not extend the Drillbuf to ensure it has sufficient 
> capacity. Try calling this method: reallocIfNeeded, something like this:
> 
>   this.buffer.reallocIfNeeded(stringLength);
>   this.buffer.setBytes(0, bytes, 0, stringLength);
>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
> 
> Then, comment out the 256 length hack and see if it works.
> 
> To avoid memory fragmentation, maybe change your loop as:
> 
>int maxRecords = MAX_RECORDS_PER_BATCH;
>int maxWidth = 256;
>while(recordCount < maxRecords &&(line = this.reader.readLine()) 
> != null){
>…
>   if(stringLength > maxWidth) {
>  maxWidth = stringLength;
>  maxRecords = 16 * 1024 * 1024 / maxWidth;
>   }
> 
> The above is not perfect (the last record added might be much larger than the 
> others, causing the corresponding vector to grow larger than 16 MB, but the 
> occasional large vector should be OK.)
> 
> Thanks,
> 
> - Paul
> 
> On Jan 26, 2017, at 5:31 PM, Charles Givre 
> > wrote:
> 
> Hi Paul,
> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
> correctly.  Just for context, I’m working on a generic log file reader for 
> drill (https://github.com/cgivre/drill-logfile-plugin 
> ), and I encountered some 
> errors when working with fields that were > 256 characters long.  It isn’t a 
> storage plugin, but it extends the EasyFormatPlugin.
> 
> I added some code to truncate the strings to 256 chars, it worked.  Before 
> this it was throwing errors as shown below:
> 
> 
> 
> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
> 
> Fragment 0:0
> 
> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
> charless-mbp-2.fios-router.home:31010] (state=,code=0)
> 
> 
> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
> do I set the size of each row batch?
> Thank you for your help.
> — C
> 
> 
> if (m.find()) {
>   for( int i = 1; i <= m.groupCount(); i++ )
>   {
>   //TODO Add option for date fields
>   String fieldName  = fieldNames.get(i - 1);
>   String fieldValue;
> 
>   fieldValue = m.group(i);
> 
>   if( fieldValue == null){
>   fieldValue = "";
>   }
>   byte[] bytes = fieldValue.getBytes("UTF-8");
> 
> //Added this and it worked….
>   int stringLength = bytes.length;
>   if( stringLength > 256 ){
>   stringLength = 256;
>   }
> 
>   this.buffer.setBytes(0, bytes, 0, stringLength);
>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>   }
> 
> 
> 
> 
> On Jan 26, 2017, at 20:20, Paul Rogers 
> > wrote:
> 
> Hi Charles,
> 
> The Varchar column can hold any length of data. We’ve recently been working 
> on tests that have columns up to 8K in length.
> 
> The one caveat is that, when working with data larger than 256 bytes, you 
> must be extremely careful in your reader. The out-of-box text reader will 
> always read 64K rows. This (due to various issues) can cause memory 
> fragmentation and OOM errors when used with columns greater than 256 bytes in 
> width.
> 
> If you are developing your own storage plugin, then adjust the size of each 
> row batch so that no single vector is larger than 16 MB in size. Then you can 
> use any size of column.
> 
> Suppose your logs contain text lines up to, say, 1K in size. This means that 
> each record batch your reader produces must be of size less than 16 MB / 1K / 
> row = 1600 rows (rather than the usual 64K.)
> 
> Once the data is in the Varchar column, the rest of Drill should “just work” 
> on that data.
> 
> - Paul
> 
> On Jan 26, 2017, at 4:11 PM, Charles Givre 
> > wrote:
> 
> I’m working on a plugin to read log files and the data has some long strings. 
>  Is there a data type that can hold strings longer than 256 characters?
> Thanks,
> — Charles
> 
> 
> 



[jira] [Created] (DRILL-5228) Several operators in the attached query profile take more time than expected

2017-01-26 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5228:


 Summary: Several operators in the attached query profile take more 
time than expected
 Key: DRILL-5228
 URL: https://issues.apache.org/jira/browse/DRILL-5228
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.10.0
Reporter: Rahul Challapalli


Environment :
{code}
git.commit.id.abbrev=2af709f
DRILL_MAX_DIRECT_MEMORY="32G"
DRILL_MAX_HEAP="4G"
{code}

Data Set : 
{code}
Size : ~18 GB
No Of Columns : 1
Column Width : 256 bytes
{code}

Query (took ~127 minutes to complete):
{code}
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.disable_exchanges` = true;
alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` 
order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
{code}

*Selection Vector Remover*
{code}
Time Spent based on profile : 7m58s
Problem : Since the external sort spilled to disk in this case, the 
selection vector remover should have been a no-op. There is no clear 
justification for the time spent.
{code}

*Text Sub Scan*
{code}
Time spent based on profile : 13m25s
Problem : I captured the profile screenshot (before-spill.png) once the memory 
allocation for the sort reached its limit. Based on this, the scan took 2m13s 
to read the first 12.48 GB of data before sorting/spilling began. For the 
remaining ~5.5 GB it took ~11 minutes.
{code}

*Projects*
{code}
Timings for the 4 projects based on the profile. While I do not have a concrete 
reason to be suspicious, these numbers seem high.
Project 1 : 4m54s
Project 2 : 3m07s
Project 3 : 4m10s
Project 4 : 0.003s
{code}

The time reported for the external sort in the profile is wrong. DRILL-5227 
has been filed for this.





Re: Data types

2017-01-26 Thread Paul Rogers
Hi Charles,

Very cool plugin!

My knowledge in this area is a bit sketchy… That said, the problem appears to 
be that the code does not grow the DrillBuf to ensure it has sufficient 
capacity. Try calling reallocIfNeeded, something like this:

   this.buffer.reallocIfNeeded(stringLength);
   this.buffer.setBytes(0, bytes, 0, stringLength);
   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);

Then, comment out the 256 length hack and see if it works.

To avoid memory fragmentation, maybe change your loop to something like this:

int maxRecords = MAX_RECORDS_PER_BATCH;
int maxWidth = 256;
while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
   …
   if (stringLength > maxWidth) {
      maxWidth = stringLength;
      maxRecords = 16 * 1024 * 1024 / maxWidth;
   }
}

The above is not perfect (the last record added might be much larger than the 
others, causing the corresponding vector to grow larger than 16 MB), but the 
occasional large vector should be OK.
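Putting the two suggestions together, the body of the read loop might look
roughly like this. This is only a sketch assembled from the snippets in this
thread (note that reallocIfNeeded returns a buffer, so the result is assigned
back), not tested code:

String line;
while (recordCount < maxRecords && (line = reader.readLine()) != null) {
   byte[] bytes = line.getBytes("UTF-8");
   int stringLength = bytes.length;

   // Grow the value buffer as needed instead of truncating at 256 bytes.
   buffer = buffer.reallocIfNeeded(stringLength);
   buffer.setBytes(0, bytes, 0, stringLength);
   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);

   // Shrink the batch row limit so no single vector grows far past 16 MB.
   if (stringLength > maxWidth) {
      maxWidth = stringLength;
      maxRecords = 16 * 1024 * 1024 / maxWidth;
   }
   recordCount++;
}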

Thanks,

- Paul

On Jan 26, 2017, at 5:31 PM, Charles Givre 
> wrote:

Hi Paul,
Would you mind taking a look at my code?  I’m wondering if I’m doing this 
correctly.  Just for context, I’m working on a generic log file reader for 
drill (https://github.com/cgivre/drill-logfile-plugin 
), and I encountered some 
errors when working with fields that were > 256 characters long.  It isn’t a 
storage plugin, but it extends the EasyFormatPlugin.

I added some code to truncate the strings to 256 chars, it worked.  Before this 
it was throwing errors as shown below:



Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))

Fragment 0:0

[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)


The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
do I set the size of each row batch?
Thank you for your help.
— C


if (m.find()) {
   for( int i = 1; i <= m.groupCount(); i++ )
   {
   //TODO Add option for date fields
   String fieldName  = fieldNames.get(i - 1);
   String fieldValue;

   fieldValue = m.group(i);

   if( fieldValue == null){
   fieldValue = "";
   }
   byte[] bytes = fieldValue.getBytes("UTF-8");

//Added this and it worked….
   int stringLength = bytes.length;
   if( stringLength > 256 ){
   stringLength = 256;
   }

   this.buffer.setBytes(0, bytes, 0, stringLength);
   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
   }




On Jan 26, 2017, at 20:20, Paul Rogers 
> wrote:

Hi Charles,

The Varchar column can hold any length of data. We’ve recently been working on 
tests that have columns up to 8K in length.

The one caveat is that, when working with data larger than 256 bytes, you must 
be extremely careful in your reader. The out-of-box text reader will always 
read 64K rows. This (due to various issues) can cause memory fragmentation and 
OOM errors when used with columns greater than 256 bytes in width.

If you are developing your own storage plugin, then adjust the size of each row 
batch so that no single vector is larger than 16 MB in size. Then you can use 
any size of column.

Suppose your logs contain text lines up to, say, 1K in size. This means that 
each record batch your reader produces must be of size less than 16 MB / 1K / 
row = 1600 rows (rather than the usual 64K.)

Once the data is in the Varchar column, the rest of Drill should “just work” on 
that data.

- Paul

On Jan 26, 2017, at 4:11 PM, Charles Givre 
> wrote:

I’m working on a plugin to read log files and the data has some long strings.  
Is there a data type that can hold strings longer than 256 characters?
Thanks,
— Charles





[jira] [Created] (DRILL-5227) Wrong time reported in the query profile for the external sort

2017-01-26 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5227:


 Summary: Wrong time reported in the query profile for the external 
sort
 Key: DRILL-5227
 URL: https://issues.apache.org/jira/browse/DRILL-5227
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators, Web Server
Affects Versions: 1.10.0
Reporter: Rahul Challapalli


git.commit.id.abbrev=2af709f

Data Set :
{code}
Size : ~18 GB
No Of Columns : 1
Column Width : 256 bytes
{code}

The query below took ~127 minutes. However, the profile indicated that the 
External Sort alone took 17h27m. Something is wrong.

{code}
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.disable_exchanges` = true;
alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` 
order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf'
{code}

I attached the query profile. The data set and the logs are too large to attach 
to a JIRA.





Re: Data types

2017-01-26 Thread Charles Givre
Hi Paul, 
Would you mind taking a look at my code?  I’m wondering if I’m doing this 
correctly.  Just for context, I’m working on a generic log file reader for 
Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered some 
errors when working with fields that were > 256 characters long.  It isn’t a 
storage plugin, but it extends the EasyFormatPlugin. 

I added some code to truncate the strings to 256 chars, and it worked.  Before 
this it was throwing errors as shown below:



Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))

Fragment 0:0

[Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
charless-mbp-2.fios-router.home:31010] (state=,code=0)


The query that generated this was just a SELECT * FROM dfs.`file`.  Also, how 
do I set the size of each row batch?
Thank you for your help.
— C


if (m.find()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        //TODO Add option for date fields
        String fieldName = fieldNames.get(i - 1);
        String fieldValue;

        fieldValue = m.group(i);

        if (fieldValue == null) {
            fieldValue = "";
        }
        byte[] bytes = fieldValue.getBytes("UTF-8");

        //Added this and it worked….
        int stringLength = bytes.length;
        if (stringLength > 256) {
            stringLength = 256;
        }

        this.buffer.setBytes(0, bytes, 0, stringLength);
        map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
    }
}




> On Jan 26, 2017, at 20:20, Paul Rogers  wrote:
> 
> Hi Charles,
> 
> The Varchar column can hold any length of data. We’ve recently been working 
> on tests that have columns up to 8K in length.
> 
> The one caveat is that, when working with data larger than 256 bytes, you 
> must be extremely careful in your reader. The out-of-box text reader will 
> always read 64K rows. This (due to various issues) can cause memory 
> fragmentation and OOM errors when used with columns greater than 256 bytes in 
> width.
> 
> If you are developing your own storage plugin, then adjust the size of each 
> row batch so that no single vector is larger than 16 MB in size. Then you can 
> use any size of column.
> 
> Suppose your logs contain text lines up to, say, 1K in size. This means that 
> each record batch your reader produces must be of size less than 16 MB / 1K / 
> row = 1600 rows (rather than the usual 64K.)
> 
> Once the data is in the Varchar column, the rest of Drill should “just work” 
> on that data.
> 
> - Paul
> 
>> On Jan 26, 2017, at 4:11 PM, Charles Givre  wrote:
>> 
>> I’m working on a plugin to read log files and the data has some long 
>> strings.  Is there a data type that can hold strings longer than 256 
>> characters?
>> Thanks,
>> — Charles
> 



Re: Data types

2017-01-26 Thread Paul Rogers
Hi Charles,

The Varchar column can hold any length of data. We’ve recently been working on 
tests that have columns up to 8K in length.

The one caveat is that, when working with data larger than 256 bytes, you must 
be extremely careful in your reader. The out-of-box text reader will always 
read 64K rows. This (due to various issues) can cause memory fragmentation and 
OOM errors when used with columns greater than 256 bytes in width.

If you are developing your own storage plugin, then adjust the size of each row 
batch so that no single vector is larger than 16 MB in size. Then you can use 
any size of column.

Suppose your logs contain text lines up to, say, 1K in size. This means that 
each record batch your reader produces must be limited to roughly 16 MB / 1 KB 
per row ≈ 16K rows (rather than the usual 64K).
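As a back-of-the-envelope check of that sizing rule (plain arithmetic, nothing
Drill-specific):

int vectorLimit = 16 * 1024 * 1024;            // target upper bound per value vector
int maxRowWidth = 1024;                        // assume log lines up to ~1 KB
int rowsPerBatch = vectorLimit / maxRowWidth;  // 16384 rows, well below the usual 65536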

Once the data is in the Varchar column, the rest of Drill should “just work” on 
that data.

- Paul

> On Jan 26, 2017, at 4:11 PM, Charles Givre  wrote:
> 
> I’m working on a plugin to read log files and the data has some long strings. 
>  Is there a data type that can hold strings longer than 256 characters?
> Thanks,
> — Charles



Data types

2017-01-26 Thread Charles Givre
I’m working on a plugin to read log files and the data has some long strings.  
Is there a data type that can hold strings longer than 256 characters?
Thanks,
— Charles 

[jira] [Created] (DRILL-5226) External Sort encountered an error while spilling to disk

2017-01-26 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5226:


 Summary: External Sort encountered an error while spilling to disk
 Key: DRILL-5226
 URL: https://issues.apache.org/jira/browse/DRILL-5226
 Project: Apache Drill
  Issue Type: Bug
  Components: Execution - Relational Operators
Affects Versions: 1.10.0
Reporter: Rahul Challapalli


Environment : 
{code}
git.commit.id.abbrev=2af709f
DRILL_MAX_DIRECT_MEMORY="32G"
DRILL_MAX_HEAP="4G"
Nodes in Mapr Cluster : 1
Data Size : ~ 0.35 GB
No of Columns : 1
Width of column : 256 chars
{code}

The query below fails before spilling to disk, due to wrong estimates of the 
record batch size.
{code}
0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
`planner.width.max_per_node` = 1;
+-------+-------------------------------------+
|  ok   |               summary               |
+-------+-------------------------------------+
| true  | planner.width.max_per_node updated. |
+-------+-------------------------------------+
1 row selected (1.11 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
`planner.memory.max_query_memory_per_node` = 62914560;
+-------+----------------------------------------------------+
|  ok   |                      summary                       |
+-------+----------------------------------------------------+
| true  | planner.memory.max_query_memory_per_node updated.  |
+-------+----------------------------------------------------+
1 row selected (0.362 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
`planner.disable_exchanges` = true;
+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | planner.disable_exchanges updated. |
+-------+------------------------------------+
1 row selected (0.277 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select * from (select * from 
dfs.`/drill/testdata/resource-manager/250wide-small.tbl` order by columns[0])d 
where d.columns[0] = 'ljdfhwuehnoiueyf';
Error: RESOURCE ERROR: External Sort encountered an error while spilling to disk

Unable to allocate buffer of size 1048576 (rounded from 618889) due to memory 
limit. Current allocation: 62736000
Fragment 0:0

[Error Id: 1bb933c8-7dc6-4cbd-8c8e-0e095baac719 on qa-node190.qa.lab:31010] 
(state=,code=0)
{code}

Exception from the logs
{code}
2017-01-26 15:33:09,307 [277578d5-8bea-27db-0da1-cec0f53a13df:frag:0:0] INFO  
o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred: External Sort 
encountered an error while spilling to disk (Unable to allocate buffer of size 
1048576 (rounded from 618889) due to memory limit. Current allocation: 62736000)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort 
encountered an error while spilling to disk

Unable to allocate buffer of size 1048576 (rounded from 618889) due to memory 
limit. Current allocation: 62736000

[Error Id: 1bb933c8-7dc6-4cbd-8c8e-0e095baac719 ]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544)
 ~[drill-common-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:603)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:411)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215)
 [drill-java-exec-1.10.0-SNAPSHOT.jar:1.10.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
 

[GitHub] drill pull request #729: Drill 1328 r4

2017-01-26 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r98108433
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/AnalyzeTableHandler.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.sql.handlers;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.schema.Table;
+import org.apache.calcite.sql.SqlIdentifier;
+import org.apache.calcite.sql.SqlNode;
+import org.apache.calcite.sql.SqlNodeList;
+import org.apache.calcite.sql.SqlSelect;
+import org.apache.calcite.sql.parser.SqlParserPos;
+import org.apache.calcite.tools.RelConversionException;
+import org.apache.calcite.tools.ValidationException;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.logical.FormatPluginConfig;
+import org.apache.drill.exec.dotdrill.DotDrillType;
+import org.apache.drill.exec.physical.PhysicalPlan;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.planner.logical.DrillAnalyzeRel;
+import org.apache.drill.exec.planner.logical.DrillRel;
+import org.apache.drill.exec.planner.logical.DrillScreenRel;
+import org.apache.drill.exec.planner.logical.DrillStoreRel;
+import org.apache.drill.exec.planner.logical.DrillWriterRel;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.planner.physical.Prel;
+import org.apache.drill.exec.planner.sql.DirectPlan;
+import org.apache.drill.exec.planner.sql.SchemaUtilites;
+import org.apache.drill.exec.planner.sql.parser.SqlAnalyzeTable;
+import org.apache.drill.exec.store.AbstractSchema;
+import org.apache.drill.exec.store.dfs.DrillFileSystem;
+import org.apache.drill.exec.store.dfs.FileSystemPlugin;
+import org.apache.drill.exec.store.dfs.FormatSelection;
+import org.apache.drill.exec.store.dfs.NamedFormatPluginConfig;
+import org.apache.drill.exec.store.parquet.ParquetFormatConfig;
+import org.apache.drill.exec.util.Pointer;
+import org.apache.drill.exec.work.foreman.ForemanSetupException;
+import org.apache.drill.exec.work.foreman.SqlUnsupportedException;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FileStatus;
+
+public class AnalyzeTableHandler extends DefaultSqlHandler {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(AnalyzeTableHandler.class);
+
+  public AnalyzeTableHandler(SqlHandlerConfig config, Pointer 
textPlan) {
+super(config, textPlan);
+  }
+
+  @Override
+  public PhysicalPlan getPlan(SqlNode sqlNode)
+  throws ValidationException, RelConversionException, IOException, 
ForemanSetupException {
+final SqlAnalyzeTable sqlAnalyzeTable = unwrap(sqlNode, 
SqlAnalyzeTable.class);
+
+verifyNoUnsupportedFunctions(sqlAnalyzeTable);
+
+SqlIdentifier tableIdentifier = sqlAnalyzeTable.getTableIdentifier();
+SqlSelect scanSql = new SqlSelect(
+SqlParserPos.ZERO,  /* position */
+SqlNodeList.EMPTY,  /* keyword list */
+getColumnList(sqlAnalyzeTable), /* select list */
+tableIdentifier,/* from */
+null,   /* where */
+null,   /* group by */
+null,   /* having */
+null,   /* windowDecls */
+null,   /* orderBy */
+null,   /* offset */
+null/* fetch */
+);
+
+final ConvertedRelNode convertedRelNode = 
validateAndConvert(rewrite(scanSql));
+final RelNode relScan = convertedRelNode.getConvertedNode();
+final String tableName = 

[GitHub] drill pull request #729: Drill 1328 r4

2017-01-26 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r98101328
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java
 ---
@@ -169,4 +175,43 @@ private static boolean containIdentity(List exps,
 }
 return true;
   }
+
+  /**
+   * Returns whether statistics-based estimates or guesses are used by the 
optimizer
+   * */
+  public static boolean guessRows(RelNode rel) {
+final PlannerSettings settings =
+
rel.getCluster().getPlanner().getContext().unwrap(PlannerSettings.class);
+if (!settings.useStatistics()) {
+  return true;
+}
+if (rel instanceof RelSubset) {
--- End diff --

It is unclear why RelSubset and HepRelVertex are treated in a special way. 




[GitHub] drill pull request #729: Drill 1328 r4

2017-01-26 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r98077612
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/NestedLoopJoinPrule.java
 ---
@@ -84,8 +88,14 @@ public void onMatch(RelOptRuleCall call) {
 if (!settings.isNestedLoopJoinEnabled()) {
   return;
 }
-
-final DrillJoinRel join = (DrillJoinRel) call.rel(0);
+int[] joinFields = new int[2];
+DrillJoinRel join = (DrillJoinRel) call.rel(0);
+// If right outer join on simply equi join convert it to left outer 
join. We only support left outer NLJ as of now
+if (join.getJoinType() == JoinRelType.RIGHT
--- End diff --

Can you do this change as part of a separate JIRA, since it is unrelated to 
statistics?




[GitHub] drill pull request #656: DRILL-5034: Select timestamp from hive generated pa...

2017-01-26 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/656#discussion_r98070065
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderUtility.java
 ---
@@ -323,18 +323,28 @@ public static DateCorruptionStatus 
checkForCorruptDateValuesInStatistics(Parquet
* @param binaryTimeStampValue
*  hive, impala timestamp values with nanoseconds precision
*  are stored in parquet Binary as INT96 (12 constant bytes)
-   *
+   * @param retainLocalTimezone
+   *  parquet files don't keep local timeZone according to the
+   *  https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#timestamp;>Parquet
 spec,
+   *  but some tools (hive, for example) retain local timezone for 
parquet files by default
+   *  Note: Impala doesn't retain local timezone by default
* @return  Unix Timestamp - the number of milliseconds since January 1, 
1970, 00:00:00 GMT
*  represented by @param binaryTimeStampValue .
*/
-public static long getDateTimeValueFromBinary(Binary 
binaryTimeStampValue) {
+public static long getDateTimeValueFromBinary(Binary 
binaryTimeStampValue, boolean retainLocalTimezone) {
   // This method represents binaryTimeStampValue as ByteBuffer, where 
timestamp is stored as sum of
   // julian day number (32-bit) and nanos of day (64-bit)
   NanoTime nt = NanoTime.fromBinary(binaryTimeStampValue);
   int julianDay = nt.getJulianDay();
   long nanosOfDay = nt.getTimeOfDayNanos();
-  return (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * 
DateTimeConstants.MILLIS_PER_DAY
+  long dateTime = (julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH) * 
DateTimeConstants.MILLIS_PER_DAY
   + nanosOfDay / NANOS_PER_MILLISECOND;
+  if (retainLocalTimezone) {
+return new org.joda.time.DateTime(dateTime, 
org.joda.time.chrono.JulianChronology.getInstance())
+
.withZoneRetainFields(org.joda.time.DateTimeZone.UTC).getMillis();
--- End diff --

Trying to understand this: Why are you calling 
.withZoneRetainFields(org.joda.time.DateTimeZone.UTC) if retainLocalTimezone is 
true?
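For readers following along, a small standalone Joda-Time illustration of what
withZoneRetainFields does (the zone and date here are arbitrary and unrelated to
the Drill code under review):

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class RetainFieldsDemo {
  public static void main(String[] args) {
    DateTime local = new DateTime(2017, 1, 26, 12, 0,
        DateTimeZone.forID("America/Los_Angeles"));
    // Keeps the wall-clock fields (2017-01-26T12:00) but reinterprets them as UTC,
    // so the underlying instant shifts by the zone offset (8 hours here).
    DateTime reinterpreted = local.withZoneRetainFields(DateTimeZone.UTC);

    System.out.println(local.getMillis() - reinterpreted.getMillis());  // 28800000
  }
}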




[GitHub] drill issue #710: DRILL-5126: Provide simplified, unified "cluster fixture" ...

2017-01-26 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/710
  
Rebased on master, resolved conflict and squashed commits.




[jira] [Created] (DRILL-5225) Needs better error message for table function when having incorrect table path

2017-01-26 Thread Krystal (JIRA)
Krystal created DRILL-5225:
--

 Summary: Needs better error message for table function when having 
incorrect table path
 Key: DRILL-5225
 URL: https://issues.apache.org/jira/browse/DRILL-5225
 Project: Apache Drill
  Issue Type: Bug
  Components: Functions - Drill
Affects Versions: 1.9.0, 1.10.0
Reporter: Krystal


When the schema is missing from the table path, or the full path to the table is 
not given, the error message displayed is very misleading. For example, the query 
below runs successfully with the correct table path:

select columns[0],columns[1],columns[2] from 
table(`dfs.drillTestDir`.`/table_function/header.csv`(type=>'text',lineDelimiter=>'\r\n',fieldDelimiter=>',',skipFirstLine=>true));
+---------+---------+---------+
| EXPR$0  | EXPR$1  | EXPR$2  |
+---------+---------+---------+
| 1       | aaa     | bbb     |
| 2       | ccc     | ddd     |
| 3       | eee     | null    |
| 4       | fff     | ggg     |
+---------+---------+---------+

However, if part of the path is left out, for example the schema, then a very 
misleading error is thrown:

SQL>select columns[0] from table(`table_function/cr_lf.csv`(type=>'text', 
lineDelimiter=>'\r\n'))
1: SQLPrepare = [MapR][Drill] (1040) Drill failed to execute the query: select 
columns[0] from table(`table_function/cr_lf.csv`(type=>'text', 
lineDelimiter=>'\r\n'))
[30027]Query execution error. Details:[
SYSTEM ERROR: SqlValidatorException: No match found for function signature 
table_function/cr_lf.csv(type => , lineDelimiter => )

The error should indicate that the table `table_function/cr_lf.csv` was not found.





[jira] [Created] (DRILL-5224) CTTAS: fix errors connected with system path delimiters (Windows)

2017-01-26 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-5224:
---

 Summary: CTTAS: fix errors connected with system path delimiters 
(Windows)
 Key: DRILL-5224
 URL: https://issues.apache.org/jira/browse/DRILL-5224
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
 Environment: Windows 10
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva
 Fix For: 1.10.0


Problem 1:
An error occurs when attempting to create a temporary table on Windows:

{noformat}
0: jdbc:drill:zk=local> create temporary table t as select * from sys.version;
Error: SYSTEM ERROR: InvalidPathException: Illegal char <:> at index 4: 
file:///\tmp\3191db8e-279d-4ced-b0e5-32b3b477edfb
{noformat}

Root cause:
when creating the temporary directory, we merge the file system URI, the temporary 
workspace location and the session id into one path using java.nio.file.Paths.get(), 
but this method cannot resolve the path when its components use different delimiters.

Fix:
Use the org.apache.hadoop.fs.Path tools to merge the path; the path string is 
normalized during creation.
{noformat}
new Path(fs.getUri().toString(), 
temporaryWorkspace.getDefaultLocation()).suffix(sessionId);
{noformat}

Problem 2:
When a temporary table is manually dropped using the DROP TABLE command, the 
actual table is dropped but a remnant folder is left behind.

Root cause:
Before a temporary table is added to the list of temporary tables, its generated 
name is concatenated with the session id (as parent and child folders). 
java.nio.file.Paths.get() is used for the concatenation, but it preserves the 
current system delimiter. When the table is dropped, the passed table name is 
split using org.apache.hadoop.fs.Path.SEPARATOR, since it is assumed that the 
path was created with the org.apache.hadoop.fs.Path tools, where path separators 
are normalized to one format regardless of the system.

Fix:
Concatenate the session id with the generated table name using the 
org.apache.hadoop.fs.Path tools.
{noformat}
new Path(sessionId, UUID.randomUUID().toString()).toUri().getPath();
{noformat}
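A small sketch contrasting the two approaches (the session id is the one from the 
error above; hadoop-common on the classpath is assumed, and the printed forms are 
indicative only):

{noformat}
import java.nio.file.Paths;

import org.apache.hadoop.fs.Path;

public class TempPathSketch {
  public static void main(String[] args) {
    String defaultLocation = "/tmp";
    String sessionId = "3191db8e-279d-4ced-b0e5-32b3b477edfb";

    // java.nio uses the platform separator, so on Windows this prints
    // \tmp\3191db8e-... (and mixing in a URI such as file:/// makes it
    // worse, as in the error above).
    System.out.println(Paths.get(defaultLocation, sessionId));

    // org.apache.hadoop.fs.Path normalizes separators to '/' on any platform.
    Path tmpDir = new Path(new Path("file:///tmp"), sessionId);
    System.out.println(tmpDir);  // e.g. file:/tmp/3191db8e-...
  }
}
{noformat}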







[GitHub] drill pull request #730: DRILL-5223:Drill should ensure balanced workload as...

2017-01-26 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/730

DRILL-5223: Drill should ensure balanced workload assignment at node level in 
order to get better query performance.

Please see DRILL-5223 for details:
https://issues.apache.org/jira/browse/DRILL-5223


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-5223

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #730


commit a1ad4113f59e87a4885c271a53afea648bb6f9c3
Author: Padma Penumarthy 
Date:   2017-01-21T01:57:10Z

DRILL-5223:Drill should ensure balanced workload assignment at node level 
in order to get better query performance



