[GitHub] drill pull request #826: DRILL-5379: Set Hdfs Block Size based on Parquet Bl...

2017-05-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/826#discussion_r116895850
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -380,14 +384,21 @@ public void endRecord() throws IOException {
 
   // since ParquetFileWriter will overwrite empty output file (append 
is not supported)
   // we need to re-apply file permission
-  parquetFileWriter = new ParquetFileWriter(conf, schema, path, 
ParquetFileWriter.Mode.OVERWRITE);
+  if (useConfiguredBlockSize) {
--- End diff --

What we are doing is creating the Parquet file as a single block without 
changing the file system's default block size. For example, with the default 
Parquet block size of 512MB and a file system block size of 128MB, we create 
a single file spanning 4 filesystem blocks, which can get distributed across 
different nodes; not good for performance. If we change the Parquet block 
size to 128MB (to match the file system block size), then for the same amount 
of data we end up creating 4 files of one block each, which is not good 
either.

The JIRA asks for a single HDFS block per Parquet file, even when the file is 
larger than the file system block size, without changing the file system 
block size. The reporters have their file system block size configured as 
128MB. Lowering the Parquet block size (from the default of 512MB) to match 
the file system block size creates too many files for them, and for other 
reasons they are not able to change the file system block size.
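
For illustration, here is a minimal sketch of giving a single file its own 
HDFS block size without touching the filesystem default (names and sizes are 
illustrative, not this PR's code):

import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SingleBlockFileSketch {
  // HDFS supports a per-file block size, so a writer can pass the Parquet
  // block size (e.g. 512MB) and keep the whole row group in one HDFS block
  // even though the filesystem default is 128MB.
  public static FSDataOutputStream create(Configuration conf, Path path,
      long parquetBlockSize) throws java.io.IOException {
    FileSystem fs = path.getFileSystem(conf);
    return fs.create(path,
        FsPermission.getFileDefault(),
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE),
        conf.getInt("io.file.buffer.size", 4096),
        fs.getDefaultReplication(path),
        parquetBlockSize,   // per-file block size overrides the FS default
        null);              // no progress callback
  }
}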





Re: [DRILL HANGOUT] Topics for 5/16/2017

2017-05-16 Thread Jinfeng Ni
Meeting minutes 5/16/2017

Attendees: Aman, Jinfeng, Karthikeyan, Khurram, Kunal, Padma, Parth, Paul,
Pritesh, Vitalii, Volodymyr

Two topics discussed.
1. Schema change exception.
  Jinfeng is working on bugs related to the 'soft' SchemaChangeException (such
as DRILL-5327), where the data itself does not contain a schema change, but
Drill intermittently fails the query with either a SchemaChangeException or an
incorrect result. Initial analysis shows the problem comes from either the
scan operator or a schema-loss operator (one example is UnionAll).
  Aman, Paul, and Parth brought up the UnionVector work. UnionVector targets
'hard' schema changes, where the data itself contains a schema change.
Although it may help solve the issues in DRILL-5327, it might require quite
extensive work (as of today, enabling UnionVector did not fix the reported
problems). Also, UnionVector might pose a challenge for JDBC/ODBC clients,
which only accept regular SQL types.

2. Memory fragmentation
  Paul is working on memory fragmentation and size-aware batch / value
vector work. Drill's allocator keeps a list of 16MB chunks; for allocations
beyond 16MB it asks for system memory through Netty. With each batch holding
64K rows, a column wider than 256 bytes needs a value vector larger than 16MB
(65,536 rows x 256 bytes = 16MB), so Drill may hit an OOM even when plenty of
free 16MB chunks are available.
  Paul's proposal is to impose a size constraint on value vectors by
providing a new set of setSafe() methods. Work is in progress and a PR will
be submitted shortly.
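
A rough illustration of the idea (a sketch only, not the actual patch):

// Behaves like setSafe() while the vector fits in one 16MB Netty slab;
// returns false instead of growing past the limit, so the caller can
// close the current batch and start a new one.
public class SizeAwareIntSetter {
  private static final int MAX_BUFFER_SIZE = 16 * 1024 * 1024; // one Netty slab
  private static final int VALUE_WIDTH = 4;                    // bytes per int

  public interface IntVectorStub {     // stand-in for the generated mutator
    void setSafe(int index, int value);
  }

  public boolean setScalar(IntVectorStub vector, int index, int value) {
    long neededBytes = (long) (index + 1) * VALUE_WIDTH;
    if (neededBytes > MAX_BUFFER_SIZE) {
      return false;                    // would exceed the 16MB vector limit
    }
    vector.setSafe(index, value);      // normal path below the cap
    return true;
  }
}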



On Tue, May 16, 2017 at 10:01 AM, Jinfeng Ni  wrote:

> We will start hangout shortly.
>
> https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc
>
>
> On Mon, May 15, 2017 at 9:53 PM, Jinfeng Ni  wrote:
> > My feeling is that either a temp table or putting the 100k values into a
> > separate parquet file makes more sense than putting 100k values in an
> > IN list. Although for such a long IN list the Drill planner will convert
> > it into a JOIN (the same as the temp table / parquet table solutions),
> > there is a big difference in what the query plan looks like.
> > An IN list with 100k values has to be serialized / de-serialized
> > before the plan can be executed. I guess that would create a huge
> > serialized plan, which is not the best solution one could use.
> >
> > Also, putting 100k values in an IN list may not be very typical. RDBMSs
> > probably impose certain limits on the number of values in an IN list. For
> > instance, Oracle sets the limit to 1000 [1].
> >
> > 1. http://docs.oracle.com/database/122/SQLRF/Expression-Lists.htm#SQLRF52099
> >
> > On Mon, May 15, 2017 at 7:11 PM,   wrote:
> >> Hi,
> >>
> >> I am stuck on a problem where an instance of Apache Drill stops working.
> My topic of discussion will be:
> >>
> >> In my scenario, I have 25 parquet files with around 400K-500K records
> and around 10 columns. My select query has an IN clause on one column with
> around 100K values. When I run these queries in parallel, the Apache Drill
> instance hangs and then shuts down. How should I design the select queries
> so that Drill can support them?
> >> The solutions we are trying are:
> >> a - Create a temp table of the 100K values and use it in an inner
> query. But as far as I know, we can't create a temp table at run time from
> Java code; it needs a data source, either parquet or something else, to
> create the temp table.
> >> b - Create a separate parquet file of all 100K values and use an inner
> query instead of listing all the values directly in the main query.
> >>
> >> Is there a better way to work around this problem, or can we solve it
> with simple configuration changes?
> >>
> >> Regards,
> >> Jasbir Singh
> >>
> >>
> >> -Original Message-
> >> From: Jinfeng Ni [mailto:j...@apache.org]
> >> Sent: Tuesday, May 16, 2017 2:29 AM
> >> To: dev ; user 
> >> Subject: [DRILL HANGOUT] Topics for 5/16/2017
> >>
> >> Hi All,
> >>
> >> Our bi-weekly Drill hangout is tomorrow (5/16/2017, 10AM PDT). Please
> respond with suggestions of topics for discussion. We will also collect
> topics at the beginning of the hangout tomorrow.
> >>
> >> Thanks,
> >>
> >> Jinfeng


[GitHub] drill pull request #826: DRILL-5379: Set Hdfs Block Size based on Parquet Bl...

2017-05-16 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/826#discussion_r116886162
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
 ---
@@ -380,14 +384,21 @@ public void endRecord() throws IOException {
 
   // since ParquetFileWriter will overwrite empty output file (append 
is not supported)
   // we need to re-apply file permission
-  parquetFileWriter = new ParquetFileWriter(conf, schema, path, 
ParquetFileWriter.Mode.OVERWRITE);
+  if (useConfiguredBlockSize) {
--- End diff --

The API `ParquetFileWriter(conf, schema, path, 
ParquetFileWriter.Mode.OVERWRITE)` causes the Parquet file writer to set the 
file block size to the greater of the configured file system block size and 
128 MB (the ParquetWriter's row group size). 
Drill's Parquet writer uses the block size specified in Drill's options to 
start a new Parquet row group when the limit is reached (see 
`ParquetRecordWriter.checkBlockSizeReached()`). If you set Drill's Parquet 
block size to the larger of the configured file system block size and 128 MB, 
the row group will match the HDFS block size, which is what the current code 
does.
Isn't this what the original JIRA wanted?
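
In simplified form, the rollover check referenced above works roughly like 
this (a paraphrase, not the exact source):

// Simplified paraphrase of ParquetRecordWriter.checkBlockSizeReached().
abstract class RowGroupRolloverSketch {
  protected long blockSize;           // Drill's parquet block size option

  void checkBlockSizeReached() throws java.io.IOException {
    long memSize = bufferedSize();    // bytes buffered in the row group so far
    if (memSize > blockSize) {
      flush();                        // close the current row group / file
      newSchema();                    // start the next one
    }
  }

  abstract long bufferedSize();
  abstract void flush() throws java.io.IOException;
  abstract void newSchema() throws java.io.IOException;
}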





[GitHub] drill pull request #840: DRILL-5517: Size-aware set methods in value vectors

2017-05-16 Thread paul-rogers
GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/840

DRILL-5517: Size-aware set methods in value vectors

Please see DRILL-5517 for an explanation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/paul-rogers/drill DRILL-5517

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/840.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #840


commit 6d9d8f5e93c7ad0fd45242a3aba43334c9385239
Author: Paul Rogers 
Date:   2017-05-16T20:20:32Z

DRILL-5517: Size-aware set methods in value vectors

Please see DRILL-5517 for an explanation.






HBase scan : expected results handling null ?

2017-05-16 Thread Jinfeng Ni
Hi all,

I was looking at the HBase scan operator and noticed one interesting behavior
regarding how the Drill/HBase reader handles a column with a null value.

1. Data preparation (HBase shell). A table with 3 rows: one row has the
'addr.city' column only, while the other two rows have the 'order.id' column
only.

create 'customer', {NAME=>'addr'}, {NAME=>'order'}
put 'customer', 'jsmith', 'addr:city', 'sanjose'
put 'customer', 'tom', 'order:id', '2'
put 'customer', 'frank', 'order:id', '3'

scan 'customer'
ROW   COLUMN+CELL
 frankcolumn=order:id,
timestamp=1494969396032, value=3
 jsmith   column=addr:city,
timestamp=1494969355484, value=sanjose
 tom  column=order:id,
timestamp=1494969387941, value=2
3 row(s) in 0.0170 seconds

2. Query in Drill
Q1. Check the row count in Drill. The result looks good.
select count(*)  from hbase.customer t;
+-+
| EXPR$0  |
+-+
| 3   |
+-+

Q2. Get column 'addr.city' only; it returns just 1 row.
select convert_from(t.addr.city, 'UTF8') as city  from hbase.customer t;
+--+
|   city   |
+--+
| sanjose  |
+--+

Q3. Get columns 'addr.city' and 'order.id'; it returns 3 rows:
select convert_from(t.addr.city, 'UTF8') as city,
convert_from(t.`order`.id, 'UTF8') as id  from hbase.customer t;
+--+---+
|   city   |  id   |
+--+---+
| null | 3 |
| sanjose  | null  |
| null | 2 |
+--+---+

Comparing Q2 and Q3, it looks like the Drill/HBase scan skips rows where all
of the requested columns are null.

Is this the expected behavior? I understand the behavior comes from the HBase
Scan specification, but it is a bit hard to understand initially from SQL's
perspective.

If this is the expected behavior, would it make sense to document it? (I
searched the Drill docs and did not find anything related to this behavior.)

Another point: if we disable project push-down in the query planner, then Q2
returns 3 rows. In theory, project push-down should only impact query
performance, not query results.
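
One possible (untested) workaround sketch: also project the row key, which
every HBase row has, so the scan cannot drop rows whose other requested
columns are absent:

-- Hypothetical: row_key always exists, so this should return all 3 rows,
-- with a null city for the rows that lack the addr:city cell.
select convert_from(t.row_key, 'UTF8') as rk,
       convert_from(t.addr.city, 'UTF8') as city
from hbase.customer t;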

Any thoughts?

Thanks,

Jinfeng


RE: [DRILL HANGOUT] Topics for 5/16/2017

2017-05-16 Thread jasbir.sing
Hi,

I am stuck on a problem where an instance of Apache Drill stops working. My 
topic of discussion will be:

In my scenario, I have 25 parquet files with around 400K-500K records and 
around 10 columns. My select query has an IN clause on one column with around 
100K values. When I run these queries in parallel, the Apache Drill instance 
hangs and then shuts down. How should I design the select queries so that 
Drill can support them?
The solutions we are trying are (see the sketch below):
a - Create a temp table of the 100K values and use it in an inner query. But 
as far as I know, we can't create a temp table at run time from Java code; it 
needs a data source, either parquet or something else, to create the temp 
table.
b - Create a separate parquet file of all 100K values and use an inner query 
instead of listing all the values directly in the main query.

Is there a better way to work around this problem, or can we solve it with 
simple configuration changes?
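
To make option (b) concrete, something along these lines could work (paths 
and column names are made up):

-- Materialize the 100K values once as a small parquet table ...
create table dfs.tmp.`in_values` as
select id from dfs.`/staging/values.json`;

-- ... then join against it instead of using a 100K-entry IN list.
select t.*
from dfs.`/data/big_table` t
inner join dfs.tmp.`in_values` v on t.id = v.id;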

Regards,
Jasbir Singh


-Original Message-
From: Jinfeng Ni [mailto:j...@apache.org]
Sent: Tuesday, May 16, 2017 2:29 AM
To: dev ; user 
Subject: [DRILL HANGOUT] Topics for 5/16/2017

Hi All,

Our bi-weekly Drill hangout is tomorrow (5/16/2017, 10AM PDT). Please respond 
with suggestions of topics for discussion. We will also collect topics at the 
beginning of the hangout tomorrow.

Thanks,

Jinfeng





[jira] [Created] (DRILL-5519) Sort fails to spill and results in an OOM

2017-05-16 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-5519:


 Summary: Sort fails to spill and results in an OOM
 Key: DRILL-5519
 URL: https://issues.apache.org/jira/browse/DRILL-5519
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.10.0
Reporter: Rahul Challapalli
Assignee: Paul Rogers


Setup :
{code}
git.commit.id.abbrev=1e0a14c
DRILL_MAX_DIRECT_MEMORY="32G"
DRILL_MAX_HEAP="4G"
No of nodes in the drill cluster : 1
{code}

The query below fails with an OOM in the "in-memory sort" code, which suggests 
that the logic that decides when to spill is flawed.
{code}
0: jdbc:drill:zk=10.10.100.190:5181> ALTER SESSION SET 
`exec.sort.disable_managed` = false;
+---+-+
|  ok   |   summary   |
+---+-+
| true  | exec.sort.disable_managed updated.  |
+---+-+
1 row selected (1.022 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> alter session set 
`planner.memory.max_query_memory_per_node` = 334288000;
+---++
|  ok   |  summary   |
+---++
| true  | planner.memory.max_query_memory_per_node updated.  |
+---++
1 row selected (0.369 seconds)
0: jdbc:drill:zk=10.10.100.190:5181> select count(*) from (select * from 
(select flatten(flatten(lst_lst)) num from 
dfs.`/drill/testdata/resource-manager/nested-large.json`) d order by d.num) d1 
where d1.num < -1;
Error: RESOURCE ERROR: One or more nodes ran out of memory while executing the 
query.

Unable to allocate buffer of size 4194304 (rounded from 320) due to memory 
limit. Current allocation: 16015936
Fragment 2:2

[Error Id: 4d9cc59a-b5d1-4ca9-9b26-69d9438f0bee on qa-node190.qa.lab:31010] 
(state=,code=0)
{code}

Below is the exception from the logs
{code}
2017-05-16 13:46:33,233 [26e49afc-cf45-637b-acc1-a70fee7fe7e2:frag:2:2] INFO  
o.a.d.e.w.fragment.FragmentExecutor - User Error Occurred: One or more nodes 
ran out of memory while executing the query. (Unable to allocate buffer of size 
4194304 (rounded from 320) due to memory limit. Current allocation: 
16015936)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more 
nodes ran out of memory while executing the query.

Unable to allocate buffer of size 4194304 (rounded from 320) due to memory 
limit. Current allocation: 16015936

[Error Id: 4d9cc59a-b5d1-4ca9-9b26-69d9438f0bee ]
at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544)
 ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:244)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
[na:1.7.0_111]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
[na:1.7.0_111]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: Unable to 
allocate buffer of size 4194304 (rounded from 320) due to memory limit. 
Current allocation: 16015936
at 
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:220) 
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:195) 
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.test.generated.MSorterGen44.setup(MSortTemplate.java:91) 
~[na:na]
at 
org.apache.drill.exec.physical.impl.xsort.managed.MergeSort.merge(MergeSort.java:110)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.sortInMemory(ExternalSortBatch.java:1159)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:687)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at 
org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next(IteratorValidatorBatchIterator.java:215)
 

[jira] [Created] (DRILL-5518) Roll-up of a number of test framework enhancements

2017-05-16 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5518:
--

 Summary: Roll-up of a number of test framework enhancements
 Key: DRILL-5518
 URL: https://issues.apache.org/jira/browse/DRILL-5518
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Priority: Minor
 Fix For: 1.11.0


Recent development work identified a number of minor enhancements to the 
"sub-operator" unit tests:

* Create a {{SubOperatorTest}} base class to do routine setup and shutdown 
(sketched below).
* Additional methods to simplify creating complex schemas with field widths.
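
A hedged sketch of the {{SubOperatorTest}} idea (the {{OperatorFixture}} API 
shown is illustrative):

{code}
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class SubOperatorTest {
  protected static OperatorFixture fixture;       // hypothetical shared test fixture

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    fixture = OperatorFixture.standardFixture();  // routine setup
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    fixture.close();                              // routine shutdown
  }
}
{code}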






[jira] [Created] (DRILL-5517) Provide size-aware set operations in value vectors

2017-05-16 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-5517:
--

 Summary: Provide size-aware set operations in value vectors
 Key: DRILL-5517
 URL: https://issues.apache.org/jira/browse/DRILL-5517
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.11.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.11.0


DRILL-5211 describes a memory fragmentation issue in Drill. The resolution is 
to limit vector sizes to 16 MB (the size of Netty memory allocation "slabs"). 
The effort starts by providing "size-aware" set operations in value vectors 
that:

* Operate as {{setSafe()}} while vectors are below 16 MB.
* Return false if setting the value (and growing the vector) would exceed the 
vector limit.

The methods in value vectors then become the foundation on which we can 
construct size-aware record batch "writers."
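
A hedged sketch of how a batch writer might consume such a method (all names 
here are illustrative):

{code}
// React to a false return by finishing the in-progress batch instead of
// letting the vector grow past the 16 MB limit.
abstract class SizeAwareWriterSketch {
  private int rowIndex;

  void writeValue(int value) {
    if (!setScalar(rowIndex, value)) {    // vector would exceed 16 MB
      harvestBatch();                     // ship the full batch downstream
      startNewBatch();
      rowIndex = 0;
      setScalar(rowIndex, value);         // always fits in a fresh vector
    }
    rowIndex++;
  }

  abstract boolean setScalar(int index, int value);
  abstract void harvestBatch();
  abstract void startNewBatch();
}
{code}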





[jira] [Resolved] (DRILL-5204) Extend mock data source to use table specs from SQL

2017-05-16 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5204.

Resolution: Fixed

Not sure why this was not closed earlier. The feature has been checked into 
master.

Set up the mock data source. Then:

{code}
SELECT id_i, name_s50 FROM `mock`.`customers_1M`
{code}

The column and table names are fictitious; the important part is the suffix. 
For columns, "_i" means an integer, "_sx" means a string of length x, and so 
on. For tables, "x" means x rows, "xK" means x thousand rows, and "xM" means x 
million rows.

See the {{ExampleTest}} class for details.
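
A couple more queries following the same convention (table and column names 
are again made up):

{code}
SELECT id_i, name_s10 FROM `mock`.`employees_10K`;  -- 10,000 rows
SELECT id_i FROM `mock`.`orders_500`;               -- 500 rows
{code}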

> Extend mock data source to use table specs from SQL
> ---
>
> Key: DRILL-5204
> URL: https://issues.apache.org/jira/browse/DRILL-5204
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Tools, Build & Test
>Affects Versions: 1.9.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> DRILL-5152 provided a simple way to generate mock data from SQL:
> {code}
> SELECT colName_type FROM `mock`.`tableName_size` ...
> {code}
> The fix in that release encoded types and record counts directly in the SQL, 
> which is very handy for many simple cases.
> The original mock data source has another feature: it lets you create 
> multiple mock blocks of data that can be read in multiple threads. Later 
> additions made it easy to repeat a column definition (to generate, say, a 
> table with 1000 columns), to choose the data generator class, etc. All of 
> this was available only when writing physical plans by hand and encoding the 
> definition in the sub scan for the mock data source.
> This enhancement extends the SQL feature to allow the definitions to appear 
> in a JSON file easily referenced from SQL. The JSON file must be somewhere on 
> the class path (typically in a resources directory). Then:
> {code}
> SELECT red, blue, green FROM `mock`.`foo/colors.json` ...
> {code}
> Is interpreted to mean, "the file colors.json defines a mock data source, 
> perhaps with repeated columns, perhaps with multiple fragments. From that 
> mock data source, select the three columns red, blue and green."
> With this change, tests can include quite sophisticated mock data sources, 
> simplifying debugging of plans with multiple fragments and/or more complex 
> table structures.





[GitHub] drill pull request #836: DRILL-5511: Additional UserException categories

2017-05-16 Thread paul-rogers
Github user paul-rogers closed the pull request at:

https://github.com/apache/drill/pull/836




[GitHub] drill issue #836: DRILL-5511: Additional UserException categories

2017-05-16 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/836
  
Withdrawn based on Parth's comments.




Re: [DRILL HANGOUT] Topics for 5/16/2017

2017-05-16 Thread Jinfeng Ni
We will start hangout shortly.

https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc


On Mon, May 15, 2017 at 9:53 PM, Jinfeng Ni  wrote:
> My feeling is that either a temp table or putting the 100k values into a
> separate parquet file makes more sense than putting 100k values in an
> IN list. Although for such a long IN list the Drill planner will convert
> it into a JOIN (the same as the temp table / parquet table solutions),
> there is a big difference in what the query plan looks like.
> An IN list with 100k values has to be serialized / de-serialized
> before the plan can be executed. I guess that would create a huge
> serialized plan, which is not the best solution one could use.
>
> Also, putting 100k values in an IN list may not be very typical. RDBMSs
> probably impose certain limits on the number of values in an IN list. For
> instance, Oracle sets the limit to 1000 [1].
>
> 1. http://docs.oracle.com/database/122/SQLRF/Expression-Lists.htm#SQLRF52099
>
> On Mon, May 15, 2017 at 7:11 PM,   wrote:
>> Hi,
>>
>> I am stuck on a problem where an instance of Apache Drill stops working. My 
>> topic of discussion will be:
>>
>> In my scenario, I have 25 parquet files with around 400K-500K records and 
>> around 10 columns. My select query has an IN clause on one column with 
>> around 100K values. When I run these queries in parallel, the Apache Drill 
>> instance hangs and then shuts down. How should I design the select queries 
>> so that Drill can support them?
>> The solutions we are trying are:
>> a - Create a temp table of the 100K values and use it in an inner query. 
>> But as far as I know, we can't create a temp table at run time from Java 
>> code; it needs a data source, either parquet or something else, to create 
>> the temp table.
>> b - Create a separate parquet file of all 100K values and use an inner 
>> query instead of listing all the values directly in the main query.
>>
>> Is there a better way to work around this problem, or can we solve it with 
>> simple configuration changes?
>>
>> Regards,
>> Jasbir Singh
>>
>>
>> -Original Message-
>> From: Jinfeng Ni [mailto:j...@apache.org]
>> Sent: Tuesday, May 16, 2017 2:29 AM
>> To: dev ; user 
>> Subject: [DRILL HANGOUT] Topics for 5/16/2017
>>
>> Hi All,
>>
>> Our bi-weekly Drill hangout is tomorrow (5/16/2017, 10AM PDT). Please 
>> respond with suggestions of topics for discussion. We will also collect 
>> topics at the beginning of the hangout tomorrow.
>>
>> Thanks,
>>
>> Jinfeng
>>


[GitHub] drill issue #839: DRILL-5516: Use max allowed allocated memory when defining...

2017-05-16 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/839
  
The right approach is not to simply allow HBase to use more memory. The 
right approach is to limit memory.

Fortunately, another project is underway to do just that. Let's 
collaborate. In the next week or so I'll do a PR for the framework to limit 
batch sizes in readers, along with an implementation for the "compliant" text 
readers.

Maybe you can use that framework to retrofit the HBase reader to also limit 
its batch size. Basically, we limit the length of the longest vector to 16 MB.

The present patch, using unlimited memory, has all kinds of other problems 
-- the very problems we are trying to solve, so it is not helpful to move 
forward in one area, backward in another.




[GitHub] drill pull request #839: DRILL-5516: Use max allowed allocated memory when d...

2017-05-16 Thread arina-ielchiieva
GitHub user arina-ielchiieva opened a pull request:

https://github.com/apache/drill/pull/839

DRILL-5516: Use max allowed allocated memory when defining batch size…

… for hbase record reader

Instead of using a row count (4000), we will use the max allowed allocated 
memory, which will default to 64 MB. If the first row in a batch is larger 
than the allowed default, it will still be written to the batch, but the 
batch will contain only that row.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/arina-ielchiieva/drill DRILL-5516

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/839.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #839


commit ce3f227f7f06baa5e43f8f2529036899549495aa
Author: Arina Ielchiieva 
Date:   2017-05-15T15:51:02Z

DRILL-5516: Use max allowed allocated memory when defining batch size for 
hbase record reader






[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116713916
  
--- Diff: 
logical/src/main/java/org/apache/drill/common/expression/FunctionHolderExpression.java
 ---
@@ -28,6 +28,7 @@
 public abstract class FunctionHolderExpression extends 
LogicalExpressionBase {
   public final ImmutableList args;
   public final String nameUsed;
+  private FieldReference ref;
--- End diff --

Added javadoc to explain its purpose. Thanks!




[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116713065
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/AbstractFuncHolder.java
 ---
@@ -48,4 +49,13 @@ public boolean isNested() {
   public abstract MajorType getParmMajorType(int i);
 
   public abstract int getParamCount();
+
+  /**
+   * Checks that the current object is an instance of 
DrillComplexWriterFuncHolder class.
--- End diff --

Changed the javadoc. Thanks!




[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116712517
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/expr/fn/HiveFuncHolder.java
 ---
@@ -147,10 +148,11 @@ public int getParamCount() {
* @param g
* @param inputVariables
* @param workspaceJVars
+   * @param ref
--- End diff --

Moved this javadoc to the superclass and added description to params. 
Thanks!




[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread vvysotskyi
Github user vvysotskyi commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116714192
  
--- Diff: 
logical/src/main/java/org/apache/drill/common/expression/FunctionHolderExpression.java
 ---
@@ -80,4 +81,16 @@ public String getName() {
   /** Return the underlying function implementation holder. */
   public abstract FuncHolder getHolder();
 
+  public FieldReference getReference() {
--- End diff --

Done. Thanks!




[jira] [Created] (DRILL-5516) Use max allowed allocated memory when defining batch size for hbase record reader

2017-05-16 Thread Arina Ielchiieva (JIRA)
Arina Ielchiieva created DRILL-5516:
---

 Summary: Use max allowed allocated memory when defining batch size 
for hbase record reader
 Key: DRILL-5516
 URL: https://issues.apache.org/jira/browse/DRILL-5516
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - HBase
Affects Versions: 1.10.0
Reporter: Arina Ielchiieva
Assignee: Arina Ielchiieva


If the early limit 0 optimization is enabled (alter session set 
`planner.enable_limit0_optimization` = true), then when executing limit 0 
queries Drill will return data types from the available metadata if possible.
When Drill cannot determine data types from metadata (or if the early limit 0 
optimization is disabled), Drill will read the first batch of data and 
determine the schema from it.
The HBase reader determines the max batch size using a magic number (4000 
rows), which can lead to an OOM when rows are large. The overall vector/batch 
size issue will be reconsidered in future releases; this is a temporary fix 
to avoid the OOM.

Instead of using a row count, we will use the max allowed allocated memory, 
which will default to 64 MB. If the first row in a batch is larger than the 
allowed default, it will still be written to the batch, but the batch will 
contain only that row.
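
A rough sketch of the described policy (names are illustrative, not the 
actual patch):

{code}
// Stop filling the batch once allocated vector memory reaches the cap, but
// always keep at least one row so an oversized first row still produces a
// single-row batch.
abstract class MemoryCappedBatchSketch {
  static final long MAX_BATCH_MEM = 64L * 1024 * 1024;  // 64 MB default

  int fillBatch() {
    int rowCount = 0;
    while (hasNextRow()) {
      writeRow(rowCount++);
      if (allocatedBytes() >= MAX_BATCH_MEM) {
        break;           // batch is full (possibly with just this one row)
      }
    }
    return rowCount;
  }

  abstract boolean hasNextRow();
  abstract void writeRow(int index);
  abstract long allocatedBytes();
}
{code}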





Re: Running cartesian joins on Drill

2017-05-16 Thread Muhammad Gelbana
You are correct, Aman. Here is the JIRA issue


This thread has been very helpful. Thank you all.

*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana

On Fri, May 12, 2017 at 6:50 AM, Aman Sinha  wrote:

> Muhammad,
> The join condition ‘a = b or (a is null && b is null)’ works.
> Internally, this is converted to ‘a is not distinct from b’, which is
> processed by Drill.
> For some reason, if the second form is supplied directly in the user
> query, it is not handled and ends up being treated as a Cartesian join.
> Drill leverages Calcite for this (you can see CALCITE-1200 for some
> background).
> Can you file a JIRA for this?
>
> -Aman
>
> From: "Aman Sinha (asi...@mapr.com)" 
> Date: Thursday, May 11, 2017 at 4:29 PM
> To: dev , user 
> Cc: Shadi Khalifa 
> Subject: Re: Running cartesian joins on Drill
>
>
> I think Muhammad may be trying to run his original query with IS NOT
> DISTINCT FROM.   That discussion got side-tracked into Cartesian joins
> because his query was not getting planned and the error was about Cartesian
> join.
>
> Muhammad, can you try the equivalent version below? You mentioned the
> rewrite, but did you try the rewritten version?
>
>
>
> SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
> LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
> `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
> `t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
> IS NULL) )
>
>
>
> On 5/11/17, 3:23 PM, "Zelaine Fong"  wrote:
>
>
>
> I’m not sure why it isn’t working for you.  Using Drill 1.10, here’s
> my output:
>
>
>
> 0: jdbc:drill:zk=local> alter session set 
> `planner.enable_nljoin_for_scalar_only`
> = false;
>
> +---+-+
>
> |  ok   | summary |
>
> +---+-+
>
> | true  | planner.enable_nljoin_for_scalar_only updated.  |
>
> +---+-+
>
> 1 row selected (0.137 seconds)
>
> 0: jdbc:drill:zk=local> explain plan for select * from
> dfs.`/Users/zfong/foo.csv` t1, dfs.`/Users/zfong/foo.csv` t2;
>
> +--+--+
>
> | text | json |
>
> +--+--+
>
> | 00-00Screen
>
> 00-01  ProjectAllowDup(*=[$0], *0=[$1])
>
> 00-02NestedLoopJoin(condition=[true], joinType=[inner])
>
> 00-04  Project(T2¦¦*=[$0])
>
> 00-06Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
> 00-03  Project(T3¦¦*=[$0])
>
> 00-05Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/zfong/foo.csv, numFiles=1, columns=[`*`],
> files=[file:/Users/zfong/foo.csv]]])
>
>
>
> -- Zelaine
>
>
>
> On 5/11/17, 3:17 PM, "Muhammad Gelbana"  wrote:
>
>
>
> But the query I provided failed to be planned because it's a Cartesian
> join, although I've set the option you mentioned to false. Is there a
> reason why Drill's rules wouldn't physically implement the logical join
> in my query as a nested loop join?
>
>
>
> *-*
>
> *Muhammad Gelbana*
>
> http://www.linkedin.com/in/mgelbana
>
>
>
> On Thu, May 11, 2017 at 5:05 PM, Zelaine Fong  wrote:
>
> > Provided `planner.enable_nljoin_for_scalar_only` is set to false, even
> > without an explicit join condition, the query should use the Cartesian
> > join/nested loop join.
> >
> > -- Zelaine
> >
> > On 5/11/17, 4:20 AM, "Anup Tiwari"  wrote:
> >
> > Hi,
> >
> > I have one question here: if we have to use a Cartesian join in Drill,
> > do we have to follow a workaround like Shadi mentioned, i.e. adding a
> > dummy column on the fly that has the value 1 in both tables and then
> > joining on that column, so that every row of the first table matches
> > every row of the second table, hence a Cartesian product?
> > OR
> > If we just don't specify a join condition, like
> > select a.*, b.* from tt1 as a, tt2 b;
> > then will Drill internally treat this query as a Cartesian join?
> >
> > Regards,
> > *Anup Tiwari*
>
> >
>
> > On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <
> 

[jira] [Created] (DRILL-5515) "IS NOT DISTINCT FROM" and its equivalent form aren't handled the same way

2017-05-16 Thread Muhammad Gelbana (JIRA)
Muhammad Gelbana created DRILL-5515:
---

 Summary: "IS NOT DISTINCT FROM" and its equivalent form aren't 
handled the same way
 Key: DRILL-5515
 URL: https://issues.apache.org/jira/browse/DRILL-5515
 Project: Apache Drill
  Issue Type: Bug
  Components: Query Planning & Optimization
Affects Versions: 1.10.0, 1.9.0
Reporter: Muhammad Gelbana


The following query fails to execute:
{code:sql}SELECT * FROM (SELECT `UserID` FROM `dfs`.`path_to_parquet` tc) `t0` 
INNER JOIN (SELECT `UserID` FROM `dfs`.`path_to_parquet` tc) `t1` ON 
(`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`){code}
and produces the following error message
{noformat}org.apache.drill.common.exceptions.UserRemoteException: 
UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to 
either a cartesian join or an inequality join [Error Id: 
0bd41e06-ccd7-45d6-a038-3359bf5a4a7f on mgelbana-incorta:31010]{noformat}
While the query's equivalent form runs fine:
{code:sql}SELECT * FROM (SELECT `UserID` FROM `dfs`.`path_to_parquet` tc) `t0` 
INNER JOIN (SELECT `UserID` FROM `dfs`.`path_to_parquet` tc) `t1` ON 
(`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL AND `t1`.`UserID` IS 
NULL)){code}





[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116682109
  
--- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/expr/fn/HiveFuncHolder.java
 ---
@@ -147,10 +148,11 @@ public int getParamCount() {
* @param g
* @param inputVariables
* @param workspaceJVars
+   * @param ref
--- End diff --

Please add description to params to avoid warnings in IDE.




[GitHub] drill pull request #835: DRILL-5399: Fix race condition in DrillComplexWrite...

2017-05-16 Thread arina-ielchiieva
Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/835#discussion_r116682290
  
--- Diff: 
logical/src/main/java/org/apache/drill/common/expression/FunctionHolderExpression.java
 ---
@@ -80,4 +81,16 @@ public String getName() {
   /** Return the underlying function implementation holder. */
   public abstract FuncHolder getHolder();
 
+  public FieldReference getReference() {
--- End diff --

Please rename method to `getFieldReference` and variable to 
`fieldReference`.

