Looks like Drill is doing the partitioning as requested then. May not be 
optimal though.

Is there a reason why you want to subpartition this much? You may be better off 
just partitioning by l_shipdate (not shipmate, autocorrect got me there). Or 
use columns with much lower cardinality to test subpartitioning.
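
For example, a minimal single-column version of your CTAS (the target table 
name here is just a placeholder) would look like this:

create table lineitem_by_shipdate partition by (l_shipdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

l_shipdate alone should only have a couple of thousand distinct values across 
the TPC-H date range, so the partition count stays manageable.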

—Andries


> On Aug 26, 2015, at 3:05 PM, rahul challapalli <[email protected]> wrote:
> 
> Steven,
> 
> You were right. The count is 606240 which is 8*75780.
> 
> 
> Stefan & Andries,
> 
> Below is the distinct count or cardinality
> 
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem`
> group by l_shipdate, l_receiptdate) sub;
> +---------+
> | EXPR$0  |
> +---------+
> | 75780   |
> +---------+
> 
> - Rahul
> 
> 
> 
> 
> 
> On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:
> 
>> What is the distinct count for these columns? IIRC TPC-H has at least 5
>> years of data irrespective of SF, so you are requesting a lot of
>> partitions. 76K sounds about right for 5 years of TPC-H shipmate and
>> correlated receipt date data; note that your query counts distinct value
>> combinations, not the actual files.
>> 
>> Try to partition just on the shipmate column first.
>> 
>> —Andries
>> 
>> 
>>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Is it possible that the combined values of (l_shipdate,
>>> l_receiptdate) have a very high cardinality?
>>> If so, you would be creating a separate partition file for each small
>>> subset of the data.
>>> 
>>> Please keep in mind that I know nothing about TPCH SF100 and only a
>>> little about Drill :).
>>> 
>>> Regards,
>>> -Stefan
>>> 
>>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
>>> 
>>>> It would be helpful if you could figure out what the file count is. But
>>>> here are some thoughts:
>>>> 
>>>> What is the value of the option:
>>>> store.partition.hash_distribute
>>>> 
>>>> If it is false, which it is by default, then every fragment will
>>>> potentially have data in every partition. In this case, that could
>>>> increase the number of files by a factor of 8 (one file per partition
>>>> per writer fragment).
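>>>> 
>>>> As a sketch, you can check the current value in sys.options and enable
>>>> hash distribution before the CTAS (this adds a hash exchange on the
>>>> partition keys, so each partition is written by a single fragment, at
>>>> the cost of redistributing the data):
>>>> 
>>>> select * from sys.options where name = 'store.partition.hash_distribute';
>>>> alter session set `store.partition.hash_distribute` = true;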
>>>> 
>>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>>>> 
>>>>> Drillers,
>>>>> 
>>>>> I executed the below query on TPCH SF100 with drill and it took ~2hrs
>>>>> to complete on a 2 node cluster.
>>>>> 
>>>>> alter session set `planner.width.max_per_node` = 4;
>>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as
>>>>> select * from dfs.`/drill/testdata/tpch100/lineitem`;
>>>>> 
>>>>> The below query returned 75780, so I expected drill to create the same
>>>>> number of files, or maybe a few more. But drill created so many files
>>>>> that a "hadoop fs -count" command failed with a "GC overhead limit
>>>>> exceeded". (I did not change the default parquet block size.)
>>>>> 
>>>>> select count(*) from (select l_shipdate, l_receiptdate from
>>>>> dfs.`/drill/testdata/tpch100/lineitem`
>>>>> group by l_shipdate, l_receiptdate) sub;
>>>>> +---------+
>>>>> | EXPR$0  |
>>>>> +---------+
>>>>> | 75780   |
>>>>> +---------+
>>>>> 
>>>>> 
>>>>> Any thoughts on why drill is creating so many files?
>>>>> 
>>>>> - Rahul
>>>>> 
>>>> 
>> 
>> 
