Re: No of files created by CTAS auto partition feature

Andries Engelbrecht Wed, 26 Aug 2015 13:27:05 -0700

What is the distinct count for these columns? IIRC TPC-H has at least 5 years of 
data irrespective of SF, so you are requesting a lot of partitions. 76K sounds 
about right for 5 years of TPC-H shipdate and correlating receipt date data, but 
your query counts distinct partition values, not the actual files.


Try to partition just on the l_shipdate column first.

—Andries
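
A minimal sketch of that suggestion, reusing the path and session style from the original query quoted below. The table name lineitem_by_shipdate is hypothetical, and the sketch assumes the current workspace is writable, as it must have been for the original CTAS. Enabling store.partition.hash_distribute (the option Steven mentions) is optional; with it off, each writer fragment can emit its own file per partition, so with 2 nodes and planner.width.max_per_node = 4 the 75780 partitions could presumably turn into as many as 75780 x 8 = 606240 files.

```sql
-- How many partitions would l_shipdate alone produce?
select count(distinct l_shipdate)
from dfs.`/drill/testdata/tpch100/lineitem`;

-- Optional: route rows with the same partition key to the same writer fragment,
-- so each partition is not written once per fragment.
alter session set `store.partition.hash_distribute` = true;

-- Partition on the single column only (table name is hypothetical).
create table lineitem_by_shipdate partition by (l_shipdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;
```

Comparing the resulting file count against the distinct count from the first query should show whether the extra files come from the partition cardinality or from the number of writer fragments.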


> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
> 
> Hi,
> 
> Is it possible that the combined values of (l_shipdate,
> l_receiptdate) have a very high cardinality?
> I would think you are creating partition files for a small subset of the
> data.
> 
> Please keep in mind that I know nothing about TPCH SF100 and only a little
> about Drill :).
> 
> Regards,
> -Stefan
> 
> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
> 
>> It would be helpful if you could figure out what the file count is. But
>> here are some thoughts:
>> 
>> What is the value of the option:
>> store.partition.hash_distribute
>> 
>> If it is false, which it is by default, then every fragment will
>> potentially have data in every partition. In this case, that could increase
>> the number of files by a factor of 8.
>> 
>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>> 
>>> Drillers,
>>> 
>>> I executed the below query on TPCH SF100 with drill and it took ~2hrs to
>>> complete on a 2 node cluster.
>>> 
>>> alter session set `planner.width.max_per_node` = 4;
>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>>> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
>>> from dfs.`/drill/testdata/tpch100/lineitem`;
>>> 
>>> The query below returned 75780, so I expected drill to create the same
>>> number of files, or maybe a few more. But drill created so many files that a
>>> "hadoop fs -count" command failed with a "GC overhead limit exceeded". (I
>>> did not change the default parquet block size)
>>> 
>>> select count(*) from (select l_shipdate, l_receiptdate from
>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
>>> sub;
>>> +---------+
>>> | EXPR$0  |
>>> +---------+
>>> | 75780   |
>>> +---------+
>>> 
>>> 
>>> Any thoughts on why drill is creating so many files?
>>> 
>>> - Rahul
>>> 
>>
