It would be helpful if you could figure out what the actual file count is. But here are some thoughts:

What is the value of the option store.partition.hash_distribute? If it is false, which it is by default, then every fragment will potentially have data in every partition and will write its own file for each partition value it sees. In this case, with planner.width.max_per_node = 4 on a 2-node cluster (8 writer fragments), that could increase the number of files by a factor of 8.
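You can check the current value from sys.options and flip it before re-running the CTAS. A minimal sketch (untested against your cluster; note that hash distribution routes each partition value to a single writer, so whether it helps depends on how skewed the partition keys are):

select * from sys.options where name = 'store.partition.hash_distribute';

alter session set `store.partition.hash_distribute` = true;
create table lineitem partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

With the option enabled you should see roughly one file per distinct (l_shipdate, l_receiptdate) pair, plus any size-based splits, instead of up to one per pair per fragment.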
On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
[email protected]> wrote:

> Drillers,
>
> I executed the below query on TPCH SF100 with Drill and it took ~2 hrs to
> complete on a 2-node cluster.
>
> alter session set `planner.width.max_per_node` = 4;
> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
> from dfs.`/drill/testdata/tpch100/lineitem`;
>
> The below query returned 75780, so I expected Drill to create the same
> number of files, or maybe a few more. But Drill created so many files that
> a "hadoop fs -count" command failed with "GC overhead limit exceeded". (I
> did not change the default Parquet block size.)
>
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
> sub;
> +---------+
> | EXPR$0  |
> +---------+
> | 75780   |
> +---------+
>
> Any thoughts on why Drill is creating so many files?
>
> - Rahul
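One way to sanity-check this theory without listing the output directory: with hash distribution off, the upper bound on files is roughly (distinct partition-key combinations) * (number of writer fragments). A rough estimate against the source table, assuming 8 fragments (4 per node * 2 nodes, per the session settings above):

select count(*) * 8 as approx_max_files
from (select distinct l_shipdate, l_receiptdate
      from dfs.`/drill/testdata/tpch100/lineitem`) sub;

That works out to 75780 * 8, about 606K files, which could plausibly be enough to push a client-side "hadoop fs -count" into GC trouble.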
