Hi,

Is it possible that the combination of (l_shipdate, l_receiptdate) values
has a very high cardinality? If so, I would think each partition file ends
up holding only a small subset of the data.
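Back-of-the-envelope, combining that with Steven's point below: with 2
nodes and planner.width.max_per_node = 4, there would be 8 writer
fragments, and if every fragment writes a file for every partition you
would get roughly 75780 x 8 = 606,240 files. (That assumes the writer
width matches the query width, which I have not verified.)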
Please keep in mind that I know nothing about TPCH SF100 and only a little
about Drill :).

Regards,
-Stefan

On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:

> It would be helpful if you could figure out what the file count is. But
> here are some thoughts:
>
> What is the value of the option:
> store.partition.hash_distribute
>
> If it is false, which it is by default, then every fragment will
> potentially have data in every partition. In this case, that could
> increase the number of files by a factor of 8.
>
> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Drillers,
> >
> > I executed the query below on TPCH SF100 with Drill, and it took ~2 hrs
> > to complete on a 2-node cluster.
> >
> > alter session set `planner.width.max_per_node` = 4;
> > alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> > create table lineitem partition by (l_shipdate, l_receiptdate) as
> > select * from dfs.`/drill/testdata/tpch100/lineitem`;
> >
> > The query below returned 75780, so I expected Drill to create the same
> > number of files, or maybe a few more. But Drill created so many files
> > that a "hadoop fs -count" command failed with "GC overhead limit
> > exceeded". (I did not change the default parquet block size.)
> >
> > select count(*) from (select l_shipdate, l_receiptdate from
> > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> > l_receiptdate) sub;
> >
> > +---------+
> > | EXPR$0  |
> > +---------+
> > | 75780   |
> > +---------+
> >
> > Any thoughts on why Drill is creating so many files?
> >
> > - Rahul
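In case it is useful, here is a minimal sketch of the hash-distributed
variant Steven describes (untested on my end, and assuming
store.partition.hash_distribute is the boolean session option he refers
to, with false as the default):

-- Route all rows for a given (l_shipdate, l_receiptdate) combination to a
-- single writer fragment, so the file count stays near the partition
-- count instead of partition count x fragment count.
alter session set `store.partition.hash_distribute` = true;

create table lineitem partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

The hash distribution presumably adds an exchange before the writer, so it
may trade some write throughput for the smaller file count.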
