Well, this is for generating some test data. - Rahul
On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <[email protected]> wrote:

> Looks like Drill is doing the partitioning as requested then. May not be
> optimal though.
>
> Is there a reason why you want to subpartition this much? You may be
> better off to just partition by l_shipdate (not shipmate, autocorrect got
> me there). Or use columns with much lower cardinality to test
> subpartitioning.
>
> —Andries
>
>
> On Aug 26, 2015, at 3:05 PM, rahul challapalli <[email protected]> wrote:
>
>> Steven,
>>
>> You were right. The count is 606240, which is 8 * 75780.
>>
>> Stefan & Andries,
>>
>> Below is the distinct count, or cardinality:
>>
>> select count(*) from (select l_shipdate, l_receiptdate from
>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>> l_receiptdate) sub;
>> +---------+
>> | EXPR$0  |
>> +---------+
>> | 75780   |
>> +---------+
>>
>> - Rahul
>>
>>
>> On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:
>>
>>> What is the distinct count for these columns? IIRC TPC-H has at least
>>> 5 years of data irrespective of SF, so you are requesting a lot of
>>> partitions. 76K sounds about right for 5 years of TPC-H l_shipdate and
>>> correlated receipt date data; your query doesn't count the actual
>>> files.
>>>
>>> Try to partition just on the l_shipdate column first.
>>>
>>> —Andries
>>>
>>>
>>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is it possible that the combination values of (l_shipdate,
>>>> l_receiptdate) have a very high cardinality? I would think you are
>>>> creating partition files that each hold only a small subset of the
>>>> data.
>>>>
>>>> Please keep in mind that I know nothing about TPCH SF100 and only a
>>>> little about Drill :).
>>>>
>>>> Regards,
>>>> -Stefan
>>>>
>>>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
>>>>
>>>>> It would be helpful if you could figure out what the file count is.
>>>>> But here are some thoughts:
>>>>>
>>>>> What is the value of the option store.partition.hash_distribute?
>>>>>
>>>>> If it is false, which it is by default, then every fragment will
>>>>> potentially have data in every partition. In this case, that could
>>>>> increase the number of files by a factor of 8.
>>>>>
>>>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>>>>>
>>>>>> Drillers,
>>>>>>
>>>>>> I executed the below query on TPCH SF100 with Drill, and it took
>>>>>> ~2 hrs to complete on a 2-node cluster.
>>>>>>
>>>>>> alter session set `planner.width.max_per_node` = 4;
>>>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>>>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as
>>>>>> select * from dfs.`/drill/testdata/tpch100/lineitem`;
>>>>>>
>>>>>> The below query returned 75780, so I expected Drill to create the
>>>>>> same number of files, or maybe a few more. But Drill created so many
>>>>>> files that a "hadoop fs -count" command failed with a "GC overhead
>>>>>> limit exceeded" error. (I did not change the default parquet block
>>>>>> size.)
>>>>>>
>>>>>> select count(*) from (select l_shipdate, l_receiptdate from
>>>>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>>>>>> l_receiptdate) sub;
>>>>>> +---------+
>>>>>> | EXPR$0  |
>>>>>> +---------+
>>>>>> | 75780   |
>>>>>> +---------+
>>>>>>
>>>>>> Any thoughts on why Drill is creating so many files?
>>>>>>
>>>>>> - Rahul
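For anyone hitting the same issue: Steven's point maps to a single session
option. A minimal sketch, assuming the 8x multiplier comes from the 8 writer
fragments in this setup (2 nodes x planner.width.max_per_node = 4):

    -- Hash-distribute rows on the partition keys before the writers run,
    -- so each (l_shipdate, l_receiptdate) combination is written by one
    -- fragment instead of potentially all 8.
    alter session set `store.partition.hash_distribute` = true;

    create table lineitem partition by (l_shipdate, l_receiptdate) as
    select * from dfs.`/drill/testdata/tpch100/lineitem`;

With hash distribution enabled, the file count should drop back toward the
75780 distinct key combinations, at the cost of an extra network exchange
during the CTAS.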

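Andries' alternative, partitioning on l_shipdate alone, is the same CTAS with
a single, much lower-cardinality key (a few thousand distinct dates over the
multi-year TPC-H date range, versus 75780 combinations):

    create table lineitem partition by (l_shipdate) as
    select * from dfs.`/drill/testdata/tpch100/lineitem`;

And if "hadoop fs -count" keeps failing with "GC overhead limit exceeded", one
workaround (a sketch, assuming a stock Hadoop client; the path and heap size
are examples, not values from this thread) is to give that single client
invocation a larger heap:

    # HADOOP_CLIENT_OPTS is read by the hadoop CLI wrapper script;
    # 4g is an arbitrary example value.
    HADOOP_CLIENT_OPTS="-Xmx4g" hadoop fs -count /path/to/created/lineitem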