Steven,

You were right. The file count is 606240, which is 8 * 75780 (presumably the
planner.width.max_per_node = 4 fragments per node across 2 nodes = 8 writer
fragments you mentioned).


Stefan & Andries,

Below is the distinct count (cardinality):

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate) sub;
+---------+
| EXPR$0  |
+---------+
| 75780   |
+---------+
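
(Given that, a minimal sketch of the fix Steven's mail below points at; the
option name is from his mail, and the table name lineitem_hd is only
illustrative:)

alter session set `store.partition.hash_distribute` = true;
-- with hash distribution, each partition value should be written by a single
-- fragment, so the file count should track the 75780 combinations more closely
create table lineitem_hd partition by (l_shipdate, l_receiptdate) as  -- hypothetical name
select * from dfs.`/drill/testdata/tpch100/lineitem`;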

- Rahul





On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <
[email protected]> wrote:

> What is the distinct count for these columns? IIRC TPC-H has at least 5
> years of data irrespective of SF, so you are requesting a lot of
> partitions. 76K sounds about right for 5 years of TPC-H ship date and
> correlating receipt date data; your query doesn’t count the actual files.
>
> Try to partition just on the l_shipdate column first.
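>
> (A minimal sketch of such a single-column CTAS, reusing the source path from
> Rahul's mail; the table name lineitem_by_shipdate is just illustrative:)
>
> create table lineitem_by_shipdate partition by (l_shipdate) as  -- illustrative name
> select * from dfs.`/drill/testdata/tpch100/lineitem`;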
>
> —Andries
>
>
> > On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]>
> wrote:
> >
> > Hi,
> >
> > Is it possible that the combinations of (l_shipdate, l_receiptdate)
> > values have a very high cardinality?
> > If so, I would think you are creating partition files that each hold
> > only a small subset of the data.
> >
> > Please keep in mind that I know nothing about TPCH SF100 and only a
> little
> > about Drill :).
> >
> > Regards,
> > -Stefan
> >
> > On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
> >
> >> It would be helpful if you could figure out what the file count is. But
> >> here are some thoughts:
> >>
> >> What is the value of the option:
> >> store.partition.hash_distribute
> >>
> >> If it is false, which it is by default, then every fragment will
> >> potentially have data in every partition. In this case, that could
> >> increase the number of files by a factor of 8.
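> >>
> >> (A sketch of how to check it, assuming this Drill build exposes the
> >> sys.options table with a bool_val column:)
> >>
> >> select name, bool_val from sys.options
> >> where name = 'store.partition.hash_distribute';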
> >>
> >> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> >> [email protected]> wrote:
> >>
> >>> Drillers,
> >>>
> >>> I executed the below query on TPCH SF100 with Drill and it took ~2hrs
> >>> to complete on a 2-node cluster.
> >>>
> >>> alter session set `planner.width.max_per_node` = 4;
> >>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> >>> create table lineitem partition by (l_shipdate, l_receiptdate) as
> >>> select * from dfs.`/drill/testdata/tpch100/lineitem`;
> >>>
> >>> The below query returned 75780, so I expected Drill to create the same
> >>> number of files, or maybe a little more. But Drill created so many files
> >>> that a "hadoop fs -count" command failed with a "GC overhead limit
> >>> exceeded" error. (I did not change the default Parquet block size.)
> >>>
> >>> select count(*) from (select l_shipdate, l_receiptdate from
> >>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> >>> l_receiptdate) sub;
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 75780   |
> >>> +---------+
> >>>
> >>>
> >>> Any thoughts on why Drill is creating so many files?
> >>>
> >>> - Rahul
> >>>
> >>
>
>
