Steven,

You were right. The count is 606240, which is 8 * 75780.
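For anyone following along, the factor of 8 lines up with the writer parallelism: 2 nodes * `planner.width.max_per_node` = 4 gives 8 writing fragments, and with `store.partition.hash_distribute` = false each fragment can emit its own file per partition value. A quick sanity check (sketch; the VALUES-based FROM clause is just one way to get a one-row input):

```sql
-- 8 writer fragments * 75780 distinct (l_shipdate, l_receiptdate) pairs
select 8 * 75780 as worst_case_file_count from (values (1));
-- expected: 606240
```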
Stefan & Andries,

Below is the distinct count (cardinality):

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate) sub;
+---------+
| EXPR$0  |
+---------+
| 75780   |
+---------+

- Rahul

On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:

> What is the distinct count for these columns? IIRC TPC-H has at least 5
> years of data irrespective of SF, so you are requesting a lot of
> partitions. 76K sounds about right for 5 years of TPC-H shipdate and
> correlated receipt date data; your query doesn't count the actual files.
>
> Try to partition just on the shipdate column first.
>
> —Andries
>
>
> > On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
> >
> > Hi,
> >
> > Is it possible that the combination values of (l_shipdate,
> > l_receiptdate) have a very high cardinality?
> > I would think you are creating partition files for a small subset of the
> > data.
> >
> > Please keep in mind that I know nothing about TPCH SF100 and only a
> > little about Drill :).
> >
> > Regards,
> > -Stefan
> >
> > On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
> >
> >> It would be helpful if you could figure out what the file count is. But
> >> here are some thoughts:
> >>
> >> What is the value of the option:
> >> store.partition.hash_distribute
> >>
> >> If it is false, which it is by default, then every fragment will
> >> potentially have data in every partition. In this case, that could
> >> increase the number of files by a factor of 8.
> >>
> >> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
> >>
> >>> Drillers,
> >>>
> >>> I executed the below query on TPCH SF100 with drill and it took ~2hrs
> >>> to complete on a 2 node cluster.
> >>>
> >>> alter session set `planner.width.max_per_node` = 4;
> >>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> >>> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
> >>> from dfs.`/drill/testdata/tpch100/lineitem`;
> >>>
> >>> The below query returned 75780, so I expected drill to create the same
> >>> number of files, or maybe a few more. But drill created so many files
> >>> that a "hadoop fs -count" command failed with a "GC overhead limit
> >>> exceeded". (I did not change the default parquet block size.)
> >>>
> >>> select count(*) from (select l_shipdate, l_receiptdate from
> >>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> >>> l_receiptdate) sub;
> >>> +---------+
> >>> | EXPR$0  |
> >>> +---------+
> >>> | 75780   |
> >>> +---------+
> >>>
> >>> Any thoughts on why drill is creating so many files?
> >>>
> >>> - Rahul
> >>
> >
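Putting Steven's and Andries's suggestions together, a sketch of a retry (the option name is from Steven's mail; the target table name here is illustrative):

```sql
-- Hash-distribute rows on the partition key before writing, so each
-- partition is written by a single fragment instead of all 8 writers,
-- and partition on the lower-cardinality l_shipdate alone.
alter session set `store.partition.hash_distribute` = true;
create table lineitem_by_shipdate partition by (l_shipdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;
```

With this setup the file count should stay close to the number of distinct l_shipdate values, at the cost of an extra hash exchange before the writer.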
