Looks like Drill is doing the partitioning as requested, then. It may not be optimal, though.
Is there a reason why you want to subpartition this much? You may be better off just partitioning by l_shipdate (not shipmate, autocorrect got me there). Or use columns with much lower cardinality to test subpartitioning.

—Andries

> On Aug 26, 2015, at 3:05 PM, rahul challapalli <[email protected]> wrote:
>
> Steven,
>
> You were right. The count is 606240, which is 8 * 75780.
>
>
> Stefan & Andries,
>
> Below is the distinct count, or cardinality:
>
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate) sub;
> +---------+
> | EXPR$0  |
> +---------+
> | 75780   |
> +---------+
>
> - Rahul
>
>
> On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:
>
>> What is the distinct count for these columns? IIRC TPC-H has at least 5
>> years of data irrespective of SF, so you are requesting a lot of
>> partitions. 76K sounds about right for 5 years of TPCH shipmate and
>> correlating receipt date data; your query doesn't count the actual files.
>>
>> Try to partition just on the shipmate column first.
>>
>> —Andries
>>
>>
>>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Is it possible that the combination values of (l_shipdate, l_receiptdate)
>>> have a very high cardinality?
>>> I would think you are creating partition files that each hold only a
>>> small subset of the data.
>>>
>>> Please keep in mind that I know nothing about TPCH SF100 and only a
>>> little about Drill :).
>>>
>>> Regards,
>>> -Stefan
>>>
>>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
>>>
>>>> It would be helpful if you could figure out what the file count is. But
>>>> here are some thoughts:
>>>>
>>>> What is the value of the option:
>>>> store.partition.hash_distribute
>>>>
>>>> If it is false, which it is by default, then every fragment will
>>>> potentially have data in every partition. In this case, that could
>>>> increase the number of files by a factor of 8.
>>>>
>>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>>>>
>>>>> Drillers,
>>>>>
>>>>> I executed the below query on TPCH SF100 with Drill, and it took ~2 hrs
>>>>> to complete on a 2-node cluster.
>>>>>
>>>>> alter session set `planner.width.max_per_node` = 4;
>>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
>>>>> from dfs.`/drill/testdata/tpch100/lineitem`;
>>>>>
>>>>> The below query returned 75780, so I expected Drill to create the same
>>>>> number of files, or maybe a few more. But Drill created so many files that
>>>>> a "hadoop fs -count" command failed with "GC overhead limit exceeded". (I
>>>>> did not change the default parquet block size.)
>>>>>
>>>>> select count(*) from (select l_shipdate, l_receiptdate from
>>>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
>>>>> sub;
>>>>> +---------+
>>>>> | EXPR$0  |
>>>>> +---------+
>>>>> | 75780   |
>>>>> +---------+
>>>>>
>>>>> Any thoughts on why Drill is creating so many files?
>>>>>
>>>>> - Rahul
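
Steven's explanation matches the numbers exactly: with store.partition.hash_distribute left at its default of false, each of the 8 writer fragments (2 nodes x planner.width.max_per_node = 4) writes its own file for every partition value it sees, and 8 * 75780 = 606240. A minimal sketch of the alternative he points at, assuming the same source path as above; the target table name lineitem_hd is made up for illustration:

alter session set `store.partition.hash_distribute` = true;
-- With hash distribution enabled, rows are redistributed on the
-- partition keys before the write, so each (l_shipdate, l_receiptdate)
-- value is written by a single fragment and the file count stays close
-- to the 75780 distinct combinations.
create table lineitem_hd partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

The trade-off is the extra redistribution step, which adds a network shuffle to an already long-running CTAS.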
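
Andries' suggestion from the top of the thread, as a sketch: partitioning on l_shipdate alone caps the partition count at the number of distinct ship dates (a few thousand for TPC-H, versus 75780 for the column pair). The table name lineitem_by_shipdate is made up for illustration:

create table lineitem_by_shipdate partition by (l_shipdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;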

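As a sanity check on the cardinality itself: the TPC-H spec generates L_RECEIPTDATE as L_SHIPDATE plus a random offset of 1 to 30 days, so the pair's cardinality should be close to the number of distinct ship dates times 30, and 75780 / 30 = 2526. A query in the same style as the one in the thread would confirm the ship-date count:

select count(*) from (select l_shipdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate) sub;
-- Expect roughly 2526 if every ship date sees the full 30-day receipt window.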