Well, this is for generating some test data. - Rahul
On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <[email protected]> wrote:

> Looks like Drill is doing the partitioning as requested then. May not be
> optimal though.
>
> Is there a reason why you want to subpartition this much? You may be
> better off to just partition by l_shipdate (not shipmate, autocorrect got
> me there). Or use columns with much lower cardinality to test
> subpartitioning.
>
> —Andries
>
>
> On Aug 26, 2015, at 3:05 PM, rahul challapalli <[email protected]> wrote:
>
>> Steven,
>>
>> You were right. The count is 606240, which is 8 * 75780.
>>
>> Stefan & Andries,
>>
>> Below is the distinct count, or cardinality:
>>
>> select count(*) from (select l_shipdate, l_receiptdate from
>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>> l_receiptdate) sub;
>> +---------+
>> | EXPR$0  |
>> +---------+
>> | 75780   |
>> +---------+
>>
>> - Rahul
>>
>>
>> On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:
>>
>>> What is the distinct count for these columns? IIRC TPC-H has at least
>>> 5 years of data irrespective of SF, so you are requesting a lot of
>>> partitions. 76K sounds about right for 5 years of TPC-H l_shipdate and
>>> correlated receipt date data; your query doesn't count the actual
>>> files.
>>>
>>> Try to partition just on the l_shipdate column first.
>>>
>>> —Andries
>>>
>>>
>>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is it possible that the combination values of (l_shipdate,
>>>> l_receiptdate) have a very high cardinality? I would think you are
>>>> creating partition files that each hold only a small subset of the
>>>> data.
>>>>
>>>> Please keep in mind that I know nothing about TPCH SF100 and only a
>>>> little about Drill :).
>>>>
>>>> Regards,
>>>> -Stefan
>>>>
>>>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
>>>>
>>>>> It would be helpful if you could figure out what the file count is.
>>>>> But here are some thoughts:
>>>>>
>>>>> What is the value of the option store.partition.hash_distribute?
>>>>>
>>>>> If it is false, which it is by default, then every fragment will
>>>>> potentially have data in every partition. In this case, that could
>>>>> increase the number of files by a factor of 8.
>>>>>
>>>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>>>>>
>>>>>> Drillers,
>>>>>>
>>>>>> I executed the below query on TPCH SF100 with Drill, and it took
>>>>>> ~2 hrs to complete on a 2-node cluster.
>>>>>>
>>>>>> alter session set `planner.width.max_per_node` = 4;
>>>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>>>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as
>>>>>> select * from dfs.`/drill/testdata/tpch100/lineitem`;
>>>>>>
>>>>>> The below query returned 75780, so I expected Drill to create the
>>>>>> same number of files, or maybe a few more. But Drill created so many
>>>>>> files that a "hadoop fs -count" command failed with a "GC overhead
>>>>>> limit exceeded" error. (I did not change the default parquet block
>>>>>> size.)
>>>>>>
>>>>>> select count(*) from (select l_shipdate, l_receiptdate from
>>>>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
>>>>>> l_receiptdate) sub;
>>>>>> +---------+
>>>>>> | EXPR$0  |
>>>>>> +---------+
>>>>>> | 75780   |
>>>>>> +---------+
>>>>>>
>>>>>> Any thoughts on why Drill is creating so many files?
>>>>>>
>>>>>> - Rahul
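For anyone hitting the same issue: Steven's point maps to a single session
option. A minimal sketch, assuming the 8x multiplier comes from the 8 writer
fragments in this setup (2 nodes x planner.width.max_per_node = 4):

    -- Hash-distribute rows on the partition keys before the writers run,
    -- so each (l_shipdate, l_receiptdate) combination is written by one
    -- fragment instead of potentially all 8.
    alter session set `store.partition.hash_distribute` = true;

    create table lineitem partition by (l_shipdate, l_receiptdate) as
    select * from dfs.`/drill/testdata/tpch100/lineitem`;

With hash distribution enabled, the file count should drop back toward the
75780 distinct key combinations, at the cost of an extra network exchange
during the CTAS.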

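Andries' alternative, partitioning on l_shipdate alone, is the same CTAS with
a single, much lower-cardinality key (a few thousand distinct dates over the
multi-year TPC-H date range, versus 75780 combinations):

    create table lineitem partition by (l_shipdate) as
    select * from dfs.`/drill/testdata/tpch100/lineitem`;

And if "hadoop fs -count" keeps failing with "GC overhead limit exceeded", one
workaround (a sketch, assuming a stock Hadoop client; the path and heap size
are examples, not values from this thread) is to give that single client
invocation a larger heap:

    # HADOOP_CLIENT_OPTS is read by the hadoop CLI wrapper script;
    # 4g is an arbitrary example value.
    HADOOP_CLIENT_OPTS="-Xmx4g" hadoop fs -count /path/to/created/lineitem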