What is the distinct count for this columns? IIRC TPC-H has at least 5 years of data irrespective of SF, so you are requesting a lot of partitions. 76K sounds about right for 5 years of TPCH shipmate and correlating receipt date data, your query doesn’t count the actual files.
Try to partition just on the shipmate column first. —Andries > On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote: > > Hi, > > Is it possible that the combination values of (l_shipdate, > l_receiptdate) have a very high cardinality? > I would think you are creating partition files for a small subset of the > data. > > Please keep in mind that I know nothing about TPCH SF100 and only a little > about Drill :). > > Regards, > -Stefan > > On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote: > >> It would be helpful if you could figure out what the file count is. But >> here are some thoughs: >> >> What is the value of the option: >> store.partition.hash_distribute >> >> If it is false, which it is by default, then every fragment will >> potentially have data in every partition. In this case, that could increase >> the number of files by a factor of 8. >> >> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli < >> [email protected]> wrote: >> >>> Drillers, >>> >>> I executed the below query on TPCH SF100 with drill and it took ~2hrs to >>> complete on a 2 node cluster. >>> >>> alter session set `planner.width.max_per_node` = 4; >>> alter session set `planner.memory.max_query_memory_per_node` = >> 8147483648; >>> create table lineitem partition by (l_shipdate, l_receiptdate) as select >> * >>> from dfs.`/drill/testdata/tpch100/lineitem`; >>> >>> The below query returned 75780, so I expected drill to create the same no >>> of files or may be a little more. But drill created so many files that a >>> "hadoop fs -count" command failed with a "GC overhead limit exceeded". (I >>> did not change the default parquet block size) >>> >>> select count(*) from (select l_shipdate, l_receiptdate from >>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, >> l_receiptdate) >>> sub; >>> +---------+ >>> | EXPR$0 | >>> +---------+ >>> | 75780 | >>> +---------+ >>> >>> >>> Any thoughts on why drill is creating so many files? >>> >>> - Rahul >>> >>
