It would be helpful if you could figure out what the actual file count is. But here are some thoughts:

What is the value of the option store.partition.hash_distribute? If it is false, which it is by default, then every fragment will potentially have data in every partition and will write its own file for each partition value it sees. In this case, with planner.width.max_per_node = 4 on a 2-node cluster (8 writer fragments), that could increase the number of files by a factor of 8.
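You can check the current value from sys.options and flip it before re-running the CTAS. A minimal sketch (untested against your cluster; note that hash distribution routes each partition value to a single writer, so whether it helps depends on how skewed the partition keys are):

select * from sys.options where name = 'store.partition.hash_distribute';

alter session set `store.partition.hash_distribute` = true;
create table lineitem partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

With the option enabled you should see roughly one file per distinct (l_shipdate, l_receiptdate) pair, plus any size-based splits, instead of up to one per pair per fragment.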
On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
[email protected]> wrote:

> Drillers,
>
> I executed the below query on TPCH SF100 with Drill and it took ~2 hrs to
> complete on a 2-node cluster.
>
> alter session set `planner.width.max_per_node` = 4;
> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
> from dfs.`/drill/testdata/tpch100/lineitem`;
>
> The below query returned 75780, so I expected Drill to create the same
> number of files, or maybe a few more. But Drill created so many files that
> a "hadoop fs -count" command failed with "GC overhead limit exceeded". (I
> did not change the default Parquet block size.)
>
> select count(*) from (select l_shipdate, l_receiptdate from
> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
> sub;
> +---------+
> | EXPR$0  |
> +---------+
> | 75780   |
> +---------+
>
> Any thoughts on why Drill is creating so many files?
>
> - Rahul
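One way to sanity-check this theory without listing the output directory: with hash distribution off, the upper bound on files is roughly (distinct partition-key combinations) * (number of writer fragments). A rough estimate against the source table, assuming 8 fragments (4 per node * 2 nodes, per the session settings above):

select count(*) * 8 as approx_max_files
from (select distinct l_shipdate, l_receiptdate
      from dfs.`/drill/testdata/tpch100/lineitem`) sub;

That works out to 75780 * 8, about 606K files, which could plausibly be enough to push a client-side "hadoop fs -count" into GC trouble.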
