Hi,

Is it possible that the combination of (l_shipdate, l_receiptdate) values
has a very high cardinality? If so, I would think each partition file ends
up holding only a small subset of the data.
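Back-of-the-envelope, combining that with Steven's point below: with 2
nodes and planner.width.max_per_node = 4, there would be 8 writer
fragments, and if every fragment writes a file for every partition you
would get roughly 75780 x 8 = 606,240 files. (That assumes the writer
width matches the query width, which I have not verified.)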
Please keep in mind that I know nothing about TPCH SF100 and only a little
about Drill :).

Regards,
-Stefan

On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:

> It would be helpful if you could figure out what the file count is. But
> here are some thoughts:
>
> What is the value of the option:
> store.partition.hash_distribute
>
> If it is false, which it is by default, then every fragment will
> potentially have data in every partition. In this case, that could
> increase the number of files by a factor of 8.
>
> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Drillers,
> >
> > I executed the query below on TPCH SF100 with Drill, and it took ~2 hrs
> > to complete on a 2-node cluster.
> >
> > alter session set `planner.width.max_per_node` = 4;
> > alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
> > create table lineitem partition by (l_shipdate, l_receiptdate) as
> > select * from dfs.`/drill/testdata/tpch100/lineitem`;
> >
> > The query below returned 75780, so I expected Drill to create the same
> > number of files, or maybe a few more. But Drill created so many files
> > that a "hadoop fs -count" command failed with "GC overhead limit
> > exceeded". (I did not change the default parquet block size.)
> >
> > select count(*) from (select l_shipdate, l_receiptdate from
> > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate,
> > l_receiptdate) sub;
> >
> > +---------+
> > | EXPR$0  |
> > +---------+
> > | 75780   |
> > +---------+
> >
> > Any thoughts on why Drill is creating so many files?
> >
> > - Rahul
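In case it is useful, here is a minimal sketch of the hash-distributed
variant Steven describes (untested on my end, and assuming
store.partition.hash_distribute is the boolean session option he refers
to, with false as the default):

-- Route all rows for a given (l_shipdate, l_receiptdate) combination to a
-- single writer fragment, so the file count stays near the partition
-- count instead of partition count x fragment count.
alter session set `store.partition.hash_distribute` = true;

create table lineitem partition by (l_shipdate, l_receiptdate) as
select * from dfs.`/drill/testdata/tpch100/lineitem`;

The hash distribution presumably adds an exchange before the writer, so it
may trade some write throughput for the smaller file count.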
