Drillers,

I ran the CTAS statement below on TPC-H SF100 with Drill, and it took
~2 hrs to complete on a 2-node cluster.

alter session set `planner.width.max_per_node` = 4;
alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
create table lineitem partition by (l_shipdate, l_receiptdate) as select *
from dfs.`/drill/testdata/tpch100/lineitem`;

The query below returned 75780, so I expected Drill to create about that
many files, or maybe a few more. But Drill created so many files that a
"hadoop fs -count" command failed with "GC overhead limit exceeded". (I
did not change the default Parquet block size.)

select count(*) from (select l_shipdate, l_receiptdate from
dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate)
sub;
+---------+
| EXPR$0  |
+---------+
| 75780   |
+---------+
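
For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes (this is not confirmed anywhere in this message) that each writer
fragment produces its own file for every partition value it receives, and
that the writers run at the full width set above (4 per node on 2 nodes).
Under those assumptions the file count could reach several times the
number of distinct partition values:

```python
# Back-of-the-envelope estimate only; the per-fragment writer behaviour
# is an assumption, not something verified against Drill itself.
distinct_partitions = 75780   # result of the count(*) query above
nodes = 2                     # cluster size from the message
width_per_node = 4            # planner.width.max_per_node set above


def max_files(partitions, node_count, width):
    """Upper bound if every writer fragment sees every partition value."""
    return partitions * node_count * width


upper_bound = max_files(distinct_partitions, nodes, width_per_node)
print(upper_bound)
```

This is only an upper bound under the stated assumptions; splitting of
large partitions across Parquet blocks is not modelled.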


Any thoughts on why Drill is creating so many files?

- Rahul
