Steven, Jason:

Below is my understanding of when we should spill to disk while performing a sort. Let me know if I am missing anything.
alter session set `planner.width.max_per_node` = 4;
alter session set `planner.memory.max_query_memory_per_node` = 8147483648; (~8GB)
create table lineitem partition by (l_shipdate, l_receiptdate) as select * from dfs.`/drill/testdata/tpch100/lineitem`;

1. The above query creates 4 minor fragments, and each minor fragment gets ~2GB for the sort phase.
2. Once a minor fragment consumes ~2GB of memory, it starts spilling each partition into a separate file on disk.
3. The spilled files would be of different sizes.
4. If it is a regular CTAS (with no partition by clause), each spilled file should be approximately 2GB in size.

I just have a hunch that we are spilling a little early :)

- Rahul

On Wed, Aug 26, 2015 at 4:49 PM, rahul challapalli <[email protected]> wrote:

> Jason,
>
> What you described is exactly my understanding.
>
> I did kick off a run after setting `store.partition.hash_distribute`. It is
> still running. I am expecting the number of files to be slightly more than
> or equal to 75780. (As the default parquet block size should be sufficient
> for most of the partitions)
>
> - Rahul
>
>
> On Wed, Aug 26, 2015 at 4:36 PM, Jason Altekruse <[email protected]> wrote:
>
>> I feel like there is a little misunderstanding here.
>>
>> Rahul, did you try setting the option that Steven suggested?
>> `store.partition.hash_distribute`
>>
>> This will cause a re-distribution of the data so that the rows that belong
>> in a particular partition will all be written by a single writer. They will
>> not necessarily be all in one file, as we have a limit on file sizes and I
>> don't think we cap partition size.
>>
>> The default behavior is not to re-distribute, because it is expensive. This
>> however means that every fragment will write out a file for whichever keys
>> appear in the data that ends up at that fragment.
>>
>> If there is a large number of fragments and the data is spread out pretty
>> randomly, then there is a reasonable case for turning on this option to
>> co-locate the data for a single partition on a single writer to reduce the
>> number of smaller files. There is no magic formula for when it is best to
>> turn on this option, but in most cases it will reduce the number of files
>> produced.
>>
>>
>> On Wed, Aug 26, 2015 at 3:48 PM, rahul challapalli <[email protected]> wrote:
>>
>> > Well, this is for generating some test data.
>> >
>> > - Rahul
>> >
>> > On Wed, Aug 26, 2015 at 3:47 PM, Andries Engelbrecht <[email protected]> wrote:
>> >
>> > > Looks like Drill is doing the partitioning as requested then. May not
>> > > be optimal though.
>> > >
>> > > Is there a reason why you want to subpartition this much? You may be
>> > > better off to just partition by l_shipdate (not shipmate, autocorrect
>> > > got me there). Or use columns with much lower cardinality to test
>> > > subpartitioning.
>> > >
>> > > —Andries
>> > >
>> > >
>> > > > On Aug 26, 2015, at 3:05 PM, rahul challapalli <[email protected]> wrote:
>> > > >
>> > > > Steven,
>> > > >
>> > > > You were right. The count is 606240, which is 8*75780.
>> > > >
>> > > > Stefan & Andries,
>> > > >
>> > > > Below is the distinct count or cardinality:
>> > > >
>> > > > select count(*) from (select l_shipdate, l_receiptdate from
>> > > > dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate) sub;
>> > > > +---------+
>> > > > | EXPR$0  |
>> > > > +---------+
>> > > > | 75780   |
>> > > > +---------+
>> > > >
>> > > > - Rahul
>> > > >
>> > > >
>> > > > On Wed, Aug 26, 2015 at 1:26 PM, Andries Engelbrecht <[email protected]> wrote:
>> > > >
>> > > >> What is the distinct count for these columns? IIRC TPC-H has at least 5
>> > > >> years of data irrespective of SF, so you are requesting a lot of
>> > > >> partitions. 76K sounds about right for 5 years of TPCH shipmate and
>> > > >> correlating receipt date data; your query doesn't count the actual files.
>> > > >>
>> > > >> Try to partition just on the shipmate column first.
>> > > >>
>> > > >> —Andries
>> > > >>
>> > > >>
>> > > >>> On Aug 26, 2015, at 12:34 PM, Stefán Baxter <[email protected]> wrote:
>> > > >>>
>> > > >>> Hi,
>> > > >>>
>> > > >>> Is it possible that the combination of (l_shipdate, l_receiptdate)
>> > > >>> values has a very high cardinality?
>> > > >>> I would think you are creating partition files for a small subset of
>> > > >>> the data.
>> > > >>>
>> > > >>> Please keep in mind that I know nothing about TPCH SF100 and only a
>> > > >>> little about Drill :).
>> > > >>>
>> > > >>> Regards,
>> > > >>> -Stefan
>> > > >>>
>> > > >>> On Wed, Aug 26, 2015 at 7:30 PM, Steven Phillips <[email protected]> wrote:
>> > > >>>
>> > > >>>> It would be helpful if you could figure out what the file count is. But
>> > > >>>> here are some thoughts:
>> > > >>>>
>> > > >>>> What is the value of the option:
>> > > >>>> store.partition.hash_distribute
>> > > >>>>
>> > > >>>> If it is false, which it is by default, then every fragment will
>> > > >>>> potentially have data in every partition. In this case, that could increase
>> > > >>>> the number of files by a factor of 8.
>> > > >>>>
>> > > >>>> On Wed, Aug 26, 2015 at 12:21 PM, rahul challapalli <[email protected]> wrote:
>> > > >>>>
>> > > >>>>> Drillers,
>> > > >>>>>
>> > > >>>>> I executed the below query on TPCH SF100 with drill and it took ~2hrs to
>> > > >>>>> complete on a 2 node cluster.
>> > > >>>>>
>> > > >>>>> alter session set `planner.width.max_per_node` = 4;
>> > > >>>>> alter session set `planner.memory.max_query_memory_per_node` = 8147483648;
>> > > >>>>> create table lineitem partition by (l_shipdate, l_receiptdate) as select *
>> > > >>>>> from dfs.`/drill/testdata/tpch100/lineitem`;
>> > > >>>>>
>> > > >>>>> The below query returned 75780, so I expected drill to create the same
>> > > >>>>> number of files, or maybe a little more. But drill created so many files
>> > > >>>>> that a "hadoop fs -count" command failed with a "GC overhead limit exceeded".
>> > > >>>>> (I did not change the default parquet block size)
>> > > >>>>>
>> > > >>>>> select count(*) from (select l_shipdate, l_receiptdate from
>> > > >>>>> dfs.`/drill/testdata/tpch100/lineitem` group by l_shipdate, l_receiptdate) sub;
>> > > >>>>> +---------+
>> > > >>>>> | EXPR$0  |
>> > > >>>>> +---------+
>> > > >>>>> | 75780   |
>> > > >>>>> +---------+
>> > > >>>>>
>> > > >>>>> Any thoughts on why drill is creating so many files?
>> > > >>>>>
>> > > >>>>> - Rahul
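
A rough sanity check of the numbers in the thread above, assuming the factor of 8 Steven mentions is simply the total number of minor fragments writing files (2 nodes times `planner.width.max_per_node` = 4):

  2 nodes x 4 fragments per node = 8 minor fragments
  8 fragments x 75780 distinct (l_shipdate, l_receiptdate) combinations = 606,240 files
  8147483648 bytes / 4 fragments per node ~= 2GB of sort memory per minor fragment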
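
And a minimal sketch of the re-run Steven and Jason suggest, with hash distribution enabled so that all rows for a given partition key end up at a single writer. The table name lineitem_hd is just a placeholder, and querying sys.options is only one way to confirm the current setting; the other session options are the ones from the thread:

  -- confirm the current value of the option (false by default)
  select * from sys.options where name = 'store.partition.hash_distribute';

  -- enable hash distribution and re-run the CTAS under the same session settings
  alter session set `store.partition.hash_distribute` = true;
  alter session set `planner.width.max_per_node` = 4;
  alter session set `planner.memory.max_query_memory_per_node` = 8147483648;

  create table lineitem_hd partition by (l_shipdate, l_receiptdate) as
    select * from dfs.`/drill/testdata/tpch100/lineitem`;

With the option enabled, the expected file count should be close to the 75780 distinct key combinations, a little higher wherever a partition exceeds the parquet block size.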
