I opened DRILL-3765 for the multiple rule execution issue: https://issues.apache.org/jira/browse/DRILL-3765

On Thu, Sep 10, 2015 at 5:34 PM, Jinfeng Ni <[email protected]> wrote:
> Seems to me one important reason we run out of heap memory in the partition
> prune rule is that the rule itself is invoked multiple times, even after the
> filter has been pushed into the scan in the first call.
>
> I tried a simple unit test,
> TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is the
> number of times each partition rule fired in the Calcite trace:
>
> #_rule_fire, rule name
>
> 4 [PruneScanRule:Filter_On_Project_Parquet]
> 4 [PruneScanRule:Filter_On_Project]
>
> 2 [PruneScanRule:Filter_On_Scan_Parquet]
> 2 [PruneScanRule:Filter_On_Scan]
>
> Setting a breakpoint in PruneScanRule where it calls the interpreter to
> evaluate the expression, I could see that the code stops 6 times at that
> point, meaning that Drill has to build the vector containing the
> filenames at least 6 times. That would cause a lot of heap memory
> consumption if GC does not kick in to release the memory used in the prior
> rule's execution.
>
> I think making partition pruning multi-phase will help to reduce the
> memory consumption. But for now, it seems important to avoid the repeated
> and unnecessary rule execution.
>
> On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <[email protected]> wrote:
>>
>> Agree on the N-phased approach. I have filed a JIRA for the enhancement:
>> DRILL-3759.
>> Regarding the simplification of the expression tree logic, did you mean the
>> logic in FindPartitionConditions or the Interpreter?
>> Perhaps you can add comments in the JIRA with some explanation. I am in
>> favor of simplification where possible.
>>
>> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Makes sense.
>> >
>> > Is there a way we can do this with lazy materializations rather than
>> > writing complex expression tree logic? I hate having all this custom
>> > expression tree manipulation logic.
>> >
>> > Also, it seems like this should be N-phased rather than two-phased, where
>> > N is the number of directories below the base path.
>> >
>> > Thoughts?
>> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <[email protected]> wrote:
>> >
>> > > Currently, partition pruning gets all file names in the table and
>> > > applies the pruning. Suppose the files are spread out over several
>> > > directories and there is a filter on dirN; this is not efficient, both
>> > > in terms of elapsed time and memory usage. This has been seen in a few
>> > > use cases recently.
>> > >
>> > > We should ideally perform the pruning in 2 steps: first get the
>> > > top-level directory names only and apply the directory filter, then get
>> > > the filenames within that directory and apply the remaining filters.
>> > >
>> > > I will create a JIRA for this enhancement, but let me know your
>> > > thoughts...
>> > >
>> > > Aman
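
For illustration only, the N-phased idea discussed in this thread could be sketched roughly as below. This is not Drill code and does not use Drill or Calcite APIs: the dict-based directory tree, `prune_phased`, `count_entries`, and the per-level filter list are all hypothetical names chosen for the sketch, and Python is used only for brevity. It assumes one optional predicate per directory level (dir0, dir1, ...) and does not model the remaining non-directory filters.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for a file system (an assumption of this
# sketch): a directory is a dict of child name -> subtree, a file is None.
Tree = Dict[str, Optional[dict]]
DirFilter = Optional[Callable[[str], bool]]


def prune_phased(root: Tree, dir_filters: List[DirFilter]) -> Tuple[List[str], int]:
    """Prune one directory level at a time.

    dir_filters[i] is the predicate on directory names at level i (dir0,
    dir1, ...), or None when there is no filter on that level. Only the
    children of directories that survived the previous level are listed.
    Returns the surviving file paths and the number of names listed.
    """
    surviving_files: List[str] = []
    listed = 0
    frontier: List[Tuple[str, Tree]] = [("", root)]
    level = 0
    while frontier:
        keep = dir_filters[level] if level < len(dir_filters) else None
        next_frontier: List[Tuple[str, Tree]] = []
        for prefix, node in frontier:
            for name, child in sorted(node.items()):
                listed += 1
                path = f"{prefix}/{name}" if prefix else name
                if child is None:
                    # A file: kept here; remaining filters are not modeled.
                    surviving_files.append(path)
                elif keep is None or keep(name):
                    # A directory that passes this level's filter.
                    next_frontier.append((path, child))
        frontier = next_frontier
        level += 1
    return surviving_files, listed


def count_entries(root: Tree) -> int:
    """Number of names a full, unpruned walk of the tree would list."""
    return sum(1 + (count_entries(c) if c is not None else 0) for c in root.values())


# Made-up example table with two directory levels (dir0 = year, dir1 = quarter).
table: Tree = {
    "2014": {"Q1": {"a.parquet": None, "b.parquet": None}, "Q2": {"c.parquet": None}},
    "2015": {"Q1": {"d.parquet": None}, "Q2": {"e.parquet": None, "f.parquet": None}},
}
filters: List[DirFilter] = [lambda d: d == "2015", lambda d: d == "Q2"]
files, listed = prune_phased(table, filters)
```

In this toy layout the phased walk lists 6 names while a full walk lists 12; the gap would presumably grow with the number of directories that an early-level filter eliminates, which is the memory and elapsed-time concern raised above.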
