Yes, that is a good point about the multiple invocations of the PruneScan rule. The other point, about using Java heap, is not correct. The rule allocates off-heap, using a memory buffer from the QueryContext, and releases that memory in the finally block.
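To illustrate the allocate-then-release-in-finally pattern mentioned above, here is a minimal, self-contained sketch. It is not the PruneScanRule code: TrackingAllocator and Buffer are hypothetical stand-ins that only count outstanding bytes, and they do not reproduce the QueryContext or Drill buffer APIs. The sketch only shows why repeated invocations, even ones that fail, should not accumulate memory when the release sits in a finally block.

```java
import java.util.concurrent.atomic.AtomicLong;

public class OffHeapReleaseSketch {

    /** Hypothetical stand-in for an off-heap allocator; it only tracks outstanding bytes. */
    static final class TrackingAllocator {
        private final AtomicLong outstanding = new AtomicLong();

        Buffer allocate(int bytes) {
            outstanding.addAndGet(bytes);
            return new Buffer(this, bytes);
        }

        long outstandingBytes() {
            return outstanding.get();
        }
    }

    /** Hypothetical buffer handle; release() gives its bytes back exactly once. */
    static final class Buffer {
        private final TrackingAllocator owner;
        private final int bytes;
        private boolean released;

        Buffer(TrackingAllocator owner, int bytes) {
            this.owner = owner;
            this.bytes = bytes;
        }

        void release() {
            if (!released) {
                released = true;
                owner.outstanding.addAndGet(-bytes);
            }
        }
    }

    /** Simulates one rule invocation: allocate, evaluate, and always release. */
    static void runRuleOnce(TrackingAllocator allocator, int bytes, boolean fail) {
        Buffer buf = allocator.allocate(bytes);
        try {
            if (fail) {
                throw new IllegalStateException("simulated evaluation failure");
            }
            // ... the filter expression would be evaluated against the buffer here ...
        } finally {
            buf.release(); // runs on both the normal and the exceptional path
        }
    }

    public static void main(String[] args) {
        TrackingAllocator allocator = new TrackingAllocator();
        for (int i = 0; i < 6; i++) {
            runRuleOnce(allocator, 1024, false);
        }
        try {
            runRuleOnce(allocator, 1024, true);
        } catch (IllegalStateException expected) {
            // the buffer was still released by the finally block
        }
        System.out.println("outstanding bytes: " + allocator.outstandingBytes());
    }
}
```

With the release in finally, the outstanding byte count returns to zero after every invocation, so firing the rule several times raises peak usage per call but does not leak.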
Aman

On Thu, Sep 10, 2015 at 6:18 PM, Jinfeng Ni <[email protected]> wrote:

> I opened DRILL-3765 for the multiple rule execution issue:
>
> https://issues.apache.org/jira/browse/DRILL-3765
>
> On Thu, Sep 10, 2015 at 5:34 PM, Jinfeng Ni <[email protected]> wrote:
> > Seems to me one important reason we hit out of heap memory for the partition
> > prune rule is that the rule itself is invoked multiple times, even though the
> > filter has been pushed into the scan in the first call.
> >
> > I tried with a simple unit test,
> > TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is the
> > number of times each partition rule fired in the Calcite trace:
> >
> > #_rule_fire, rule name
> >
> > 4 [PruneScanRule:Filter_On_Project_Parquet]
> > 4 [PruneScanRule:Filter_On_Project]
> >
> > 2 [PruneScanRule:Filter_On_Scan_Parquet]
> > 2 [PruneScanRule:Filter_On_Scan]
> >
> > Setting a breakpoint in PruneScanRule where it calls the interpreter to
> > evaluate the expression, I could see that the code stops 6 times at that
> > point, meaning that Drill will have to build the vector containing the
> > filenames at least 6 times. That would cause lots of heap memory
> > consumption if GC does not kick in to release the memory used in the prior
> > rule's execution.
> >
> > I think making the partition pruning multiple phases will help to reduce the
> > memory consumption. But for now, it seems important to avoid the repeated
> > and unnecessary rule execution.
> >
> > On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <[email protected]> wrote:
> >>
> >> Agree on the N phased approach. I have filed a JIRA for the enhancement:
> >> DRILL-3759.
> >> Regarding the simplification of the expression tree logic..did you mean the
> >> logic in FindPartitionConditions or the Interpreter ?
> >> Perhaps you can add comments in the JIRA with some explanation. I am in
> >> favor of simplification where possible.
> >>
> >> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <[email protected]> wrote:
> >>
> >> > Makes sense.
> >> >
> >> > Is there a way we can do this with lazy materializations rather than writing
> >> > complex expression tree logic? I hate having all this custom expression
> >> > tree manipulation logic.
> >> >
> >> > Also, it seems like this should be N phased rather than two phase, where N
> >> > is the number of directories below the base path.
> >> >
> >> > Thoughts?
> >> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <[email protected]> wrote:
> >> >
> >> > > Currently, partition pruning gets all file names in the table and applies
> >> > > the pruning. Suppose the files are spread out over several directories and
> >> > > there is a filter on dirN; this is not efficient, both in terms of
> >> > > elapsed time and memory usage. This has been seen in a few use cases
> >> > > recently.
> >> > >
> >> > > We should ideally perform the pruning in 2 steps: first get the top-level
> >> > > directory names only and apply the directory filter, then get the filenames
> >> > > within that directory and apply remaining filters.
> >> > >
> >> > > I will create a JIRA for this enhancement but let me know your thoughts...
> >> > >
> >> > > Aman
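
For reference, the 2-step pruning proposed in the original message of this thread could be sketched roughly as follows. This is only an illustration under assumptions: the prune method, its parameters, and the directory and file names are all hypothetical, and nothing here reflects how PruneScanRule or DRILL-3759 is actually implemented. The point it shows is that files are listed only for directories that survive the directory filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

public class TwoStepPruningSketch {

    /**
     * Step 1: apply the directory filter to top-level directory names only.
     * Step 2: list files only inside surviving directories and apply the
     * remaining (file-level) filter. All names here are hypothetical.
     */
    static List<String> prune(List<String> topLevelDirs,
                              Predicate<String> dirFilter,
                              Function<String, List<String>> listFiles,
                              Predicate<String> fileFilter) {
        List<String> selected = new ArrayList<>();
        for (String dir : topLevelDirs) {
            if (!dirFilter.test(dir)) {
                continue; // pruned directory: its files are never listed
            }
            for (String file : listFiles.apply(dir)) {
                if (fileFilter.test(file)) {
                    selected.add(dir + "/" + file);
                }
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up table layout, used purely for illustration.
        Map<String, List<String>> layout = new HashMap<>();
        layout.put("1994", Arrays.asList("a.parquet", "b.parquet"));
        layout.put("1995", Arrays.asList("a.parquet", "b.parquet"));
        layout.put("1996", Arrays.asList("a.parquet", "b.parquet"));

        List<String> listedDirs = new ArrayList<>();
        List<String> result = prune(
                Arrays.asList("1994", "1995", "1996"),
                dir -> dir.equals("1995"),
                dir -> {
                    listedDirs.add(dir); // record which directories were expanded
                    return layout.get(dir);
                },
                file -> file.startsWith("a"));

        System.out.println("selected files: " + result);
        System.out.println("directories listed: " + listedDirs);
    }
}
```

The N phased variant suggested later in the thread would repeat the first step once per directory level instead of stopping after the top level.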
