Seems to me one important reason we run out of heap memory in the partition pruning rule is that the rule itself is invoked multiple times, even though the filter has already been pushed into the scan by the first invocation.
I tried with a simple unit test, TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is how often the partition pruning rules fire in the Calcite trace:

  #_rule_fire   rule name
  4             PruneScanRule:Filter_On_Project_Parquet
  4             PruneScanRule:Filter_On_Project
  2             PruneScanRule:Filter_On_Scan_Parquet
  2             PruneScanRule:Filter_On_Scan

Setting a breakpoint in PruneScanRule at the point where it calls the interpreter to evaluate the expression, I could see the code stop there 6 times, meaning Drill has to build the vector containing the filenames at least 6 times. That causes a lot of heap memory consumption if GC does not kick in to release the memory used by the earlier rule executions.

I think making partition pruning multi-phase will help reduce the memory consumption, but for now it seems important to avoid the repeated and unnecessary rule executions (a rough sketch of the kind of guard I have in mind is at the bottom of this mail).

On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <[email protected]> wrote:

> Agree on the N-phased approach. I have filed a JIRA for the enhancement:
> DRILL-3759.
> Regarding the simplification of the expression tree logic... did you mean
> the logic in FindPartitionConditions or the Interpreter?
> Perhaps you can add comments in the JIRA with some explanation. I am in
> favor of simplification where possible.
>
> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <[email protected]> wrote:
>
> > Makes sense.
> >
> > Is there a way we can do this with lazy materialization rather than
> > writing complex expression tree logic? I hate having all this custom
> > expression tree manipulation logic.
> >
> > Also, it seems like this should be N-phased rather than two-phased,
> > where N is the number of directories below the base path.
> >
> > Thoughts?
> >
> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <[email protected]> wrote:
> >
> > > Currently, partition pruning gets all file names in the table and
> > > applies the pruning. Suppose the files are spread out over several
> > > directories and there is a filter on dirN; this is not efficient, both
> > > in terms of elapsed time and memory usage. This has been seen in a few
> > > use cases recently.
> > >
> > > We should ideally perform the pruning in 2 steps: first get the
> > > top-level directory names only and apply the directory filter, then get
> > > the filenames within that directory and apply the remaining filters.
> > >
> > > I will create a JIRA for this enhancement but let me know your
> > > thoughts...
> > >
> > > Aman
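
P.S. For illustration only, here is a minimal sketch of the kind of guard I have in mind. It is written against the generic Calcite rule API rather than the actual Drill planner classes, and GuardedPruneScanRule and PrunedScanMarker are hypothetical names made up for this sketch; a real fix would presumably live in PruneScanRule itself.

    import org.apache.calcite.plan.RelOptRule;
    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rel.core.Filter;
    import org.apache.calcite.rel.core.TableScan;

    public abstract class GuardedPruneScanRule extends RelOptRule {

      /** Hypothetical marker implemented by a scan whose partitions were already pruned. */
      public interface PrunedScanMarker { }

      protected GuardedPruneScanRule(String description) {
        super(operand(Filter.class, operand(TableScan.class, none())), description);
      }

      @Override
      public boolean matches(RelOptRuleCall call) {
        final TableScan scan = call.rel(1);
        // Bail out cheaply: if an earlier firing already pruned this scan,
        // skip the interpreter-based evaluation entirely instead of
        // rebuilding the filename vectors a second (or sixth) time.
        return !(scan instanceof PrunedScanMarker);
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        // Build the partition/filename value vectors once, evaluate the filter
        // with the interpreter, and transform to a new scan that implements
        // PrunedScanMarker so later firings are rejected in matches().
      }
    }

Whether such a marker lives on the scan rel, the group scan, or planner metadata is an implementation detail; the point is just to make the second and later firings no-ops before any vectors are allocated.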
