Seems to me one important reason we run out of heap memory in the partition pruning rule is that the rule itself is invoked multiple times, even though the filter has already been pushed into the scan by the first invocation.
I tried with a simple unit test, TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is how often the partition pruning rules fire in the Calcite trace:

  #_rule_fire   rule name
  4             PruneScanRule:Filter_On_Project_Parquet
  4             PruneScanRule:Filter_On_Project
  2             PruneScanRule:Filter_On_Scan_Parquet
  2             PruneScanRule:Filter_On_Scan

Setting a breakpoint in PruneScanRule at the point where it calls the interpreter to evaluate the expression, I could see the code stop there 6 times, meaning Drill has to build the vector containing the filenames at least 6 times. That causes a lot of heap memory consumption if GC does not kick in to release the memory used by the earlier rule executions.

I think making partition pruning multi-phase will help reduce the memory consumption, but for now it seems important to avoid the repeated and unnecessary rule executions (a rough sketch of the kind of guard I have in mind is at the bottom of this mail).

On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <[email protected]> wrote:

> Agree on the N-phased approach. I have filed a JIRA for the enhancement:
> DRILL-3759.
> Regarding the simplification of the expression tree logic... did you mean
> the logic in FindPartitionConditions or the Interpreter?
> Perhaps you can add comments in the JIRA with some explanation. I am in
> favor of simplification where possible.
>
> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <[email protected]> wrote:
>
> > Makes sense.
> >
> > Is there a way we can do this with lazy materialization rather than
> > writing complex expression tree logic? I hate having all this custom
> > expression tree manipulation logic.
> >
> > Also, it seems like this should be N-phased rather than two-phased,
> > where N is the number of directories below the base path.
> >
> > Thoughts?
> >
> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <[email protected]> wrote:
> >
> > > Currently, partition pruning gets all file names in the table and
> > > applies the pruning. Suppose the files are spread out over several
> > > directories and there is a filter on dirN; this is not efficient, both
> > > in terms of elapsed time and memory usage. This has been seen in a few
> > > use cases recently.
> > >
> > > We should ideally perform the pruning in 2 steps: first get the
> > > top-level directory names only and apply the directory filter, then get
> > > the filenames within that directory and apply the remaining filters.
> > >
> > > I will create a JIRA for this enhancement but let me know your
> > > thoughts...
> > >
> > > Aman
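
P.S. For illustration only, here is a minimal sketch of the kind of guard I have in mind. It is written against the generic Calcite rule API rather than the actual Drill planner classes, and GuardedPruneScanRule and PrunedScanMarker are hypothetical names made up for this sketch; a real fix would presumably live in PruneScanRule itself.

    import org.apache.calcite.plan.RelOptRule;
    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rel.core.Filter;
    import org.apache.calcite.rel.core.TableScan;

    public abstract class GuardedPruneScanRule extends RelOptRule {

      /** Hypothetical marker implemented by a scan whose partitions were already pruned. */
      public interface PrunedScanMarker { }

      protected GuardedPruneScanRule(String description) {
        super(operand(Filter.class, operand(TableScan.class, none())), description);
      }

      @Override
      public boolean matches(RelOptRuleCall call) {
        final TableScan scan = call.rel(1);
        // Bail out cheaply: if an earlier firing already pruned this scan,
        // skip the interpreter-based evaluation entirely instead of
        // rebuilding the filename vectors a second (or sixth) time.
        return !(scan instanceof PrunedScanMarker);
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        // Build the partition/filename value vectors once, evaluate the filter
        // with the interpreter, and transform to a new scan that implements
        // PrunedScanMarker so later firings are rejected in matches().
      }
    }

Whether such a marker lives on the scan rel, the group scan, or planner metadata is an implementation detail; the point is just to make the second and later firings no-ops before any vectors are allocated.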
