Re: HCatalog scans all partition even after mentioning date filter

Aniket Mokashi Wed, 25 Apr 2012 13:48:09 -0700

Hi Dmitriy and Thejas,

Should I open a jira for the same?


Thanks,
Aniket


On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Yeah I think we just need to get projection pushdown to work through
> Split operators.
>
> D
>
> On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair <[email protected]> wrote:
> > cc'ing dev@pig as this is a pig issue.
> >
> > Aniket, what you saw is not related to PIG-2339.
> >
> > In your example query, the logical plan will look like this -
> >
> >            Load (A)
> >               |
> >             Split
> >               |
> >      -------------------
> >      |            |
> >  Filter(B1)   Filter(B2)   ...
> >
> > Because of the split operator introduced between the filter conditions
> > and the load, the filter does not get pushed into the load function.
> >
> > A simple way to fix this in pig would be to not share the load across the
> > filter operators. Another option is to push the condition (B1 or B2 or B3)
> > into the Load operator and retain the rest of the current plan (split and
> > filters following the split).
> >
> > You can of course achieve the same effect by having a separate load
> > statement as input for each of the filters.
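> >
> > A minimal, untested sketch of that workaround, assuming a hypothetical
> > table 'mytable' partitioned by d and the HCatLoader class shipped with
> > HCatalog 0.4 (the table name and filter values are illustrative only):
> >
> >     -- One load per filter, so no Split sits between Load and Filter.
> >     A1 = load 'mytable' using org.apache.hcatalog.pig.HCatLoader();
> >     B1 = filter A1 by d == '20111215';
> >
> >     A2 = load 'mytable' using org.apache.hcatalog.pig.HCatLoader();
> >     B2 = filter A2 by d == '20120415';
> >
> >     -- Combine the separately filtered inputs.
> >     B = union B1, B2;
> >
> > Each filter then follows its own load directly, which is the shape the
> > partition filter optimizer needs in order to push the condition into
> > the loader.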
> >
> > I agree that we should make it possible to ask pig to throw a
> > warning/error if the query is going to result in a full table scan on a
> > partitioned table.
> >
> > Thanks,
> > Thejas
> >
> >
> >
> >
> > On 4/24/12 7:56 PM, Aniket Mokashi wrote:
> >>
> >> Sorry Thejas, I didn't look into the jira properly earlier.
> >> EMR pig-0.9.1 already has the patch for PIG-2339, and hence I did not
> >> hit that issue earlier (and I patched datanucleus). filter-union was a
> >> workaround I was using to avoid some of the thrift timeout problems
> >> earlier. Thrift APIs time out on the client side in 20 sec by default
> >> (I found the config to change this later), and hence I used a = load
> >> 'table'; b1 = filter a by cond1; b2 = filter a by cond2; ... b = union
> >> b1, b2, ...; expecting these filters to be pushed separately to the
> >> loader. But that doesn't work in pig. (I can open a jira, but I haven't
> >> done enough investigation at the code level.) Thoughts?
> >>
> >> Thanks,
> >> Aniket
> >>
> >> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <[email protected]> wrote:
> >>
> >>    The issue was not specific to filter-union
> >>    - https://issues.apache.org/jira/browse/PIG-2339.
> >>    The fix was to run PushUpFilter before PartitionFilterOptimizer.
> >>
> >>    As this is not an hcat issue, it should not matter if you have an
> >>    older hcat version. FYI, this bug was not there in pig 0.8.x.
> >>    Was it pig 0.9.0 or 0.9.1 that you used?
> >>
> >>    Thanks,
> >>    Thejas
> >>
> >>
> >>
> >>    On 4/24/12 5:21 PM, Aniket Mokashi wrote:
> >>
> >>        Hi Thejas,
> >>
> >>        Can you point me to the jira that fixes the filter-union problem
> >>        (in pig)? I haven't tried hcat-0.4 yet; good to know about that
> >>        issue. I will add myself as a watcher.
> >>
> >>        Thanks,
> >>        Aniket
> >>
> >>        On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair
> >>        <[email protected]> wrote:
> >>
> >>            Hi Aniket,
> >>            Are you using pig 0.9 or 0.9.1?
> >>            If yes, can you try with pig 0.9.2?
> >>            Wondering if you are also hitting the issue that Thomas
> >>            mentioned.
> >>
> >>            Thanks,
> >>            Thejas
> >>
> >>
> >>
> >>
> >>            On 4/23/12 7:39 PM, Aniket Mokashi wrote:
> >>
> >>                Something similar I have noticed is -
> >>
> >>                A = load ...
> >>                B1 = filter A by cond1;
> >>                B2 = filter A by cond2;
> >>                B3 = filter A by cond3;
> >>
> >>                B = union B1, B2, B3; does not push projection.
> >>
> >>                Is that expected?
> >>
> >>                Ideally, we should have a "strict" mode under hcatalog
> >>                that, when turned on, will avoid executing pig queries
> >>                on the full (partitioned) table.
> >>
> >>                Thanks,
> >>                Aniket
> >>
> >>                On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan
> >>                <[email protected]> wrote:
> >>
> >>                    Hi Alan,
> >>
> >>                    Thanks for the quick response.
> >>
> >>                    I am using HCatalog 0.4.
> >>
> >>                    With a simple PIG script it works great. HCatalog
> >>                    beautifully scans only the relevant information.
> >>                    However, a full scan happens only when we have a
> >>                    couple of additional joins and when we change the
> >>                    INNER JOIN order (we also use "using skewed").
> >>
> >>                    Though we have looked into the debug logs, we saw the
> >>                    number of records scanned from the JobTracker's
> >>                    counters themselves. Without pruning, the m/r job was
> >>                    pretty much scanning the entire set of rows.
> >>
> >>                    I am not sure if there is a corner case wherein the
> >>                    "skewed" join is trying to override the filtering.
> >>
> >>                    ~Rajesh.B
> >>
> >>
> >>
> >>                    On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates
> >>                    <[email protected]> wrote:
> >>
> >>                        What version of HCatalog are you using?  How do
> >>                        you know it is scanning all the partitions? Does
> >>                        it say so in the logs, or are you getting all the
> >>                        records back?
> >>
> >>                        And yes, HCat is supposed to do partition pruning
> >>                        so that it only scans the required partitions.
> >>
> >>                        Alan.
> >>
> >>                        On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
> >>
> >>         > Hi All,
> >>         >
> >>         > I have a hcatalog table "partitioned by (d string)".
> >>         >
> >>         > I have a couple of days' worth of data, and when I run "show
> >>         > partitions" it provides the correct data.
> >>         >
> >>         > d=20111215
> >>         > d=20111216
> >>         > d=20111217
> >>         > d=20111218
> >>         > d=20111219
> >>         > d=20111220
> >>         > d=20111221
> >>         > d=20111222
> >>         > d=20111223
> >>         > d=20111224
> >>         > d=20111225
> >>         > d=20120415
> >>         >
> >>         > However, when I run PIG with "filter a by d == '20120415'",
> >>         > it ends up scanning all data.
> >>         >
> >>         > Is this a known bug/enhancement in HCatalog? Ideally,
> >>         > shouldn't it scan only the d=20120415 directory?
> >>         >
> >>         > Any pointers would be of great help.
> >>         >
> >>         >
> >>         > --
> >>         > ~Rajesh.B
> >>
> >>
> >>
> >>
> >>                    --
> >>                    ~Rajesh.B
> >>
> >>
> >>
> >>
> >>                --
> >>        "...:::Aniket:::... Quetzalco@tl"
> >>
> >>
> >>
> >>
> >>
> >>        --
> >>        "...:::Aniket:::... Quetzalco@tl"
> >>
> >>
> >>
> >>
> >>
> >> --
> >> "...:::Aniket:::... Quetzalco@tl"
> >
> >
>



-- 
"...:::Aniket:::... Quetzalco@tl"
