Hi Dmitriy and Thejas,

Should I open a jira for the same?

Thanks,
Aniket

On Wed, Apr 25, 2012 at 1:45 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Yeah I think we just need to get projection pushdown to work through
> Split operators.
>
> D
>
> On Wed, Apr 25, 2012 at 12:52 PM, Thejas Nair <[email protected]> wrote:
> > cc'ing dev@pig as this is a pig issue.
> >
> > Aniket, What you saw is not related to PIG-2339.
> >
> > In your example query, the logical plan will look like this -
> >
> >                Load (A)
> >                   |
> >                 Split
> >                   |
> >      ---------------------------
> >      |                         |
> >  Filter(B1)               Filter(B2) ...
> >
> > Because of the split operator introduced between the filter conditions and
> > load, the filter does not get pushed into the load function.
> >
> > A simple way to fix this in pig would be to not share the load across the
> > filter operators. Another option is to push the condition (B1 or B2 or B3)
> > into Load operator and retain rest of the current plan (split and filters
> > following the split).
> >
> > You can of course achieve the same effect by having a separate load
> > statement as input for each of the filters.
> >
> > I agree that we should make it possible to ask pig to throw a warning/error
> > if the query is going to result in a full table scan on a partitioned table.
> >
> > Thanks,
> > Thejas
> >
> > On 4/24/12 7:56 PM, Aniket Mokashi wrote:
> >>
> >> Sorry Thejas, I didn't look into the jira properly earlier.
> >> EMR pig-0.9.1 already has that patch for PIG-2339 and hence I did not
> >> hit that issue earlier (and I patched datanucleus). filter-union was a
> >> workaround I was using to avoid some of the thrift timeout problems
> >> earlier. Thrift api's timeout on client side in 20sec by default (I
> >> found the config to change this later) and I hence used a = load
> >> 'table'; b1= filter by cond1; b2=filter by cond2;.. b= union b1, b2..;
> >> to expect to push these filters separately to the loader. But, that
> >> doesn't work in pig.
> >> (I can open a jira, but I haven't done enough
> >> investigation at the code level). Thoughts?
> >>
> >> Thanks,
> >> Aniket
> >>
> >> On Tue, Apr 24, 2012 at 7:00 PM, Thejas Nair <[email protected]> wrote:
> >>
> >> The issue was not specific to filter-union -
> >> https://issues.apache.org/jira/browse/PIG-2339.
> >> The fix was to do filter PushUpFilter before PartitionFilterOptimizer.
> >>
> >> As this is not a hcat issue, it should not matter if you have an
> >> older hcat version. fyi, this bug was not there in pig 0.8.x.
> >> Was it pig 0.9.0 or 0.9.1 that you used?
> >>
> >> Thanks,
> >> Thejas
> >>
> >> On 4/24/12 5:21 PM, Aniket Mokashi wrote:
> >>
> >> Hi Thejas,
> >>
> >> Can you point me to jira that fixes filter-union problem (in pig)?
> >> I haven't tried hcat-0.4 yet, good to know about that issue. I
> >> will keep a watcher.
> >>
> >> Thanks,
> >> Aniket
> >>
> >> On Tue, Apr 24, 2012 at 4:51 PM, Thejas Nair <[email protected]> wrote:
> >>
> >> Hi Aniket,
> >> Are you using pig 0.9 or 0.9.1?
> >> If yes, can you try with pig 0.9.2?
> >> Wondering if you are also hitting the issue that Thomas
> >> mentioned.
> >>
> >> Thanks,
> >> Thejas
> >>
> >> On 4/23/12 7:39 PM, Aniket Mokashi wrote:
> >>
> >> Something similar I have noticed is -
> >>
> >> A = load ...
> >> B1 = filter A by cond1;
> >> B2 = filter A by cond2;
> >> B3 = filter A by cond3;
> >>
> >> B = union B1, B2, B3; does not push projection.
> >>
> >> Is that expected?
> >>
> >> Ideally, we should have "strict" mode under hcatalog, that when
> >> turned on will avoid executing pig queries on the full
> >> (partitioned) table.
> >>
> >> Thanks,
> >> Aniket
> >>
> >> On Mon, Apr 23, 2012 at 7:32 PM, Rajesh Balamohan <[email protected]> wrote:
> >>
> >> Hi Alan,
> >>
> >> Thanks for the quick response.
> >>
> >> I am using HCatalog 0.4.
> >>
> >> With simple PIG script it works great. HCatalog beautifully scans
> >> only the relevant information. However, full scan happens only when
> >> we have couple of additional joins and when we change the INNER JOIN
> >> order (we also use "using skewed").
> >>
> >> Though we have looked into the debug logs, we saw the scanning of
> >> number of records from the JobTracker's counters itself. Without
> >> pruning, the m/r job was pretty much scanning the entire set of rows.
> >>
> >> I am not sure if there is a corner case, where in "skewed" join is
> >> trying to override the filtering.
> >>
> >> ~Rajesh.B
> >>
> >> On Tue, Apr 24, 2012 at 2:13 AM, Alan Gates <[email protected]> wrote:
> >>
> >> What version of HCatalog are you using? How do you know it is
> >> scanning all the partitions, does it say so in the logs, or are
> >> you getting all the records back?
> >>
> >> And yes, HCat is supposed to do partition pruning so that it
> >> only scans the required partitions.
> >>
> >> Alan.
> >>
> >> On Apr 21, 2012, at 8:27 PM, Rajesh Balamohan wrote:
> >>
> >> > Hi All,
> >> >
> >> > I have a hcatalog table "partitioned by (d string)".
> >> >
> >> > I have couple of days worth of data and when i run "show
> >> > partitions" it provides the correct data.
> >> >
> >> > d=20111215
> >> > d=20111216
> >> > d=20111217
> >> > d=20111218
> >> > d=20111219
> >> > d=20111220
> >> > d=20111221
> >> > d=20111222
> >> > d=20111223
> >> > d=20111224
> >> > d=20111225
> >> > d=20120415
> >> >
> >> > However, when I run PIG with "filter a by d == '20120415'",
> >> > it ends up scanning all data.
> >> >
> >> > Is this a known bug/enhancement in HCatalog? Ideally,
> >> > shouldn't it scan only the d=20120415 directory?
> >> >
> >> > Any pointers would be of great help.
> >> >
> >> > --
> >> > ~Rajesh.B
> >>
> >> --
> >> ~Rajesh.B
> >>
> >> --
> >> "...:::Aniket:::... Quetzalco@tl"

--
"...:::Aniket:::... Quetzalco@tl"
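
To make the three plan shapes discussed in this thread concrete (a shared load behind a Split, a separate load per filter, and a single load that receives the OR of the conditions), here is a rough sketch in Python. It is a toy model only, not Pig's optimizer or HCatLoader: the function names `scan`, `shared_load`, `separate_loads` and `or_pushed_into_load` are made up for illustration, and the partition list is a subset of the values from the original report.

```python
# Toy model only: this is NOT Pig's optimizer. All names are hypothetical.
from typing import Callable, List, Optional

Predicate = Callable[[str], bool]

# A few of the partition values (column d) listed in the original report.
PARTITIONS = ["20111215", "20111216", "20120415"]


def scan(partitions: List[str], pushed: Optional[Predicate] = None) -> List[str]:
    """Return the partitions a loader reads, given the predicate pushed into it."""
    if pushed is None:
        return list(partitions)  # nothing pushed down: full table scan
    return [p for p in partitions if pushed(p)]


def shared_load(partitions: List[str], conds: List[Predicate]) -> List[str]:
    """Load -> Split -> Filter(B1), Filter(B2), ...: no filter reaches the loader."""
    return scan(partitions)


def separate_loads(partitions: List[str], conds: List[Predicate]) -> List[List[str]]:
    """One load per filter: each loader receives its own predicate."""
    return [scan(partitions, c) for c in conds]


def or_pushed_into_load(partitions: List[str], conds: List[Predicate]) -> List[str]:
    """Single load that receives (B1 or B2 or ...); Split and filters stay after it."""
    return scan(partitions, lambda p: any(c(p) for c in conds))


conds = [lambda d: d == "20120415", lambda d: d == "20111215"]
print(shared_load(PARTITIONS, conds))          # every partition is read
print(separate_loads(PARTITIONS, conds))       # one pruned scan per filter
print(or_pushed_into_load(PARTITIONS, conds))  # one scan, pruned by the OR
```

In this model the shared-load plan reads every partition, while the other two shapes read only the partitions that some filter actually needs. The real behaviour depends on which optimizer rules fire in the Pig version in use.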
