Casper Rasmussen
Fri, 15 Feb 2008 09:46:53 -0800
Cool, even without the 'store split', it's nice working with Pig, and my
current work is built on the fact that the storage point is the root of
the operations, so for now nothing is wasted :-)

Thanks...

On Fri, Feb 15, 2008 at 5:32 PM, Olga Natkovich <[EMAIL PROTECTED]> wrote:
> Actually, we do allow user to set job name:
>
> set job.name 'foo'.
>
> http://wiki.apache.org/pig/Grunt
>
> Olga
>
> > -----Original Message-----
> > From: Alan Gates [EMAIL PROTECTED]
> > Sent: Friday, February 15, 2008 8:05 AM
> > To: pig-user@incubator.apache.org
> > Subject: Re: Storage split question, load asterisks,
> > userdefined job names
> >
> > Casper Rasmussen wrote:
> > > Hi
> > >
> > > First of all i'm using an old version of pig, the one that ran on
> > > hadoop 12.1, and yes i will upgrade soon...
> > >
> > > Following I have some requests/questions, based on the use of Pig
> > > so far:
> > >
> > > 1: If you have 1 billion files (purposely exaggerating) where apx
> > > 50 % of the files are related to one segment and 50 % to another
> > > segment, then i guess the pig script for isolating the segments
> > > would be something like the following:
> > >
> > > files = LOAD 'path/to/1_billion_files' AS (segment);
> > > segmentA = FILTER files BY (segment='a');
> > > segmentB = FILTER files BY (segment='b');
> > >
> > > STORE segmentA into 'segmentA.dat';
> > > STORE segmentB into 'segmentB.dat';
> > >
> > > So the question is, are all 1 billion files filtered and read
> > > twice? If so (guess it is), would it be possible to do something
> > > like this (just to avoid the overhead of 1 billion reads):
> > >
> > > STORE SPLIT segmentA into 'segmentA.dat', segmentB into
> > > 'segmentB.dat';
> > >
> > Yes, currently all 1B files are read and filtered twice. No, your
> > split suggestion won't work, yet. Right now pig views all jobs as a
> > tree of operations, with a given store (or dump) command as a root.
> > To do what you want we need to view the commands as a graph, with
> > multiple heads, which it can evaluate simultaneously. We're working
> > in that direction but it will be a while before we're there.
> >
> > > 2: Would it be possible to allow the use of asterisks in the load
> > > method of Pig.
> > >
> > > files = LOAD 'batches/*/batch/*/segments'
> > >
> > The latest versions of pig use hadoop pattern matching in their
> > files, so the above commands would work.
> >
> > > 3: Allowing Userdefined hadoop job names when executing a script,
> > > i have a feeling that this one is in the newest version, true?
> > >
> > We don't yet allow users to define their job names, but we certainly
> > have had requests to do so.
> >
> > > Appreciate any comments anyone might have, thanks :-)
> > >
> > > Br Casper
> > >
> > Alan.