I am using MapReduce on Amazon. There is another problem: how can I use two "$INPUT" parameters in a Pig script?

Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under budget.)

On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman <[email protected]> wrote:

> Just curious,
>
> What solution did you use?
>
> Sent from my iPhone
>
> On Mar 3, 2010, at 8:06 PM, Jumping <[email protected]> wrote:
>
>> Thanks all of you guys.
>>
>> Best Regards,
>> Jumping Qu
>>
>> On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:
>>
>>> In this case, why wouldn't you simply use globbing in your load
>>> statements? Something like
>>>
>>> baidu = LOAD 'input/path/*baidu*' AS (schema);
>>> google = LOAD 'input/path/*google*' AS (schema);
>>>
>>> Store baidu INTO 'output/path/baidu_all';
>>> Store google INTO 'output/path/google_all';
>>>
>>> On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
>>>
>>>> Actually I was using another loader and I just tried with PigStorage
>>>> (Pig 0.6) and it seems to work too.
>>>>
>>>> If your input file has two columns this will have the expected schema
>>>> and data:
>>>>
>>>> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
>>>>
>>>> A: {f1: chararray,f2: chararray,filename: chararray}
>>>>
>>>> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third
>>>> column will be null.
>>>>
>>>> So in practice the loader loads the data "independently" and then
>>>> "casts" it to the schema you provided. That said, I am not claiming
>>>> that it is a very clean solution.
>>>>
>>>> Thanks,
>>>>
>>>> Romain
>>>>
>>>> 2010/3/2 Mridul Muralidharan <[email protected]>
>>>>
>>>>> I am not sure if this will work as you expect.
>>>>> Depending on which implementation of PigStorage you end up using, it
>>>>> might exhibit different behavior.
>>>>>
>>>>> If I am not wrong, currently, for example, if you specify something
>>>>> like:
>>>>>
>>>>> A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray, fileName:chararray);
>>>>>
>>>>> your code will end up generating a tuple of 4 fields - the fileName
>>>>> always being 'null' and the actual filename you inserted through
>>>>> MyLoader ending up being the 4th field (and so not 'seen' by pig - not
>>>>> sure what happens if you do a join, etc. with this tuple though!
>>>>> Essentially runtime is not consistent with script schema).
>>>>>
>>>>> Note - this is an implementation specific behavior, which could
>>>>> probably have been fixed by the implementation specific hack
>>>>> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
>>>>> the last field expected].
>>>>>
>>>>> As expected, it is brittle code.
>>>>>
>>>>> From a while back, I remember facing issues with pig's implicit
>>>>> conversion to/from bytearray, its implicit project which was
>>>>> introduced, insertion of nulls to extend to the schema specified (the
>>>>> above behavior), etc.
>>>>> So you would become dependent on the impl changes.
>>>>>
>>>>> I don't think BinStorage and PigStorage have been written with
>>>>> inheritance in mind ...
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> In Pig 0.6 you can extend the PigStorage and grab the name of the
>>>>>> file with something like this:
>>>>>>
>>>>>> @Override
>>>>>> public void bindTo(String fileName, BufferedPositionedInputStream is,
>>>>>>     long offset, long end) throws IOException {
>>>>>>   super.bindTo(fileName, is, offset, end);
>>>>>>
>>>>>>   // In your case match with a regexp and get the group with the
>>>>>>   // name only (e.g. google, baidu)
>>>>>>   this.fileName = fileName;
>>>>>> }
>>>>>>
>>>>>> @Override
>>>>>> public Tuple getNext() throws IOException {
>>>>>>   Tuple next = super.getNext();
>>>>>>
>>>>>>   if (next != null) {
>>>>>>     next.append(fileName);
>>>>>>   }
>>>>>>
>>>>>>   return next;
>>>>>> }
>>>>>>
>>>>>> Then you can group on the name and split on it.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Romain
>>>>>>
>>>>>> On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Can Pig recognize the names of the files it is loading? If so,
>>>>>>> how? I want to combine them according to filename.
>>>>>>>
>>>>>>> Example:
>>>>>>> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
>>>>>>> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
>>>>>>> ....
>>>>>>>
>>>>>>> Sort and combine by name, then output two files: google_all.csv,
>>>>>>> baidu_all.csv in a pig script.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Jumping Qu
>>>
>>> --
>>> Zaki Rahaman
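
For the regexp step mentioned in the bindTo comment above (keeping only the name part, e.g. google or baidu), a minimal Java sketch might look like the following. The class name SourceName, the method sourceOf, and the pattern itself are assumptions made for illustration, not part of Pig; the pattern assumes file names shaped like name_YYYY_MM_DD.csv, as in the examples in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: extracts the leading name from
// paths such as input/path/google_2009_12_21.csv.
class SourceName {
    // Assumed file name shape: <name>_<YYYY>_<MM>_<DD>.csv at the end of the path.
    private static final Pattern NAME =
        Pattern.compile("([^/_]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    // Returns the name part (e.g. "google"), or null if the path does not match.
    public static String sourceOf(String path) {
        Matcher m = NAME.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(sourceOf("input/path/google_2009_12_21.csv"));
        System.out.println(sourceOf("input/path/baidu_2010_02_03.csv"));
    }
}
```

Inside the custom loader, bindTo could then store SourceName.sourceOf(fileName) instead of the full path. Names that themselves contain underscores would need a different pattern.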
