Or you can just call the script twice, first with:

  $INPUT='input/path/*baidu*'
  $OUTPUT='output/path/baidu_all'

then with:

  $INPUT='input/path/*google*'
  $OUTPUT='output/path/google_all'

Thanks,

Romain

On Wed, Mar 3, 2010 at 5:58 PM, Zaki Rahaman wrote:

> Even if you're using amazon elastic mapreduce you can specify additional
> named parameters when running scripts. You can specify variable placeholders
> in your script and then pass them on the console. Or specify defaults. Or
> you can always run your scripts in interactive mode so you have complete
> control over execution. And you can always hardcode when all else fails
>
> Sent from my iPhone
>
> On Mar 3, 2010, at 8:45 PM, Jumping wrote:
>
>> I am using MapReduce on Amazon, and there is another problem: how to
>> use two "$INPUT" parameters in a pig script.
>>
>> Best Regards,
>> Jumping Qu
>>
>> ------
>> Don't tell me how many enemies we have, but where they are!
>> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
>> budget.)
>>
>> On Thu, Mar 4, 2010 at 9:28 AM, Zaki Rahaman wrote:
>>
>>> Just curious,
>>>
>>> What solution did you use?
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 3, 2010, at 8:06 PM, Jumping wrote:
>>>
>>>> Thanks all of you guys.
>>>>
>>>> Best Regards,
>>>> Jumping Qu
>>>>
>>>> ------
>>>> Don't tell me how many enemies we have, but where they are!
>>>> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
>>>> budget.)
>>>>
>>>> On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman wrote:
>>>>
>>>>> In this case, why wouldn't you simply use globbing in your load
>>>>> statements?
>>>>> Something like
>>>>>
>>>>>   baidu = LOAD 'input/path/*baidu*' AS (schema);
>>>>>   google = LOAD 'input/path/*google*' AS (schema);
>>>>>
>>>>>   STORE baidu INTO 'output/path/baidu_all';
>>>>>   STORE google INTO 'output/path/google_all';
>>>>>
>>>>> On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux wrote:
>>>>>
>>>>>> Actually I was using another loader and I just tried with PigStorage
>>>>>> (Pig 0.6) and it seems to work too.
>>>>>>
>>>>>> If your input file has two columns this will have the expected schema
>>>>>> and data:
>>>>>>
>>>>>>   A = load 'file' USING MyLoader() AS (f1:chararray,
>>>>>>       f2:chararray, fileName:chararray);
>>>>>>
>>>>>>   A: {f1: chararray,f2: chararray,filename: chararray}
>>>>>>
>>>>>> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third
>>>>>> column will be null.
>>>>>>
>>>>>> So in practice the loader loads the data "independently" and then
>>>>>> "casts" it to the schema you provided. That said, I am not claiming
>>>>>> it is a very clean solution.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Romain
>>>>>>
>>>>>> 2010/3/2 Mridul Muralidharan
>>>>>>
>>>>>>> I am not sure if this will work as you expect.
>>>>>>> Depending on which implementation of PigStorage you end up using, it
>>>>>>> might exhibit different behavior.
>>>>>>>
>>>>>>> If I am not wrong, currently, for example, if you specify something
>>>>>>> like:
>>>>>>>
>>>>>>>   A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
>>>>>>>       fileName:chararray);
>>>>>>>
>>>>>>> your code will end up generating a tuple of 4 fields - the fileName
>>>>>>> always being 'null' and the actual filename you inserted through
>>>>>>> MyLoader ending up being the 4th field (and so not 'seen' by pig -
>>>>>>> not sure what happens if you do a join, etc with this tuple though!
>>>>>>> Essentially runtime is not consistent with script schema).
>>>>>>>
>>>>>>> Note - this is an implementation specific behavior, which could
>>>>>>> probably have been fixed by implementation specific hack
>>>>>>> "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
>>>>>>> the last field expected].
>>>>>>>
>>>>>>> As expected, it is brittle code.
>>>>>>>
>>>>>>> From a while back, I remember facing issues with pig's implicit
>>>>>>> conversion to/from bytearray, its implicit project which was
>>>>>>> introduced, insertion of null's to extend to schema specified (the
>>>>>>> above behavior), etc.
>>>>>>> So you would become dependent on the impl changes.
>>>>>>>
>>>>>>> I don't think BinStorage and PigStorage have been written with
>>>>>>> inheritance in mind ...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> In Pig 0.6 you can extend the PigStorage and grab the name of the
>>>>>>>> file with something like this:
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public void bindTo(String fileName, BufferedPositionedInputStream is,
>>>>>>>>                      long offset, long end) throws IOException {
>>>>>>>>     super.bindTo(fileName, is, offset, end);
>>>>>>>>
>>>>>>>>     // In your case match with a regexp and get the group with the
>>>>>>>>     // name only (e.g. google, baidu)
>>>>>>>>     this.fileName = fileName;
>>>>>>>>   }
>>>>>>>>
>>>>>>>>   @Override
>>>>>>>>   public Tuple getNext() throws IOException {
>>>>>>>>     Tuple next = super.getNext();
>>>>>>>>
>>>>>>>>     if (next != null) {
>>>>>>>>       next.append(fileName);
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     return next;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> Then you can group on the name and split on it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Romain
>>>>>>>>
>>>>>>>> On Mon, Mar 1, 2010 at 3:09 AM, Jumping wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Can pig recognize the names of the files it is importing? If it
>>>>>>>>> can, how? I want to combine them according to filename.
>>>>>>>>>
>>>>>>>>> Example:
>>>>>>>>> google_2009_12_21.csv, google_2010_01_21.csv,
>>>>>>>>> google_2010_02_21.csv,
>>>>>>>>> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv,
>>>>>>>>> ....
>>>>>>>>>
>>>>>>>>> Sort and combine by name, then output two files: google_all.csv,
>>>>>>>>> baidu_all.csv in a pig script.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Jumping Qu
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>> Don't tell me how many enemies we have, but where they are!
>>>>>>>>> (ADV:Perl -- It's like Java, only it lets you deliver on time and
>>>>>>>>> under budget.)
>>>>>
>>>>> --
>>>>> Zaki Rahaman
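
For the regexp step mentioned in the bindTo comment in the thread above ("match with a regexp and get the group with the name only"), here is a minimal sketch. It assumes file names shaped like the ones in the original question (google_2009_12_21.csv), optionally preceded by a directory path. The class name SourceName and the method extract are hypothetical helpers made up for illustration; they are not part of Pig.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: pulls the leading name out of a path.
final class SourceName {
    // Assumed layout: <name>_<yyyy>_<mm>_<dd>.csv, with an optional directory prefix.
    private static final Pattern FILE_PATTERN =
            Pattern.compile("(?:.*/)?([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv");

    // Returns the leading name (e.g. "google"), or null if the path does not match.
    static String extract(String path) {
        Matcher m = FILE_PATTERN.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("input/path/google_2009_12_21.csv"));
        System.out.println(extract("baidu_2010_02_03.csv"));
    }
}
```

In a loader like the one sketched in the thread, the result of extract(fileName) could be stored instead of the full path, so that grouping happens on the short name. Returning null for a non-matching path is a choice made for this sketch; a real loader may prefer to fail or to fall back to the full file name.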
