In this case, why wouldn't you simply use globbing in your load statements? Something like:

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);
STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:

> Actually I was using another loader and I just tried with PigStorage (Pig
> 0.6) and it seems to work too.
>
> If your input file has two columns this will have the expected schema and
> data:
>
> A = load 'file' USING MyLoader() AS (f1:chararray,
> f2:chararray, fileName:chararray);
>
> A: {f1: chararray,f2: chararray,filename: chararray}
>
> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
> will be null.
>
> So in practice the loader loads the data "independently" and then "casts"
> it to the schema you provided. After yes, I don't say that it is a very
> clean solution.
>
> Thanks,
>
> Romain
>
> 2010/3/2 Mridul Muralidharan <[email protected]>
>
> >
> > I am not sure if this will work as you expect.
> > Depending on which implementation of PigStorage you end up using, it
> > might exhibit different behavior.
> >
> > If I am not wrong, currently, for example, if you specify something like:
> >
> > A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> > fileName:chararray);
> >
> >
> > your code will end up generating a tuple of 4 fields - the fileName
> > always being 'null' and the actual filename you inserted through
> > MyLoader ending up being the 4th field (and so not 'seen' by pig - not
> > sure what happens if you do a join, etc with this tuple though !
> > Essentially runtime is not consistent with script schema).
> >
> >
> > Note - this is an implementation specific behavior, which could probably
> > have been fixed by implementation specific hack
> > "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
> > the last field expected].
> >
> > As expected, it is brittle code.
> >
> >
> > From a while back, I remember facing issues with pig's implicit
> > conversion to/from bytearray, its implicit project which was introduced,
> > insertion of null's to extend to schema specified (the above behavior),
> > etc.
> > So you would become dependent on the impl changes.
> >
> >
> > I dont think BinStorage and PigStorage have been written with
> > inheritance in mind ...
> >
> >
> > Regards,
> > Mridul
> >
> >
> > On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > > Hi,
> > >
> > > In Pig 0.6 you can extend the PigStorage and grab the name of the file
> > > with something like this:
> > >
> > > @Override
> > > public void bindTo(String fileName, BufferedPositionedInputStream is,
> > >         long offset, long end)
> > >         throws IOException {
> > >     super.bindTo(fileName, is, offset, end);
> > >
> > >     // In your case match with a regexp and get the group with the
> > >     // name only (e.g. google, baidu)
> > >     this.fileName = fileName;
> > > }
> > >
> > > @Override
> > > public Tuple getNext() throws IOException {
> > >     Tuple next = super.getNext();
> > >
> > >     if (next != null) {
> > >         next.append(fileName);
> > >     }
> > >
> > >     return next;
> > > }
> > >
> > > Then you can group on the name and split on it.
> > >
> > > Thanks,
> > >
> > > Romain
> > >
> > > On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
> > >
> > >> Hi,
> > >> Could pig recognize files name are importing ? If could, how to do ? I
> > >> want to combine them according filename.
> > >>
> > >> Exp:
> > >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> > >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
> > >>
> > >> Sort and combine by name, then output two files: google_all.csv,
> > >> baidu_all.csv in a pig script.
> > >>
> > >>
> > >> Best Regards,
> > >> Jumping Qu
> > >>
> > >> ------
> > >> Don't tell me how many enemies we have, but where they are!
> > >> (ADV:Perl -- It's like Java, only it lets you deliver on time and
> > >> under budget.)
> > >>
> > >
> >

--
Zaki Rahaman
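
P.S. For the regexp step mentioned in the comment inside the quoted bindTo() snippet, here is a minimal Java sketch. It assumes the file names always look like <name>_<yyyy>_<mm>_<dd>.csv, as in the examples from the original question. The class name SourceName, the method fromPath, and the pattern itself are hypothetical and not part of Pig; adjust them to your real naming scheme.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: pulls the leading name
// (e.g. "google", "baidu") out of a path such as
// "input/path/google_2009_12_21.csv".
final class SourceName {

    // Assumed layout: <letters>_<yyyy>_<mm>_<dd>.csv at the end of the path.
    private static final Pattern FILE_PATTERN =
            Pattern.compile("([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv$");

    // Returns the name part, or null when the path does not match the layout.
    static String fromPath(String path) {
        Matcher m = FILE_PATTERN.matcher(path);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromPath("input/path/google_2009_12_21.csv"));
        System.out.println(fromPath("baidu_2010_02_03.csv"));
    }
}
```

Inside a custom loader, the result of fromPath(fileName) could be stored instead of the full file name, so that a later GROUP or SPLIT works on the short name. With the globbing approach above, none of this is needed.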
