In this case, why wouldn't you simply use globbing in your load statements?
Something like

baidu = LOAD 'input/path/*baidu*' AS (schema);
google = LOAD 'input/path/*google*' AS (schema);

STORE baidu INTO 'output/path/baidu_all';
STORE google INTO 'output/path/google_all';
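
As a rough illustration of which files each glob above would pick up, here is
a small Java sketch run against the file names from the original question. It
uses java.nio's glob matcher as a stand-in; that is an assumption for
illustration only, since Pig resolves the LOAD path through Hadoop's own glob
handling, which may differ in details. The class name GlobSketch and the
helper select() are made up for this example.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig or Hadoop.
class GlobSketch {

    // Returns the names that match the glob, preserving input order.
    // NOTE: java.nio glob semantics are assumed to be close enough to the
    // globbing Pig uses for a simple pattern such as *baidu*.
    static List<String> select(String glob, List<String> names) {
        PathMatcher matcher =
            FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> selected = new ArrayList<>();
        for (String name : names) {
            if (matcher.matches(Paths.get(name))) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // File names taken from the original question.
        List<String> names = Arrays.asList(
            "google_2009_12_21.csv", "google_2010_01_21.csv",
            "google_2010_02_21.csv", "baidu_2009_11_22.csv",
            "baidu_2010_01_01.csv", "baidu_2010_02_03.csv");

        System.out.println("baidu:  " + select("*baidu*", names));
        System.out.println("google: " + select("*google*", names));
    }
}
```

Each pattern selects only the three files carrying that name, so no custom
loader is needed just to separate the two sources.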

On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:

> Actually I was using another loader and I just tried with PigStorage (Pig
> 0.6) and it seems to work too.
>
> If your input file has two columns this will have the expected schema and
> data:
>
> A = load 'file' USING MyLoader() AS (f1:chararray,
> f2:chararray, fileName:chararray);
>
> A: {f1: chararray,f2: chararray,filename: chararray}
>
> If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
> will be null.
>
> So in practice the loader loads the data "independently" and then "casts" it
> to the schema you provided. That said, I am not claiming that this is a very
> clean solution.
>
> Thanks,
>
> Romain
>
> 2010/3/2 Mridul Muralidharan <[email protected]>
>
> >
> > I am not sure if this will work as you expect.
> > Depending on which implementation of PigStorage you end up using, it
> > might exhibit different behavior.
> >
> > If I am not wrong, currently, for example, if you specify something like:
> >
> > A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> > fileName:chararray);
> >
> >
> > your code will end up generating a tuple of 4 fields - the fileName
> > always being 'null' and the actual filename you inserted through
> > MyLoader ending up being the 4th field (and so not 'seen' by pig - not
> > sure what happens if you do a join, etc with this tuple though !
> > Essentially runtime is not consistent with script schema).
> >
> >
> > Note - this is an implementation specific behavior, which could probably
> > have been fixed by implementation specific hack
> > "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
> > the last field expected].
> >
> > As expected, it is brittle code.
> >
> >
> > From a while back, I remember facing issues with pig's implicit
> > conversion to/from bytearray, its implicit project which was introduced,
> > insertion of null's to extend to schema specified (the above behavior),
> > etc.
> > So you would become dependent on the impl changes.
> >
> >
> > I don't think BinStorage and PigStorage have been written with
> > inheritance in mind ...
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >
> >
> > On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > > Hi,
> > >
> > > In Pig 0.6 you can extend the PigStorage and grab the name of the file
> > > with something like this:
> > >
> > >    @Override
> > >    public void bindTo(String fileName, BufferedPositionedInputStream is,
> > >        long offset, long end) throws IOException {
> > >      super.bindTo(fileName, is, offset, end);
> > >
> > >      // In your case match with a regexp and get the group with the
> > >      // name only (e.g. google, baidu)
> > >      this.fileName = fileName;
> > >    }
> > >
> > >    @Override
> > >    public Tuple getNext() throws IOException {
> > >      Tuple next = super.getNext();
> > >
> > >      if (next != null) {
> > >        next.append(fileName);
> > >      }
> > >
> > >      return next;
> > >    }
> > >
> > > Then you can group on the name and split on it.
> > >
> > > Thanks,
> > >
> > > Romain
> > >
> > > On Mon, Mar 1, 2010 at 3:09 AM, Jumping<[email protected]>  wrote:
> > >
> > >> Hi,
> > >> Can Pig recognize the names of the files it is importing? If so, how?
> > >> I want to combine them according to filename.
> > >>
> > >> Example:
> > >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> > >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ....
> > >>
> > >> Sort and combine by name, then output two files:  google_all.csv,
> > >> baidu_all.csv  in a pig script.
> > >>
> > >>
> > >> Best Regards,
> > >> Jumping Qu
> > >>
> > >> ------
> > >> Don't tell me how many enemies we have, but where they are!
> > >> (ADV:Perl -- It's like Java, only it lets you deliver on time and under
> > >> budget.)
> > >>
> >
> >
>



-- 
Zaki Rahaman
