Re: Could identify file name？

Jumping Wed, 03 Mar 2010 17:06:51 -0800

Thanks all of you guys.


Best Regards,
Jumping Qu

------
Don't tell me how many enemies we have, but where they are!
(ADV:Perl -- It's like Java, only it lets you deliver on time and under
budget.)


On Thu, Mar 4, 2010 at 3:12 AM, zaki rahaman <[email protected]> wrote:

> In this case, why wouldn't you simply use globbing in your load statements?
> Something like
>
> baidu = LOAD 'input/path/*baidu*' AS (schema);
> google = LOAD 'input/path/*google*' AS (schema);
>
> STORE baidu INTO 'output/path/baidu_all';
> STORE google INTO 'output/path/google_all';
>
> On Wed, Mar 3, 2010 at 1:21 PM, Romain Rigaux <[email protected]> wrote:
>
> > Actually I was using another loader and I just tried with PigStorage (Pig
> > 0.6) and it seems to work too.
> >
> > If your input file has two columns this will have the expected schema and
> > data:
> >
> > A = load 'file' USING MyLoader() AS (f1:chararray,
> > f2:chararray, fileName:chararray);
> >
> > A: {f1: chararray,f2: chararray,filename: chararray}
> >
> > If you do "tuple.set(tuple.getLength() - 1, fileName)" your third column
> > will be null.
> >
> > So in practice the loader loads the data "independently" and then "casts"
> > it to the schema you provided. That said, I am not claiming that this is a
> > very clean solution.
> >
> > Thanks,
> >
> > Romain
> >
> > 2010/3/2 Mridul Muralidharan <[email protected]>
> >
> > >
> > > I am not sure if this will work as you expect.
> > > Depending on which implementation of PigStorage you end up using, it
> > > might exhibit different behavior.
> > >
> > > If I am not wrong, currently, for example, if you specify something
> > > like:
> > >
> > > A = load 'file' USING MyLoader() AS (f1:chararray, f2:chararray,
> > > fileName:chararray);
> > >
> > >
> > > your code will end up generating a tuple of 4 fields: fileName will
> > > always be null, and the actual file name you inserted through MyLoader
> > > will end up as the 4th field, which Pig does not 'see'. I am not sure
> > > what happens if you do a join, etc. with this tuple. Essentially, the
> > > runtime is not consistent with the script schema.
> > >
> > >
> > > Note - this is implementation-specific behavior, which could probably
> > > have been fixed by an implementation-specific hack,
> > > "tuple.set(tuple.getLength() - 1, fileName)" [if you know fileName is
> > > the last field expected].
> > >
> > > As expected, it is brittle code.
> > >
> > >
> > > From a while back, I remember facing issues with Pig's implicit
> > > conversion to/from bytearray, the implicit project that was introduced,
> > > the insertion of nulls to extend tuples to the specified schema (the
> > > above behavior), etc.
> > > So you would become dependent on implementation changes.
> > >
> > >
> > > I don't think BinStorage and PigStorage were written with
> > > inheritance in mind ...
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > >
> > >
> > >
> > > On Wednesday 03 March 2010 12:28 AM, Romain Rigaux wrote:
> > > > Hi,
> > > >
> > > > In Pig 0.6 you can extend PigStorage and grab the name of the file
> > > > with something like this:
> > > >
> > > >    @Override
> > > >    public void bindTo(String fileName, BufferedPositionedInputStream is,
> > > >        long offset, long end) throws IOException {
> > > >      super.bindTo(fileName, is, offset, end);
> > > >
> > > >      // In your case, match with a regexp and get the group with the
> > > >      // name only (e.g. google, baidu).
> > > >      this.fileName = fileName;
> > > >    }
> > > >
> > > >    @Override
> > > >    public Tuple getNext() throws IOException {
> > > >      Tuple next = super.getNext();
> > > >
> > > >      if (next != null) {
> > > >        next.append(fileName);
> > > >      }
> > > >
> > > >      return next;
> > > >    }
> > > >
> > > > Then you can group on the name and split on it.
> > > >
> > > > Thanks,
> > > >
> > > > Romain
> > > >
> > > > On Mon, Mar 1, 2010 at 3:09 AM, Jumping <[email protected]> wrote:
> > > >
> > > >> Hi,
> > > > >> Can Pig recognize the names of the files it is loading? If so, how?
> > > > >> I want to combine the files according to their names.
> > > >>
> > > > >> Example:
> > > > >> google_2009_12_21.csv, google_2010_01_21.csv, google_2010_02_21.csv,
> > > > >> baidu_2009_11_22.csv, baidu_2010_01_01.csv, baidu_2010_02_03.csv, ...
> > > >>
> > > > >> I want to sort and combine them by name in a Pig script, then
> > > > >> output two files: google_all.csv and baidu_all.csv.
> > > >>
> > > >>
> > > >> Best Regards,
> > > >> Jumping Qu
> > > >>
> > > >> ------
> > > >> Don't tell me how many enemies we have, but where they are!
> > > > >> (ADV:Perl -- It's like Java, only it lets you deliver on time and
> > > > >> under budget.)
> > > >>
> > >
> > >
> >
>
>
>
> --
> Zaki Rahaman
>
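
For reference, the regexp step mentioned in the quoted bindTo() comment ("get the group with the name only") could look roughly like the sketch below. It is plain Java with no Pig dependency. The class name SourceNameExtractor, both method names, and the file-name pattern are assumptions made for illustration, based only on the example file names in this thread (google_2009_12_21.csv and so on); none of them are part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SourceNameExtractor {
    // Assumed naming convention, taken from the example files in the thread:
    // <source>_<yyyy>_<mm>_<dd>.csv, optionally preceded by a directory path.
    private static final Pattern FILE_NAME =
        Pattern.compile("(?:.*/)?([A-Za-z]+)_\\d{4}_\\d{2}_\\d{2}\\.csv");

    /** Returns the source part of the name (e.g. "google"), or null if it does not match. */
    public static String sourceOf(String fileName) {
        Matcher m = FILE_NAME.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    /** Groups file names by source; names that do not match the pattern are skipped. */
    public static Map<String, List<String>> groupBySource(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String source = sourceOf(name);
            if (source != null) {
                groups.computeIfAbsent(source, k -> new ArrayList<>()).add(name);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "input/path/google_2009_12_21.csv",
            "input/path/baidu_2009_11_22.csv",
            "input/path/google_2010_01_21.csv");
        // Prints the file names grouped under their source keys.
        System.out.println(groupBySource(names));
    }
}
```

Inside a custom loader, the value returned by sourceOf(fileName) would be what gets stored in bindTo() and appended in getNext(), so that the script can group on "google" or "baidu" instead of the full path. If the only goal is one output per source, the globbing approach quoted above avoids the custom loader entirely.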
